Container from scratch: Using cgroups to manage process resources
This first article in the "container from scratch" series describes how kernel control groups allow resources to be managed in a collection of containers.
The ability to control the resources -- CPU, memory, threads, etc. -- assigned to a group of processes is a central feature of container-based operation. At a minimum, we need to be able to prevent one container starving another of resources. More subtly, we may want to give some containers priority over others, according to the needs of the application. This article describes how to achieve this management of resources using control groups, usually abbreviated to 'cgroups'.
In order to make the basic principles clear, I will demonstrate the use of cgroups for control of memory allocation, using only command-line operations on the files in /sys/fs/cgroup.
Overview
Control groups, cgroups, are the backbone of Linux container resource management. They allow processes to be assigned to named groups, each of which has particular resource limitations. Processes can be added to groups explicitly, and we will have to do that for at least one process. However, what makes cgroups so powerful is that any threads a process creates, and any new processes it spawns, automatically become part of the same group. So, within the container, all processes will be subject to the same resource management constraints.
Many kinds of resource can be subject to regulation using cgroups. Probably the most important in container implementation are CPU share and memory. In the following demonstration I will control only memory -- mostly because it's easier to see the effects than it is with other kinds of resource.
Control groups are well documented -- man cgroups will provide a lot of detail, although it isn't necessarily easy to follow.
Note:
In the following, I use the symbol $ to denote a command that should be run by a regular user, and # for a command that needs to be run by a privileged user, typically root. Of course, you can use sudo for these commands if you prefer.
Demonstration
The following demonstration shows how to restrict memory allocated to
a process and all its sub-processes. Begin by ensuring the cgroups
support is enabled, by checking that /sys/fs/cgroup
is mounted.
$ mount | grep cgroup | grep tmpfs
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
The memory
subdirectory may exist, or may not, depending
on your Linux installation. If this directory is not present, mount
it:
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup -o memory cgroup_memory /sys/fs/cgroup/memory
Make a new memory group into which the restricted processes will be placed:
# mkdir /sys/fs/cgroup/memory/mycontainer
Note that all these operations on /sys/fs/cgroup
require
a privileged user.
Request a memory limit of 5 MB for the mycontainer
group:
# echo 5000000 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
You can check the actual limit like this:
# cat /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
4997120
Note that the actual limit is a little smaller than the one requested -- the kernel will round it down to a multiple of its page size.
Now start a new terminal session (e.g., open a new terminal window). This new session will represent the container whose resources are to be restricted. In the new session, get the shell's process ID:
$ echo $$
26900
Of course, it's highly unlikely that you will see the same process ID; substitute the real process ID in the following steps.
In the original session, add the new session's shell to the mycontainer group, by writing its process ID to cgroup.procs:
# echo 26900 > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
Check the processes in the mycontainer
group:
# cat /sys/fs/cgroup/memory/mycontainer/cgroup.procs
26900
There's just the one process, as expected. You can also check a
process's group membership using ps
:
# ps -o cgroup 26900
2:memory:/mycontainer,1:name=elogind:/1
Note that the process is in memory group mycontainer
as expected.
Although it isn't obvious in this example, when a specific process is moved into a particular control group, all threads in that process will move with it.
Now, back in the 'container' terminal, run a sub-shell:
$ bash
$
and show the new shell's process ID:
$ echo $$
27338
Again, you're unlikely to see the same process ID, so substitute yours
in the steps below. Now check the members of the group mycontainer
again:
# cat /sys/fs/cgroup/memory/mycontainer/cgroup.procs
26900
27338
Note that there are now two processes in the memory control group
-- the original
bash
shell (26900), and the sub-shell it spawned (27338).
In general, any process spawned by the original shell will be in the
same control groups -- not just for memory, but for all resources.
Back in the original session, check the memory usage of the mycontainer group of processes:
# cat /sys/fs/cgroup/memory/mycontainer/memory.usage_in_bytes
3182592
We're already using a substantial fraction of the (approximately) 5 MB limit.
Now copy the following C code to a file called memory_eater.c
.
// A trivial program to test how much memory a process can allocate using
// malloc() before it fails.
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
  {
  int block_size = 4096;
  int blocks = 1;
  while (1)
    {
    // Use long long arithmetic -- an int product would overflow at 2 GB
    long long total = (long long)block_size * blocks;
    if (malloc (block_size))
      printf ("malloc() OK %lld bytes\n", total);
    else
      printf ("malloc() failed after %lld bytes\n", total);
    blocks++;
    }
  }
This simple program just allocates memory in 4kB blocks for as long as it can. With or without a memory resource limit, it will fail eventually.
Compile the program using gcc
:
$ gcc -o memory_eater memory_eater.c
Note that if you're running this command in the shell whose memory has been
restricted, you might find gcc
very slow, or it might even fail.
In that case, use a different session, or increase the memory
limit. The output of the compiler will be an executable called
memory_eater
.
Now, in the session whose memory has been restricted, run the 'memory eater' program. It will produce a lot of output, but it will begin and end as follows:
$ ./memory_eater
malloc() OK 4096 bytes
...
malloc() OK 886104064 bytes
Killed
The program eventually failed, as we see. However, there are two important things to note here.
First, in my test, the allocations succeeded until approximately 886 MB had been allocated. The exact figure might be different on your system, but it's likely to be far higher than the 5 MB limit set in the cgroup. Second, the malloc() call always appeared to succeed (that is, it returned a non-NULL pointer), even though the process eventually got killed for lack of memory.
Both these findings are typical of Linux, when using C programs linked with
glibc
(i.e., just about every application you're likely to
encounter). The glibc
implementation of malloc()
is
optimistic: it will appear to succeed even when there is no more memory
available. The amount of memory that gets allocated before the program fails
is the result of a race between the memory_eater
program itself,
and the Linux out-of-memory monitor.
As an aside: in introductory C programming classes we usually teach
students to check the return value from malloc()
to
ensure that the allocation succeeded. You can see that this is largely
fruitless when developing for Linux -- there's actually no way to tell
simply from the malloc()
result whether the allocation
succeeded or not.
The significant point to take away from this discussion is that out-of-memory failures are often not clean -- few Linux applications will output a clear "I am out of memory" message: they will just stop. Still, the 5 MB memory limit imposed by cgroups is being respected -- it's just not clear in the behaviour of the test program.
You might find that some applications behave completely differently in conditions of memory starvation: they might block, for example. If I try to run LibreOffice from the shell with the memory restriction, it just hangs. It will resume if the limit is raised sufficiently. In addition, if you're running tests of this kind on a desktop system, be aware that running a command does not necessarily do all the work in the same session. For example, if I run:
$ firefox
it might succeed, despite the memory limitation. That's because the
firefox
invocation is just signalling an existing instance of
Firefox to open a new window or tab. That existing instance is
not subject to the memory limitation.
Other resources
Similar considerations to those above apply to the regulation of
allocated CPU resources.
However, there are a number of subtleties here, and it's a lot
harder to demonstrate CPU control with simple command-line tests than
it is for memory. You can control how
many CPU cores are assigned to threads, what fraction of CPU time
is allocated, and so on. It is also possible, using different controllers, to restrict the number of threads or processes that a particular process can spawn -- as we can do with ulimit, but with more fine-grained control.
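For completeness, here is a sketch of the corresponding command-line steps for the cgroup v1 cpu controller, following the same pattern as the memory demonstration (the group name and the process ID 26900 are carried over from the example above, and these commands need a privileged user):

```shell
# Mount the cpu controller (if it isn't already mounted), create a
# group, and give it half the default scheduling weight of 1024.
mkdir /sys/fs/cgroup/cpu
mount -t cgroup -o cpu cgroup_cpu /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/cpu/mycontainer
echo 512 > /sys/fs/cgroup/cpu/mycontainer/cpu.shares
echo 26900 > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs
# The pids controller works similarly, limiting process counts
# through its pids.max file.
```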
Summary
The foregoing demonstration illustrated how easy it is to manage the resources assigned to particular processes, which is a key feature of container management.
In practice, administrators probably won't manage control groups by manipulating the files in the /sys/fs/cgroup directory -- there are many tools and libraries that simplify the administrative work.
However, there's no better way to understand the implementation than
to work at this low level.
The
next article in this series
demonstrates how to use chroot to isolate the container's filesystem from the filesystem of the host.