Container from scratch: Using cgroups to manage process resources

This first article in the "container from scratch" series describes how kernel control groups allow resources to be managed in a collection of containers.

The ability to control the resources -- CPU, memory, threads, etc. -- assigned to a group of processes is a central feature of container-based operation. As a minimum, we need to be able to prevent one container starving another of resources. More subtly, we may want to give some containers priority over others, according to the needs of the application. This article describes how to achieve this management of resources using control groups, usually abbreviated to 'cgroups'.

To make the basic principles clear, I will demonstrate the use of cgroups for control of memory allocation, using only command-line operations on the files in /sys/fs/cgroup.

Overview

Control groups, cgroups, are the backbone of Linux container resource management. They allow processes to be assigned to named groups, each of which has particular resource limitations. Processes can be added to groups explicitly, and we will have to do that for at least one process. However, what makes cgroups so powerful is that any threads a process creates, and any new processes it spawns, automatically become part of the same group. So, within the container, all processes will be subject to the same resource management constraints.

Many kinds of resource can be subject to regulation using cgroups. Probably the most important in container implementation are CPU share and memory. In the following demonstration I will control only memory -- mostly because it's easier to see the effects than it is with other kinds of resource.

Control groups are well documented -- man cgroups will provide a lot of detail, although it isn't necessarily easy to follow.

Note:
In the following, I use the symbol $ to denote a command that should be run by a regular user, and # for a command that needs to be run by a privileged user, typically root. Of course, you can use sudo for these commands if you prefer.

Demonstration

The following demonstration shows how to restrict the memory allocated to a process and all its sub-processes. Begin by ensuring that cgroups support is enabled, by checking that /sys/fs/cgroup is mounted.

$ mount | grep cgroup | grep tmpfs
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
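
The file names used in the rest of this demonstration belong to the layout shown above, in which /sys/fs/cgroup is a tmpfs holding one subdirectory per controller. Systems that use the newer unified hierarchy mount a filesystem of type cgroup2fs there instead, and those file names will not be present. The following is a minimal sketch for telling the two apart, assuming GNU stat; describe_cgroup_fs and cgroup_layout are hypothetical helper names, not standard tools.

```shell
# Map a filesystem type name, as printed by GNU "stat -f -c %T",
# to the cgroup layout it implies.
describe_cgroup_fs() {
  case "$1" in
    tmpfs)     echo "v1" ;;       # per-controller subdirectories, as in this article
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    *)         echo "unknown" ;;
  esac
}

# Look up the filesystem type of the mount point (default /sys/fs/cgroup)
# and describe it.
cgroup_layout() {
  dir="${1:-/sys/fs/cgroup}"
  describe_cgroup_fs "$(stat -f -c %T "$dir" 2>/dev/null)"
}
```

Running cgroup_layout with no arguments on the system used for this article should print v1.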

The memory subdirectory may exist, or may not, depending on your Linux installation. If this directory is not present, create it and mount the memory controller on it:

# mkdir /sys/fs/cgroup/memory
# mount -t cgroup -o memory cgroup_memory /sys/fs/cgroup/memory

Make a new memory group into which the restricted processes will be placed:

# mkdir /sys/fs/cgroup/memory/mycontainer

Note that all these operations on /sys/fs/cgroup require a privileged user.

Request a memory limit of 5Mb for the mycontainer group:

# echo 5000000 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes

You can check the actual limit like this:

# cat /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
4997120

Note that the actual limit is a little smaller than the one requested -- the kernel will round it down to a multiple of its page size.
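
The rounding can be reproduced with shell arithmetic. This sketch assumes a page size of 4096 bytes, which is consistent with the figure reported above; round_down_to_page is a hypothetical helper name.

```shell
# Round a byte count down to a whole number of pages.
# The second argument is the page size; if omitted, ask the system.
round_down_to_page() {
  bytes="$1"
  page="${2:-$(getconf PAGESIZE)}"
  echo $(( bytes / page * page ))   # integer division discards the remainder
}

round_down_to_page 5000000 4096    # prints 4997120
```

5000000 bytes is 1220 whole pages plus a fraction, and 1220 x 4096 is 4997120.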

Now start a new terminal session (e.g., open a new terminal window). This new session will represent the container whose resources are to be restricted. In the new session, get the shell's process ID:

$ echo $$
26900

Of course, it's highly unlikely that you will see the same process ID; substitute the real process ID in the following steps.

In the original session, add the new session's shell to the mycontainer group, by writing its process ID to cgroup.procs:

# echo 26900 > /sys/fs/cgroup/memory/mycontainer/cgroup.procs 

Check the processes in the mycontainer group:

# cat /sys/fs/cgroup/memory/mycontainer/cgroup.procs 
26900

There's just the one process, as expected. You can also check a process's group membership using ps:

# ps -o cgroup 26900
2:memory:/mycontainer,1:name=elogind:/1

Note that the process is in memory group mycontainer as expected.
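
The same information is available from the file /proc/PID/cgroup, which has one line per hierarchy in the form ID:controllers:path. As a sketch, the path for the memory controller can be picked out with awk; memory_cgroup_path is a hypothetical helper name, and the sample input simply repeats the ps output shown above.

```shell
# Read text in the /proc/PID/cgroup format on standard input and print
# the path of the group attached to the memory controller.
memory_cgroup_path() {
  awk -F: '{
    n = split($2, ctl, ",")          # field 2 may list several controllers
    for (i = 1; i <= n; i++)
      if (ctl[i] == "memory") {
        sub(/^[^:]*:[^:]*:/, "")     # drop the first two fields, keep the path
        print
      }
  }'
}

printf '2:memory:/mycontainer\n1:name=elogind:/1\n' | memory_cgroup_path   # prints /mycontainer
```

For a live process, redirect the file into the function instead, for example memory_cgroup_path < /proc/26900/cgroup.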

Although it isn't obvious in this example, when a specific process is moved into a particular control group, all threads in that process will move with it.

Now, back in the 'container' terminal, run a sub-shell:

$ bash
$

and show the new shell's process ID:

$ echo $$
27338

Again, you're unlikely to see the same process ID, so substitute yours in the steps below. Now check the members of the group mycontainer again:

# cat /sys/fs/cgroup/memory/mycontainer/cgroup.procs 
26900
27338

Note that there are now two processes in the memory control group -- the original bash shell (26900), and the sub-shell it spawned (27338). In general, any process spawned by the original shell will be in the same control groups -- not just for memory, but for all resources.
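
Inheritance can also be checked from inside the 'container' session, without root access, by comparing the shell's own entry in /proc with that of a freshly started child. This is only a sketch; same_cgroups_as_child is a hypothetical helper name.

```shell
# Succeed if a newly spawned child process reports exactly the same
# control group membership as the current shell.
same_cgroups_as_child() {
  parent="$(cat /proc/$$/cgroup)"             # $$ is the current shell
  child="$(sh -c 'cat /proc/self/cgroup')"    # "self" is the cat run by the child shell
  [ "$parent" = "$child" ]
}
```

In the restricted session, same_cgroups_as_child && echo inherited should print "inherited".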

Now check the memory used by the mycontainer group of processes:

# cat /sys/fs/cgroup/memory/mycontainer/memory.usage_in_bytes
3182592

We're already pretty close to the 5Mb (approx) limit.
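
The remaining headroom is simply the limit minus the usage. A sketch of the calculation follows; cgroup_headroom is a hypothetical helper name, and it takes the group directory as an argument.

```shell
# Print the number of bytes left before a memory group reaches its limit.
# $1 is the group directory, e.g. /sys/fs/cgroup/memory/mycontainer
cgroup_headroom() {
  limit="$(cat "$1/memory.limit_in_bytes")"
  usage="$(cat "$1/memory.usage_in_bytes")"
  echo $(( limit - usage ))
}
```

With the figures shown above, 4997120 - 3182592 leaves 1814528 bytes, or a little under 2Mb.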

Now copy the following C code to a file called memory_eater.c.

// A trivial program to test how much memory a process can allocate using
//  malloc() before it fails.
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
  {
  const size_t block_size = 4096;
  // Running total of bytes allocated; 64-bit, so it cannot overflow an int
  long long total = 0;
  while (1)
    {
    // The blocks are deliberately never freed
    if (malloc (block_size))
      {
      total += block_size;
      printf ("malloc() OK %lld bytes\n", total);
      }
    else
      {
      printf ("malloc() failed after %lld bytes\n", total);
      return 1; // Stop, rather than retrying the failed allocation for ever
      }
    }
  }

This simple program just allocates memory in 4kB blocks for as long as it can. With or without a memory resource limit, it will fail eventually.

Compile the program using gcc:

$ gcc -o memory_eater memory_eater.c

Note that if you're running this command in the shell whose memory has been restricted, you might find gcc very slow, or it might even fail. In that case, use a different session, or increase the memory limit. The output of the compiler will be an executable called memory_eater.

Now, in the session whose memory has been restricted, run the 'memory eater' program. It will produce a lot of output, but it will begin and end as follows:

$ ./memory_eater
malloc() OK 4096 bytes
...
malloc() OK 886104064 bytes
Killed

The program eventually failed, as we see. However, there are two important things to note here.

First, in my test, the allocations carried on succeeding long after the total had passed the 5Mb limit set in the cgroup, as the last 'OK' line above shows. The exact figure might be different on your system, but it's likely to be well above the limit. Second, the malloc() call always appeared to succeed (that is, it returned a non-null pointer), even though the process eventually got killed for lack of memory.

Both these findings are typical of Linux, when using C programs linked with glibc (i.e., just about every application you're likely to encounter). Memory allocation is optimistic: malloc() obtains address space from the kernel, which by default does not commit physical memory until the pages are actually used, so the call will appear to succeed even when there is no more memory available. The amount of memory that gets allocated before the program fails is the result of a race between the memory_eater program itself, and the Linux out-of-memory monitor.

As an aside: in introductory C programming classes we usually teach students to check the return value from malloc() to ensure that the allocation succeeded. You can see that this is largely fruitless when developing for Linux -- there's actually no way to tell simply from the malloc() result whether the allocation succeeded or not.
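
The system-wide policy behind this optimism can be read from /proc/sys/vm/overcommit_memory. It is separate from the cgroup limit, but it helps explain why malloc() so rarely returns a failure. The sketch below only translates the number into words; describe_overcommit is a hypothetical helper name.

```shell
# Translate the value of /proc/sys/vm/overcommit_memory into words.
describe_overcommit() {
  case "$1" in
    0) echo "heuristic overcommit" ;;                # the usual default
    1) echo "always overcommit" ;;
    2) echo "strict accounting (no overcommit)" ;;
    *) echo "unrecognised value: $1" ;;
  esac
}
```

A typical invocation would be describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)".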

The significant point to take away from this discussion is that out-of-memory failures are often not clean -- few Linux applications will output a clear "I am out of memory" message: they will just stop. Still, the 5Mb memory limit imposed by cgroups is being respected -- it's just not clear in the behaviour of the test program.
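
One way to see that the limit really was enforced is to look at the counters the memory controller keeps for the group: memory.max_usage_in_bytes records the highest usage seen, and memory.failcnt counts how many times usage hit the limit. The following sketch prints whichever of these files are readable; limit_report is a hypothetical helper name.

```shell
# Print the limit and the enforcement counters for a memory group.
# $1 is the group directory, e.g. /sys/fs/cgroup/memory/mycontainer
limit_report() {
  for f in memory.limit_in_bytes memory.max_usage_in_bytes memory.failcnt; do
    if [ -r "$1/$f" ]; then
      printf '%s = %s\n' "$f" "$(cat "$1/$f")"
    fi
  done
}
```

After a run of memory_eater, a non-zero memory.failcnt suggests that the group did run into its limit.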

You might find that some applications behave completely differently in conditions of memory starvation: they might block, for example. If I try to run LibreOffice from the shell with the memory restriction, it just hangs. It will resume if the limit is raised sufficiently. In addition, if you're running tests of this kind on a desktop system, be aware that running a command does not necessarily do all the work in the same session. For example, if I run:

$ firefox

it might succeed, despite the memory limitation. That's because the firefox invocation is just signalling an existing instance of Firefox to open a new window or tab. That existing instance is not subject to the memory limitation.

Other resources

Similar considerations to those above apply to the regulation of allocated CPU resources. However, there are a number of subtleties here, and it's a lot harder to demonstrate CPU control with simple command-line tests than it is for memory. You can control how many CPU cores are assigned to threads, what fraction of CPU time is allocated, and so on. It is also possible, using different controllers, to restrict the number of threads or processes that a particular process can spawn -- as we can do with ulimit, but with more fine-grained control.
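
To see which controllers a particular kernel offers, look at /proc/cgroups, which lists each controller with a final column saying whether it is enabled. Here is a sketch of a filter for that file; enabled_controllers is a hypothetical helper name, and the sample rows are made up for illustration.

```shell
# Read text in the /proc/cgroups format on standard input and print the
# names of the controllers whose "enabled" column is 1.
enabled_controllers() {
  awk '!/^#/ && $4 == 1 { print $1 }'   # skip the header line, which starts with #
}

# Made-up sample input: two enabled controllers and one disabled.
printf '#subsys_name\thierarchy\tnum_cgroups\tenabled\ncpuset\t3\t1\t1\nmemory\t2\t4\t1\npids\t0\t1\t0\n' | enabled_controllers
```

On a real system, run enabled_controllers < /proc/cgroups instead.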

Summary

The foregoing demonstration illustrated how easy it is to manage the resources assigned to particular processes, which is a key feature of container management.

In practice, administrators probably won't manage control groups by manipulating the files in the /sys/fs/cgroup directory -- there are many tools and libraries that simplify the administrative work. However, there's no better way to understand the implementation than to work at this low level.

The next article in this series demonstrates how to use chroot to isolate the container's filesystems from the filesystem of the host.