Why you can't rely on system calls to obtain limits, when running an application in a container
It's legitimate for an application to want to know what system resources it has available -- number of CPUs, total memory, free memory, that kind of thing.
You might argue that an application should simply attempt to minimize resource usage. That's a reasonable stance for some applications, but it's often impossible to predict the resource requirements of an application in advance. Why? Because these requirements depend on external factors, such as the amount of load. A webserver, for example, will use much more CPU and memory when it's under heavy load from browsers, compared with its no-load usage.
The reality is that many general-purpose applications will need to behave differently, depending on the platform on which they run. They will need to know what limits there are on CPU, memory, and other resources.
Limits are nebulous and poorly-defined
The idea of a limit on "available memory" was well-defined up until about 1985. At that time it was simply the total size of the RAM chips in the computer. This concept started to become a bit woolly once the idea of disk swapping really took hold and, with it, the notion of 'virtual memory'. A swap file or swap partition is "available memory" in some sense, isn't it? In fact, a program might need to know both the amount of physical memory (e.g., plugged-in RAM chips) and the amount of swap space it is likely to have at its disposal.
Moreover, in a multi-processing system, the available memory might be completely different to the installed memory. After all, the installed memory -- whatever form it takes -- has to be shared between processes. So is the available memory the total installed memory? Or the fraction of the installed memory that is not currently used by other processes? Again, the application might need to know both.
So, while this article is primarily about containers, containers didn't create the problem. Many limits are simply not well-defined to start with. All that's happened is that widespread use of containers has made this problem more acute.
Programming languages don't make the situation any clearer
The best that a programming language -- or its libraries -- will usually do is to give some vague notion of the applicable limit. For example, in most C implementations we can get the total number of physical memory pages like this:
#include <unistd.h>
..
long mem_pages = sysconf (_SC_PHYS_PAGES);
The documentation says that these are physical, not virtual, memory pages, but it doesn't say any more than that (strictly speaking, _SC_PHYS_PAGES isn't part of POSIX at all; it's a widely-implemented extension). It isn't clear whether this figure should include swap space (on Linux, it does not).
In particular, it isn't clear whether the limit applies to the whole system, or to some specific container, which is where the problems really start. Note that there is no standard sysconf call to get the 'free' memory (glibc's _SC_AVPHYS_PAGES is a non-standard extension) -- a program will have to use platform-specific techniques to get that information. On Linux, we might parse /proc/meminfo, for example. This pseudo-file contains a lot of subtle memory-related data that the program might be able to interpret -- if the programmer can.
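To make this concrete, here is a minimal sketch of both approaches on Linux. The function names total_physical_bytes and meminfo_kb are my own inventions for illustration, and I'm assuming a C library that provides _SC_PHYS_PAGES (glibc does) and a /proc/meminfo whose lines look like "MemTotal: 32239580 kB".

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Total physical memory in bytes as reported by sysconf(), or -1 on
   failure. Assumes the C library supports _SC_PHYS_PAGES. */
long long total_physical_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}

/* Look up one field (e.g. "MemAvailable") in a meminfo-style file.
   Returns the value in kB, or -1 if the file or the field is missing. */
long long meminfo_kb(const char *path, const char *key)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[256];
    long long value = -1;
    size_t key_len = strlen(key);

    while (fgets(line, sizeof line, f)) {
        /* Match the whole field name, which is followed by a colon. */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            if (sscanf(line + key_len + 1, "%lld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

A program would call meminfo_kb("/proc/meminfo", "MemAvailable") to get the kernel's own estimate of how much memory could be allocated without swapping; that field only exists on reasonably recent kernels, so the -1 case needs handling. Both functions report whatever the kernel reports, which is not necessarily what the application is actually allowed to use.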
Containers muddy the water further
Most container frameworks (Docker, podman, etc.) allow limits to be set on a per-container basis. This makes a lot more sense than relying on system-wide limits. After all, by its very nature a container framework is likely to be hosting multiple, independent containers.
On Linux, container frameworks typically use control groups ("cgroups") to impose limits. We can limit memory, CPU, and other things. The problem is that, although we can impose limits, containers don't have any better way to find out what the limits actually are.
For example, suppose I run a podman container on Linux with a memory limit of 500Mb. Within the container, I run free:
# free
              total        used        free      shared  buff/cache   available
Mem:       32239580     1973844    26679232      250976     3586504    29930888
This first figure -- 32 Gb -- is the total installed RAM on my system. The limit applied to the container is nowhere evident. If an application in the container configured itself on the basis that 32 Gb was available, it would have a nasty surprise. The value returned by free is simply obtained from /proc/meminfo, and is a system limit, not a container limit.
Because I know that podman uses cgroups for memory control, and I know how the cgroups configuration looks from inside the container, I can actually find the total memory available to the container like this:
# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
524288000
The memory available is correctly reported as 500Mb (524288000 bytes is 500 x 1024 x 1024). Similar considerations apply to other resources, like CPU.
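An application can do the same lookup for itself. What follows is only a sketch, with function names of my own choosing. It assumes the container sees its cgroup files in one of two places: the cgroups v1 path shown above, or /sys/fs/cgroup/memory.max, which is where cgroups v2 systems usually put the equivalent value. Other setups may mount these files elsewhere.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a cgroup limit file. Returns the limit in bytes, 0 if the file
   contains "max" (the cgroups v2 way of saying "no limit"), or -1 if
   the file is missing or unreadable. */
long long read_cgroup_limit(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char buf[64];
    long long limit = -1;
    if (fgets(buf, sizeof buf, f)) {
        if (strncmp(buf, "max", 3) == 0) {
            limit = 0;
        } else {
            char *end;
            errno = 0;
            long long v = strtoll(buf, &end, 10);
            if (end != buf && errno != ERANGE && v >= 0)
                limit = v;
        }
    }
    fclose(f);
    return limit;
}

/* Try the assumed cgroups v2 location first, then the v1 location. */
long long container_memory_limit(void)
{
    long long limit = read_cgroup_limit("/sys/fs/cgroup/memory.max");
    if (limit < 0)
        limit = read_cgroup_limit(
            "/sys/fs/cgroup/memory/memory.limit_in_bytes");
    return limit;
}
```

With cgroups v1, an unlimited container reports an enormous number rather than "max", so a caller should treat any value larger than the machine's physical memory as "no limit".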
Java tries to tackle the problem
Life is potentially a little easier for Java programmers. That's because, since JDK 10 (and, through a backport, later updates of JDK 8), the JVM has used various heuristics to determine what kind of container framework it is running in (if any), and to provide container-specific limits to applications.
These methods are only heuristics, however. They do work with Docker and podman -- at least at present -- but whether they work with other kinds of containers, I don't know. They work -- where they work -- by parsing cgroups information, so they won't work with a container that does not use cgroups (happily, all current mainstream container frameworks do).
Because Java's container support is not foolproof, there's a command-line switch to turn it off: -XX:-UseContainerSupport. When it's turned off, the JVM reverts to using system limits from /proc, etc.
Fundamentally, containers are not virtual machines
It's not unusual to hear container frameworks referred to as "lightweight virtual machines" and, to some extent, that's a helpful term. A container has a private filesystem, its own network interfaces, and perhaps its own user identities. It certainly looks like a virtual machine in certain lights.
But it isn't. A container framework does not virtualize the kernel: every container shares the host's kernel. Information retrieved from the kernel, such as the contents of /proc/meminfo, will generally be the same in any container, and the same as on the host.
There is no solution to this problem
Ideally, an application should not need to know what resource limits apply to it. If it does, however, it can't rely on values obtained by simple interrogation of the operating system. There are only two ways to deal with this problem, neither of which is really a solution:
1. Incorporate heuristics that interpret information provided by popular container frameworks. This is done automatically by Java, but will need to be coded for other languages (so far as I know).
2. Provide configuration settings by which installers and users can specify the limits that apply. These settings might be simple command-line switches, or entries in a configuration file, or something else. The installer or user is responsible for setting values that match the limits imposed by the container framework.
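The two approaches can be combined. Here is one possible sketch: an explicit setting wins if present, otherwise the application falls back to a cgroups heuristic, and finally to what the operating system reports. The environment variable APP_MEMORY_LIMIT_BYTES and all the function names are hypothetical, and the cgroup paths are assumptions that hold for typical Docker and podman setups but are not guaranteed elsewhere.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Parse a positive byte count; returns -1 if the text is not one. */
long long parse_bytes(const char *text)
{
    char *end;
    errno = 0;
    long long v = strtoll(text, &end, 10);
    if (end == text || errno == ERANGE || v <= 0)
        return -1;
    return v;
}

/* First line of a cgroup limit file as a byte count, or -1 if the file
   is missing or does not hold a number (e.g. "max", meaning no limit). */
long long cgroup_limit(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char buf[64];
    long long v = fgets(buf, sizeof buf, f) ? parse_bytes(buf) : -1;
    fclose(f);
    return v;
}

/* Memory the application should plan around, in bytes, or -1 if unknown. */
long long effective_memory_limit(void)
{
    /* 1. Explicit setting from the installer or user (hypothetical name). */
    const char *override = getenv("APP_MEMORY_LIMIT_BYTES");
    if (override) {
        long long v = parse_bytes(override);
        if (v > 0)
            return v;
    }

    /* 2. What the operating system reports for the whole machine. */
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    long long host = (pages > 0 && page_size > 0)
                         ? (long long)pages * page_size : -1;

    /* 3. cgroups heuristic: assumed v2 path first, then the v1 path. */
    long long cg = cgroup_limit("/sys/fs/cgroup/memory.max");
    if (cg < 0)
        cg = cgroup_limit("/sys/fs/cgroup/memory/memory.limit_in_bytes");

    /* Trust the cgroup figure only when it is tighter than the host's. */
    if (cg > 0 && (host < 0 || cg < host))
        return cg;
    return host;
}
```

Putting the explicit setting first is a deliberate choice: the heuristic can be wrong, and the person who configured the container is the only one who knows for sure what limits were applied.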