Container from scratch: Using chroot to isolate the filesystem

In the previous article in this series I demonstrated how we could use kernel control groups to manage the resources available to groups of containers. That's fine, so far as it goes, but it doesn't create a container, just a managed group of processes. For these processes to amount to a container, they need to be self-contained -- they need to see their own processes, their own filesystem mounts, and their own network identity. Most significantly, they need to have a private filesystem, completely distinct from the host's filesystem. The container needs to amount to a sandbox, such that one container cannot influence the host, or any other container.

Overview

The canonical way to provide a private filesystem is to use the chroot utility. This utility existed long before the advent of contemporary container technology, and continues to be useful.

chroot runs a process with its root filesystem at some user-defined location in the parent process's filesystem. The chrooted child process can only see the part of the parent's filesystem below this point, which becomes the child process's '/' directory. We can think of the parent process -- the one that runs chroot -- as the 'host', and the child process -- the one that runs inside the private root -- as the 'container' process. There are ways to make the host filesystem accessible to the container process, but the default is that the container cannot see above its private root.

The general invocation of chroot looks like this:

# chroot {directory} {executable}

A particular problem with chroot is that it requires elevated privileges -- a regular user won't be able to run it. That remains the case even if the process to be executed will run perfectly happily as an unprivileged user (but see a later article in this series, where I describe the use of the unshare() system call to provide "rootless containers"). A way to simulate the behaviour of chroot as a regular user is to use fakechroot. This is a library that gets pre-loaded into a program to be executed, and which intercepts all kernel system calls that take filenames as arguments. The library rewrites the filename so that it is relative to the private root directory, and then forwards the call to the kernel.
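For example, assuming the fakechroot package is installed, an unprivileged user can wrap an ordinary chroot invocation like this (a sketch -- the directory name is just an illustration):

$ fakechroot chroot ~/myroot /bin/sh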

fakechroot is invaluable when you need to provide container-like services without root access, and I've used it that way myself frequently. However, there are complications, particularly surrounding file ownership and permissions. In particular, while you can fake the location of a file or directory, you can't fake its ownership the same way. This problem can also be overcome by similar pre-load trickery, but it's fiddly.

As a result, just for simplicity, I assume in this article and the following ones that you can run chroot, as root.

There are two difficulties administrators (and container builders) typically face when using chroot (or fakechroot).

Consider this invocation:

# chroot /home/kevin/container /bin/sh 

The executable /bin/sh is a path inside the "container": in the container's private filesystem it actually corresponds to the host path /home/kevin/container/bin/sh. This file, of course, needs to exist -- and all its dependencies need to exist, too. In a regular flavour of Linux, these dependencies will almost certainly include a bunch of shared objects (.so files) -- the standard C library, the dynamic linker, etc. It's often fiddly to find out what the dependencies are, except by a mixture of inspection of the binaries and trial-and-error. If you install a package manager in the container (that is, in the private filesystem), you can let the package manager take care of dependency management -- but this often results in bloated containers. Still, storage is cheap, and administrator time often isn't, so there's a trade-off to be made.
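On a glibc-based host, ldd is a quick way to inspect a binary's shared-object dependencies, which then have to be copied into matching paths under the private root. The paths below are only illustrative -- they are typical of a Debian-style AMD64 system, and will differ on yours:

$ ldd /bin/sh
        linux-vdso.so.1 (0x...)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)
        /lib64/ld-linux-x86-64.so.2 (0x...)
$ mkdir -p container/lib/x86_64-linux-gnu container/lib64
$ cp /lib/x86_64-linux-gnu/libc.so.6 container/lib/x86_64-linux-gnu/
$ cp /lib64/ld-linux-x86-64.so.2 container/lib64/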

Being able to run /bin/sh -- a shell, presumably -- isn't particularly interesting on its own. What we really want to run is a containerized service. That may require a whole bunch of libraries, supporting executables (a Java virtual machine, for example), configuration files, and data. All these files must be placed into the private filesystem, and in the correct places. "Correct" in this case means "correct as seen from within the private filesystem".

In the following demonstration, I will set up a container using a small, self-contained Linux distribution in a chroot. This distribution contains just enough utilities to configure the container, and prove that it is working as a container. This distribution -- Alpine Linux -- is often used as the basis for real-world containers, for Docker and other frameworks.

Demonstration

First, create a directory to serve as the root of the container's private filesystem. It doesn't matter where this directory is, and most of the files in it will end up being owned by root. I will name this directory, unimaginatively, 'container'.

$ mkdir container

Download the "mini root filesystem" version of Alpine Linux from the Alpine website. You'll need to choose the version that is appropriate for your architecture. I'm using version 3.12 for AMD64. By the time you read this, there may be a later release -- for the purposes of this demonstration it doesn't matter much which version you use.

Unpack the entire distribution -- it's only a few megabytes -- into the container directory, and then set all the files to be owned by root:

$ cd container
$ tar xvfz /path/to/alpine-minirootfs-3.12.0-x86_64.tar.gz
$ cd ..
# chown -R root:root container/

You should now have a minimal, but functional, Linux root filesystem in the container directory.
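A quick look from the host side should show the familiar top-level directories -- something like this:

$ ls container
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var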

Now run the shell sh in a private, chrooted filesystem:

# chroot container /bin/sh -l

The "-l" ('login') switch to sh has the effect of setting the $PATH in a way that is appropriate for the container. You could just run sh, and then set $PATH from within the container instead.

Using cd and ls, navigate around the new filesystem and verify that it is completely self-contained -- you can't "cd .." out of the container into the host's filesystem.

# ls /
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr

Although the chrooted process can't see "out" of the container into the host, the host can see into the container -- to the host it's just a bunch of files in a directory called container.

If you run ps in the container, you'll note that there appear to be no processes at all -- not even the shell that is, quite evidently, running. This is a symptom of a wider problem: all the standard virtual filesystems like /dev and /proc are empty. In fact, they aren't mounted -- not in the container's filesystem, anyway. From a container perspective, whether the absence of these mounts matters depends on how the container is going to be used. For experimental purposes it matters a lot, because it's hard to troubleshoot without being able to run fundamental utilities.
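You can confirm this from inside the chroot, and mount proc by hand if you want ps to work straight away (the start-up script below does the same thing):

# ls /proc     # empty -- nothing is mounted here yet
# mount -t proc proc /proc
# ps           # now shows processes -- the host's, as we'll see shortly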

Another problem is that, inside the container, there are no users other than root. We probably don't want to run real container services as root, even when the container is completely sandboxed -- it's just asking for trouble. So we need to create at least one unprivileged user, and probably a home directory for that user.

It's probably easiest to do this setup from a script that can be run repeatedly to initialize the container session. Here is an example, which I suggest saving as container/bin/start.sh (or, from inside the container, simply as /bin/start.sh). The script creates an unprivileged user and a home directory (if they do not already exist), mounts the usual pseudo-filesystems, and starts a shell as the unprivileged user.

CAUTION! This script must be run from inside the container (see below). Otherwise, it will modify the host system's user credentials, which is almost certainly not what you want.

#!/bin/sh
# Initialize the container session: create an unprivileged user and group
# if necessary, mount the pseudo-filesystems, and start a shell.

PATH=/bin:/usr/bin:/sbin

if grep -q myuser /etc/passwd; then
  echo "myuser already exists"
else
  echo "Adding user myuser"
  echo "myuser::2000:2000:user:/home/myuser:/bin/sh" >> /etc/passwd
  mkdir -p /home/myuser
  chown -R 2000:2000 /home/myuser
fi

if grep -q mygroup /etc/group; then
  echo "mygroup already exists"
else
  echo "Adding group mygroup"
  echo "mygroup:x:2000:" >> /etc/group
fi

# Ignore errors if these pseudo-filesystems are already mounted
mount -t proc proc /proc > /dev/null 2>&1
mount -t devtmpfs dev /dev > /dev/null 2>&1
mount -t sysfs sys /sys > /dev/null 2>&1

exec su - myuser
#exec sh -l  # Use this line instead if you want to run the container as root
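Remember to make the script executable from the host before trying to run it:

# chmod 755 container/bin/start.sh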

Note that, in the script, I have set the user and group ID to 2000. There's no particular significance to this number, except that I want IDs that don't exist on the host system. They won't clash if they do exist, but I want the distinction between the host and the container to be clear. I'm also assuming that the username myuser does not exist in the host system -- perhaps pick a different one if it does. Again, using an existing user won't break anything, but the whole purpose of this exercise is to see that the world looks different from inside the container.
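If you want to check on the host first, getent should print nothing for a username or numeric ID that isn't in use:

$ getent passwd myuser
$ getent passwd 2000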

You can run the container like this:

# chroot container /bin/start.sh
Adding user myuser
Adding group mygroup
$

All being well, you now have a session inside the container as the unprivileged user myuser. Try creating a file:

$ touch x
$ ls -l
total 0
-rw-r--r--    1 myuser   mygroup          0 Jul  9 12:27 x

Notice that, within the container, the user myuser exists, but on the host system it (presumably) does not.
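You can see the same thing from the host side. The new file is just container/home/myuser/x and, because the host has no user or group with ID 2000, ls falls back to showing the raw numeric IDs -- output along these lines:

$ ls -l container/home/myuser/
total 0
-rw-r--r-- 1 2000 2000 0 Jul  9 12:27 x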

If you run ps now, you'll see all the same processes, with the same process IDs, as in the host system. This, along with the shared network configuration and hostname, indicates that the container is not yet properly isolated from the host. The container is sandboxed at the filesystem level, but it could still interfere with the host's processes, or make arbitrary network connections using the host's IP address. Solving these problems will be the subject of the following articles.

Before moving on, there's one other point that needs consideration, with regard to Docker, podman, et al. Our proto-container has a persistent filesystem. The filesystem layout of the container is effectively what is called an image in Docker language -- it's just an arrangement of files in directories. However, our container's filesystem can be modified, and the modifications remain in effect between invocations of any process in the container. This isn't how Docker works -- each new invocation gets a new, pristine copy of the filesystem in the image.

A simple way to simulate this behaviour would be to keep a compressed archive of the filesystem, with whatever permanent configuration is required, and unpack it into some temporary directory each time the container's process is invoked. Of course, we'd need to clean up these working copies of the filesystem at some point -- either manually or automatically.
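A minimal sketch of that approach, assuming the configured filesystem has been archived to a tarball (the archive name here is just an illustration):

$ RUNDIR=$(mktemp -d /tmp/container.XXXXXX)
$ tar xzf /path/to/container-image.tar.gz -C "$RUNDIR"
# chroot "$RUNDIR" /bin/start.sh
  ... work in the container ...
# rm -rf "$RUNDIR"    # discard the working copy when finished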

There are also various ways we could implement the layering technology of regular container tools. A simple one would be to unpack the various layers into a temporary directory, starting at the bottom layer and moving up to the top.
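For instance, with a set of hypothetical layer archives ordered from bottom to top:

$ for layer in base.tar.gz libs.tar.gz app.tar.gz; do tar xzf "$layer" -C "$RUNDIR"; done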

I'm not going to discuss any of these issues further, because implementing solutions for them is just a matter of routine scripting.

In the next article, I will describe how we can use unshare to create a new namespace for the container, and give it an operating environment that is mostly decoupled from the host. Then we'll be well on the way to running a real container.