Container from scratch: Using chroot to isolate the filesystem
In the previous article in this series I demonstrated how we could use kernel control groups to manage the resources available to groups of containers. That's fine, so far as it goes, but it doesn't create a container, just a managed group of processes. For these processes to amount to a container, they need to be self-contained -- they need to see their own processes, their own filesystem mounts, and their own network identity. Most significantly, they need to have a private filesystem, completely distinct from the host's filesystem. The container needs to amount to a sandbox, such that one container cannot influence the host, or any other container.
Overview
The canonical way to provide a private filesystem is to use the chroot utility. This utility existed long before the advent of contemporary container technology, and continues to be useful. chroot runs a process with its root filesystem at some user-defined location in the parent process's filesystem. The chrooted child process can only see the part of the parent's filesystem below this point, which becomes the child process's '/' directory. We can think of the parent process -- the one that runs chroot -- as the 'host', and the child process -- the one that runs inside the private root -- as the 'container' process. There are ways to make the host filesystem accessible to the container process, but the default is that the container cannot see above its private root.
The general invocation of chroot looks like this:
# chroot {directory} {executable}
A particular problem with chroot is that it requires elevated privileges -- a regular user won't be able to run it. That remains the case even if the process to be executed will run perfectly happily as an unprivileged user (but see a later article in this series, where I describe the use of the unshare() system call to provide "rootless containers").
A way to simulate the behaviour of chroot as a regular user is to use fakechroot. This is a library that gets pre-loaded into the program to be executed, and which intercepts the C library calls that take filenames as arguments. The library rewrites the filename so that it is relative to the private root directory, and then forwards the call to the kernel.
fakechroot is invaluable when you need to provide container-like services without root access, and I've used it that way myself frequently. However, there are complications, particularly surrounding file ownership and permissions. In particular, while you can fake the location of a file or directory, you can't fake its ownership the same way. This problem can also be overcome by similar pre-load trickery, but it's fiddly.
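To give a flavour of it -- and this is only a sketch, assuming the fakechroot package (and, for faking ownership, its companion fakeroot) is installed -- the usual shape of an invocation is:
$ fakechroot chroot ./container /bin/sh
Prefixing the command with fakeroot as well (fakechroot fakeroot chroot ...) is the usual way to paper over the ownership problem mentioned above.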
As a result, just for simplicity, I assume in this article and the following ones that you can run chroot as root.
There are two difficulties administrators (and container builders) typically face when using chroot (or fakechroot):
- It isn't immediately obvious that executable is a path inside the private filesystem.
- It's often not obvious how extensive the dependencies of the program we want to run in the chroot are.
Consider this invocation:
# chroot /home/kevin/container /bin/sh
The executable /bin/sh refers to a path inside the "container". The path in the container's private filesystem actually corresponds to the host path /home/kevin/container/bin/sh. This file, of course, needs to exist -- and all its dependencies need to exist, too. In a regular flavour of Linux, these dependencies will almost certainly include a bunch of shared objects (.so files) -- the standard C library, the dynamic linker, etc. It's often fiddly to find out what the dependencies are, except by a mixture of inspection of the binaries, and trial-and-error. If you install a package manager in the container (that is, in the private filesystem) you can let the package manager take care of dependency management -- but this often results in bloated containers. Still, storage is cheap, and administrator time often isn't, so there's a trade-off to be made.
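For a sense of the manual approach -- this is just a sketch, and the library paths shown are illustrative of a glibc-based host rather than definitive -- ldd lists the shared objects a binary needs, and GNU cp's --parents option copies files into the private root while preserving their directory layout:
# ldd /bin/sh
# mkdir -p container
# cp --parents /bin/sh /lib/x86_64-linux-gnu/libc.so.6 /lib64/ld-linux-x86-64.so.2 container/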
Being able to run /bin/sh -- a shell, presumably -- isn't particularly interesting on its own. What we really want to run is a containerized service. That may require a whole bunch of libraries, supporting executables (a Java virtual machine, for example), configuration files, and data. All these files must be placed into the private filesystem, and in the correct places. "Correct" in this case means "correct as seen from within the private filesystem".
In the following demonstration, I will set up a container using a small, self-contained Linux distribution in a chroot. This distribution contains just enough utilities to configure the container, and prove that it is working as a container. This distribution -- Alpine Linux -- is often used as the basis for real-world containers, for Docker and other frameworks.
Demonstration
First, create a directory to serve as the root of the container's private filesystem. It doesn't matter where this directory is, and most of the files in it will end up being owned by root. I will name this directory, unimaginatively, 'container'.
$ mkdir container
Download the "mini root filesystem" version of Alpine Linux from the Alpine website. You'll need to choose the version that is appropriate for your architecture. I'm using version 3.12 for AMD64. By the time you read this, there may be a later release -- for the purposes of this demonstration it doesn't matter much which version you use.
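For example, something like the following will fetch it -- but the exact URL and filename depend on the release and architecture you choose, so check the Alpine downloads page rather than trusting this one:
$ wget https://dl-cdn.alpinelinux.org/alpine/v3.12/releases/x86_64/alpine-minirootfs-3.12.0-x86_64.tar.gz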
Unpack the entire distribution -- it's only a few megabytes -- into the container directory, and then set all the files to be owned by root:
$ cd container
$ tar xvfz /path/to/alpine-minirootfs-3.12.0-x86_64.tar.gz
$ cd ..
# chown -R root:root container/
You should now have a minimal, but functional, Linux root filesystem in the container directory.
Now run the shell sh in a private, chrooted filesystem:
# chroot container /bin/sh -l
The "-l" ('login') switch to sh has the effect of setting $PATH in a way that is appropriate for the container. You could just run sh, and then set $PATH from within the container instead.
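For example -- using the same PATH that the start-up script later in this article uses, which you may well want to adjust:
# chroot container /bin/sh
# export PATH=/bin:/usr/bin:/sbin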
Using cd and ls, navigate around the new filesystem and verify that it is completely self-contained -- you can't "cd .." out of the container into the host's filesystem.
# ls /
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
Although the chrooted process can't see "out" of the container into the host, the host can see into the container -- to the host it's just a bunch of files in a directory called container.
If you run ps in the container, you'll note that there appear to be no processes at all -- not even the shell that is, quite evidently, running. This is a symptom of a wider problem: all the standard virtual filesystems like /dev and /proc are empty. In fact, they aren't mounted -- not in the container's filesystem, anyway.
From a container perspective, whether the absence of these directories matters or not depends on how the container is going to be used. For experimental purposes it matters a lot, because it's hard to troubleshoot without being able to run fundamental utilities.
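For example, from a root shell inside the container, mounting /proc by hand is enough to bring ps back to life (the start-up script below does this for you, so there's no need to do it manually every time):
# mount -t proc proc /proc
# ps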
Another problem is that, inside the container, there are no users other than root. We probably don't want to run real container services as root, even when the container is completely sandboxed -- it's just asking for trouble. So we need to create at least one unprivileged user, and probably a home directory for that user.
It's probably easiest to do this setup from a script that can be run repeatedly to initialize the container session. Here is an example, which I suggest saving as container/bin/start.sh (or, from inside the container, simply as /bin/start.sh) and making executable. The script creates an unprivileged user and home directory -- if they do not already exist -- mounts the usual pseudo-filesystems, and starts a shell as the unprivileged user.
CAUTION! This script must be run from inside the container (see below). Otherwise, it will modify the host system's user credentials, which is almost certainly not what you want.
#!/bin/sh
# Initialize the container session: create an unprivileged user and group
# if necessary, mount the usual pseudo-filesystems, and start a shell.
PATH=/bin:/usr/bin:/sbin

grep myuser /etc/passwd > /dev/null
if [ $? = 0 ] ; then
  echo "myuser already exists";
else
  echo "Adding user myuser";
  echo "myuser::2000:2000:user:/home/myuser:/bin/sh" >> /etc/passwd
  mkdir -p /home/myuser
  chown -R 2000:2000 /home/myuser
fi

grep mygroup /etc/group > /dev/null
if [ $? = 0 ] ; then
  echo "mygroup already exists";
else
  echo "Adding group mygroup";
  echo "mygroup:x:2000:" >> /etc/group
fi

mount -t proc proc /proc > /dev/null 2>&1
mount -t devtmpfs dev /dev > /dev/null 2>&1
mount -t sysfs sys /sys > /dev/null 2>&1

exec su - myuser
#sh -l    # Use this line instead if you want to run the container as root
Note that, in the script, I have set the user and group ID to 2000. There's no particular significance to this number, except that I want IDs that don't exist on the host system. They won't clash if they do exist, but I want the distinction between the host and the container to be clear. I'm also assuming that the username myuser does not exist in the host system -- perhaps pick a different one if it does. Again, using an existing user won't break anything, but the whole purpose of this exercise is to see that the world looks different from inside the container.
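If in doubt, a quick check on the host -- assuming your user accounts live in /etc/passwd rather than, say, LDAP -- will tell you whether the name is already taken (no output means it is free):
$ grep myuser /etc/passwd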
You can run the container like this:
# chroot container /bin/start.sh
Adding user myuser
Adding group mygroup
$
All being well, you now have a session inside the container as the unprivileged user myuser. Try creating a file:
$ touch x
$ ls -l
total 4
-rw-r--r-- 1 myuser mygroup 1 Jul 9 12:27 x
Notice that, within the container, the user myuser exists, but in the host system it (presumably) does not.
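You can see the same distinction from the host side: the new file is right there in the container directory but, because UID 2000 (presumably) has no name on the host, ls reports only the numeric IDs. The output below is abridged, and assumes the host really has no user or group 2000:
$ ls -l container/home/myuser/x
-rw-r--r-- 1 2000 2000 ... x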
If you run ps now, you'll see all the same processes, with the same process IDs, as in the host system. This, along with the common network configuration and hostname, indicates that the container is not yet properly isolated from the host. The container is sandboxed at the filesystem level, but it could still interfere with the host's processes, or make arbitrary network connections using the host's IP number. Solving these problems will be the subject of the following articles.
Before moving on, there's one other point that needs consideration, with regard to Docker, podman, et al. Our proto-container has a persistent filesystem. The filesystem layout of the container is effectively what is called an image in Docker language -- it's just an arrangement of files in directories. However, our container's filesystem can be modified, and the modifications remain in effect between invocations of any process in the container. This isn't how Docker works -- each new invocation gets a new, pristine copy of the filesystem in the image.
A simple way to simulate this behaviour would be to keep a compressed archive of the filesystem, with whatever permanent configuration is required, and unpack it into some temporary directory each time the container's process is invoked. Of course, we'd need to clean up these working copies of the filesystem at some point -- either manually or automatically.
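A minimal sketch of that approach -- assuming the configured filesystem has already been archived, for example with tar czf container.tar.gz -C container . -- might look like this:
# work=$(mktemp -d /tmp/container.XXXXXX)
# tar xzf container.tar.gz -C "$work"
# chroot "$work" /bin/start.sh
# umount "$work/proc" "$work/dev" "$work/sys" 2> /dev/null
# rm -rf "$work"
The umount step matters: the mounts made by start.sh persist on the host after the container shell exits, and recursively deleting a directory with a live devtmpfs mounted inside it is not something you want to do.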
There are also various ways we could implement the layering technology of regular container tools. A simple one would be to unpack the various layers into a temporary directory, starting at the bottom layer and moving up to the top, as sketched below.
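With hypothetical layer names -- and noting that real container engines typically use union or overlay filesystems rather than physically unpacking every layer -- that amounts to little more than a loop:
# work=$(mktemp -d /tmp/container.XXXXXX)
# for layer in base.tar.gz runtime.tar.gz app.tar.gz; do tar xzf "$layer" -C "$work"; done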
I'm not going to discuss any of these issues further, because implementing solutions for them is just a matter of routine scripting.
In the next article, I will describe how we can use unshare to create a new namespace for the container, and give it an operating environment that is mostly decoupled from the host. Then we'll be well on the way to running a real container.