Container from scratch: Using unshare to provide private namespaces
In the previous article in this series I demonstrated how to sandbox the container's filesystem
using chroot
. I explained how to use a minimal Linux
distribution -- Alpine -- in a way that is analogous to an 'image'
in established container technologies.
The demonstration in this article builds directly on the previous one;
if you're interested in following along, please bear this in mind.
If you don't have the container
directory and scripts
from the previous article, none of what follows will make sense.
Background
Modern Linux kernels support process namespaces. A namespace is the set of kernel entities of a particular type that a particular process can see. These entities include:
the process list
network interfaces, IP numbers, routes, and so on
filesystem mount points
network identity -- host and domain names
There are other types of namespace, but these are the ones
that will be manipulated in the demonstrations that follow.
man unshare
should give you the full list.
Although, by default, all processes run in the same namespace,
it's possible to provide any process with one or more private namespaces.
This is what the unshare
utility does.
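A quick way to see which namespaces a process belongs to is to look at the symbolic links under /proc/<pid>/ns; two processes that share a namespace of a given type show the same inode number there. For example (the inode number below is illustrative):
$ readlink /proc/$$/ns/uts /proc/self/ns/uts
uts:[4026531838]
uts:[4026531838]
Here $$ is the shell and /proc/self is the readlink process itself -- two different processes sharing, as we'd expect by default, the same UTS (network identity) namespace.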
Demonstration
In this demonstration, I will show how to isolate the container's network identity, process table, and mount points, and make them independent of the host. Networking in general is something that requires a separate article, because it's somewhat more complicated.
In the previous article I described how to set up the container's filesystem, and to run a shell inside the container using a script:
# chroot container /bin/start.sh
Providing the container with private network identity, mount
points, and process table is very straightforward; just run it
under the control of unshare
, like this:
# unshare -mpfu chroot container /bin/start.sh
The -f
("fork") option is crucial here. This option causes unshare
to fork, and to run the command it is given (the chroot
that starts start.sh
in this case) in a child process; it is that child which gets the new
process namespace, and becomes process ID 1 in the container.
If this isn't done,
we end up with two processes both trying to be process 1 in the
container, and a spiteful and confusing error message (try it).
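If you want to convince yourself that the namespaces really are separate, compare the namespace links in /proc from inside and outside the container. start.sh mounts /proc in the container, so this works in both places; the inode numbers below are illustrative. Inside the container:
$ readlink /proc/$$/ns/pid
pid:[4026532633]
And on the host:
$ readlink /proc/$$/ns/pid
pid:[4026531836]
Different inode numbers mean different namespaces.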
Inside the container, run ps
:
$ ps
PID   USER     TIME  COMMAND
    1 myuser    0:00 -sh
There's a single process running, and -- by definition -- its process ID is '1'. You might be wondering what the process ID really is, that is, what it is outside the container. It's not very easy to tell with a process called "sh", because there could be dozens of processes called "sh" outside the container as well as the one inside. So let's run a process in the container that will be easier to find.
$ nc -l &
$ ps -ef | grep nc
   11 myuser    0:00 nc -l
nc
is a utility for general network testing -- we'll
need this later when discussing network configuration. For now,
note that the nc
process has ID 11 in the container
(on this occasion).
Now look for the nc
process from the host
(that is, outside the container shell):
$ ps -ef | grep "nc -l"
2000      2501  2381  0 14:55 pts/1    00:00:00 nc -l
There's nothing particularly remarkable here -- processes in the container do have their counterparts in the global process table: we'd hardly expect it to be otherwise.
The process ID in the host's namespace is 2501 in this case,
and the owner of the process
is user 2000. There is no user 2000 in the host's /etc/passwd
-- this number only maps onto a name in the container. We've already
seen that the container renders the username correctly; in the host
we just see the numeric user ID.
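If you want to confirm that host process 2501 and container process 11 really are one and the same, the NSpid line in the host's /proc entry for the process lists its PID in each nested PID namespace (the PIDs here are the ones from the run above; you'll need a reasonably recent kernel for this field to be present):
$ grep NSpid /proc/2501/status
NSpid:  2501    11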
So that's processes. Now let's look at network identity.
Exit the container shell, and modify
the start.sh
script so that it sets the
hostname before executing the shell (or anywhere else -- it
doesn't matter). That is, add something like:
hostname mycontainer
'mycontainer' isn't an imaginative hostname, but it will do for purposes of demonstration.
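For context, here is a sketch of what the modified start.sh might look like. The exact contents come from the previous article -- the mounts, and the switch to an unprivileged user ('myuser' here) -- so treat this as illustrative and adapt it to whatever you actually have:
#!/bin/sh
# Mount the kernel pseudo-filesystems the container expects
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
# The new line: give the container its own hostname
hostname mycontainer
# Drop root privileges and start an interactive shell
exec su myuser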
Now start the container shell again, and check the hostname. It should be 'mycontainer' -- but the host system's hostname is unaffected.
It's worth repeating that, although unshare
has provided
the container with a separate network identity, it is still using
the network interfaces from the host. Running ip addr list
will make that clear. Although we've set the container's hostname, that
hostname doesn't map onto a network interface, so it isn't good for much.
For proper network isolation we will need to
tackle the much more complicated subject of virtual networking
-- later.
Finally, let's look at filesystem mounts.
$ mount
dev on /dev type devtmpfs (rw,seclabel,relatime,size=16132272k,...)
sys on /sys type sysfs (rw,seclabel,relatime)
proc on /proc type proc (rw,relatime)
Note that the only mounts visible are the ones made by the
start.sh
script. The host's mounts -- of which there
are probably many more -- are not visible.
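The isolation works in the other direction, too. Because unshare gave the container a private copy of the mount table (with private propagation, the util-linux default), mounts made on the host after the container has started don't appear inside it -- even mounts made underneath the container directory itself. For example, on the host (the path is illustrative):
# mount -t tmpfs tmpfs container/mnt
Inside the container, mount shows no trace of this, and /mnt remains empty; with a plain chroot and no mount namespace, the tmpfs would have been visible at /mnt.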
Some things to watch out for
With a combination of unshare
and chroot
we
can get much of what is required from a container -- a private, sandboxed
filesystem, and container-specific namespaces. It's possible, however,
to overestimate the level of isolation provided by these techniques --
quite apart from sharing network interfaces with the host, a subject we haven't even broached yet.
In the end, a container isn't a virtual machine -- it's just a bunch of processes linked by particular cgroups and namespaces. This means, for example, that you can't get container-specific measures of CPU or I/O load, even though these figures do exist for individual processes.
For a specific example, try running cat /proc/loadavg
from within the container. You'll see exactly the same values
as you would if you ran this command outside the container -- even
though the container's processes may be contributing little
or nothing to the overall load.
This observation does not result from any limitation of the method we're using to create the container -- you'll see exactly the same thing if you log into a Docker or podman container and do the same test. This is no surprise, since these frameworks are using exactly the same technologies. What's different, though, is that when we build a container from scratch, it's much clearer what the limitations of the techniques really are.
A note on rootless containers
Until now, we've been running the container as root
.
I said in an earlier article that it was necessary to be
a privileged user to run chroot
, but that's not
strictly true. We can run chroot
as an ordinary user
by making use of user namespaces.
'User' is one of the other namespace classes that
unshare
can create, and which I haven't discussed
until now. In a user namespace, user and group IDs in the
container are mapped onto different user IDs and groups in the
host. If we use the -r
switch to unshare,
this sets up a mapping from the user running the utility to
user ID 0 inside the new namespace. With an apparent user ID of 0, chroot
will succeed.
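You can see the effect of -r on its own, without the container, by starting a shell in a new user namespace (the exact id output may vary from system to system):
$ unshare -r
# id
uid=0(root) gid=0(root) groups=0(root)
# exit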
So try this:
$ unshare -Urmupf chroot container /bin/start.sh
mycontainer:/#
You'll get a bunch of error messages because, seen from inside
the new user namespace, the files in the container
directory do not belong to the user running the shell.
You can fix this by setting the ownership of all the files in the
container
directory to that of the current (unprivileged)
user.
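For example, if the account that will run unshare is 'myuser' (an assumption -- use whichever account you actually work as), a one-off chown on the host will do it. This is a set-up step, so it's fine to use sudo here, even though running the container itself won't need root:
$ sudo chown -R myuser:myuser container
Seen from inside the user namespace that -r creates, those files then appear to be owned by root, which is what the container expects.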
This provides a way to run a container with no root
privileges at all -- we don't need to become root
,
and we don't need any files to be owned by root (although they
appear to be owned by root
from inside the container).
That you're running as root
in the container is, essentially,
an illusion -- but one that can be very useful.
So we have a truly "rootless" container, of the kind that podman can create.
Unfortunately, there are some limitations to this technique.
In particular, the unshare
command-line utility
is not sophisticated enough to set up the kind of user and
group mappings that would be needed for a genuinely useful
container. The manual page describes the "-r" switch as
"merely a convenience feature", which it is. It doesn't take
much C code to set up more useful mappings, but I promised at the
outset that we would set up a container using command-line tools
only.
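For the curious, the mapping itself is visible in /proc/<pid>/uid_map (and gid_map). Each line holds three numbers -- the first ID inside the namespace, the first ID outside, and the length of the range -- and -r writes a single-line map (the host user ID 1000 below is illustrative):
$ unshare -r cat /proc/self/uid_map
         0       1000          1
The multi-line mappings that a genuinely useful container needs are what newuidmap and newgidmap (from the shadow package) set up, drawing on the ranges assigned to a user in /etc/subuid and /etc/subgid -- which is essentially what podman does for its rootless containers.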
If you run a container as root
, and without a user namespace,
then root
inside the container is root
in the host. The other private
namespaces and the chroot
sandbox limit the harm that a
rogue container could do, but there's still the option for the container
to operate directly on devices in /dev
, for example.
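To make that concrete: a root process inside the container could read -- or overwrite -- a raw disk device directly, with something like the following (the device name is illustrative, and please don't experiment with the write direction on a machine you care about):
# dd if=/dev/sda of=/dev/null bs=512 count=1
Because start.sh mounts the host's devtmpfs, the device nodes inside the container are the real ones.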
The way to prevent this kind of problem is for the container to switch
to an unprivileged user as early in the start-up process as
possible -- which is what my start.sh
script does.
Unfortunately, user namespaces are not without security problems of their own -- not because of flaws in the implementation, but because they can lead developers into a false sense of security.