Container from scratch: Using unshare to provide private namespaces
In the previous article in this series I demonstrated how to sandbox the container's filesystem
using chroot
. I explained how to use a minimal Linux
distribution -- Alpine -- in a way that is analogous to an 'image'
in established container technologies.
The demonstration in this article builds directly on the previous one;
if you're interested in following along, please bear this in mind.
If you don't have the container
directory and scripts
from the previous article, none of what follows will make sense.
Background
Modern Linux kernels support process namespaces. A namespace is the set of kernel entities of a particular type that a particular process can see. These entities include:
the process list
network interfaces, IP numbers, routes, and so on
filesystem mount points
network identity -- host and domain names
There are other types of namespace, but these are the ones
that will be manipulated in the demonstrations that follow.
man unshare
should give you the full list.
Although, by default, all processes run in the same namespace,
it's possible to provide any process with one or more private namespaces.
This is what the unshare
utility does.
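A quick way to see which namespaces a process belongs to is to look at the symbolic links under /proc/<pid>/ns; two processes that share a namespace of a given type show the same inode number there. For example (the inode number below is illustrative):
$ readlink /proc/$$/ns/uts /proc/self/ns/uts
uts:[4026531838]
uts:[4026531838]
Here $$ is the shell and /proc/self is the readlink process itself -- two different processes sharing, as we'd expect by default, the same UTS (network identity) namespace.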
Demonstration
In this demonstration, I will show how to isolate the container's network identity, process table, and mount points, and make them independent of the host. Networking in general is something that requires a separate article, because it's somewhat more complicated.
In the previous article I described how to set up the container's filesystem, and to run a shell inside the container using a script:
# chroot container /bin/start.sh
Providing the container with private network identity, mount
points, and process table is very straightforward; just run it
under the control of unshare
, like this:
# unshare -mpfu chroot container /bin/start.sh
The -f
("fork") option is crucial here. This option causes unshare
to fork, and to run the command it is given (the chroot
that starts start.sh
in this case) in a child process; it is that child which gets the new
process namespace, and becomes process ID 1 in the container.
If this isn't done,
we end up with two processes both trying to be process 1 in the
container, and a spiteful and confusing error message (try it).
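If you want to convince yourself that the namespaces really are separate, compare the namespace links in /proc from inside and outside the container. start.sh mounts /proc in the container, so this works in both places; the inode numbers below are illustrative. Inside the container:
$ readlink /proc/$$/ns/pid
pid:[4026532633]
And on the host:
$ readlink /proc/$$/ns/pid
pid:[4026531836]
Different inode numbers mean different namespaces.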
Inside the container, run ps
:
$ ps
PID   USER     TIME  COMMAND
    1 myuser    0:00 -sh
There's a single process running, and -- by definition -- its process ID is '1'. You might be wondering what the process ID really is, that is, what it is outside the container. It's not very easy to tell with a process called "sh", because there could be dozens of processes called "sh" outside the container as well as the one inside. So let's run a process in the container that will be easier to find.
$ nc -l &
$ ps -ef | grep nc
   11 myuser    0:00 nc -l
nc
is a utility for general network testing -- we'll
need this later when discussing network configuration. For now,
note that the nc
process has ID 11 in the container
(on this occasion).
Now look for the nc
process from the host
(that is, outside the container shell):
$ ps -ef | grep "nc -l"
2000      2501  2381  0 14:55 pts/1    00:00:00 nc -l
There's nothing particularly remarkable here -- processes in the container do have their counterparts in the global process table: we'd hardly expect it to be otherwise.
The process ID in the host's namespace is 2501 in this case,
and the owner of the process
is user 2000. There is no user 2000 in the host's /etc/passwd
-- this number only maps onto a name in the container. We've already
seen that the container renders the username correctly; in the host
we just see the numeric user ID.
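If you want to confirm that host process 2501 and container process 11 really are one and the same, the NSpid line in the host's /proc entry for the process lists its PID in each nested PID namespace (the PIDs here are the ones from the run above; you'll need a reasonably recent kernel for this field to be present):
$ grep NSpid /proc/2501/status
NSpid:  2501    11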
So that's processes. Now let's look at network identity.
Exit the container shell, and modify
the start.sh
script so that it sets the
hostname before executing the shell (or anywhere else -- it
doesn't matter). That is, add something like:
hostname mycontainer
'mycontainer' isn't an imaginative hostname, but it will do for purposes of demonstration.
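For context, here is a sketch of what the modified start.sh might look like. The exact contents come from the previous article -- the mounts, and the switch to an unprivileged user ('myuser' here) -- so treat this as illustrative and adapt it to whatever you actually have:
#!/bin/sh
# Mount the kernel pseudo-filesystems the container expects
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
# The new line: give the container its own hostname
hostname mycontainer
# Drop root privileges and start an interactive shell
exec su myuser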
Now start the container shell again, and check the hostname. It should be 'mycontainer' -- but the host system's hostname is unaffected.
It's worth repeating that, although unshare
has provided
the container with a separate network identity, it is still using
the network interfaces from the host. Running ip addr list
will make that clear. Although we've set the container's hostname, that
hostname doesn't map onto a network interface, so it isn't good for much.
For proper network isolation we will need to
tackle the much more complicated subject of virtual networking
-- later.
Finally, let's look at filesystem mounts.
$ mount
dev on /dev type devtmpfs (rw,seclabel,relatime,size=16132272k,...)
sys on /sys type sysfs (rw,seclabel,relatime)
proc on /proc type proc (rw,relatime)
Note that the only mounts visible are the ones made by the
start.sh
script. The host's mounts -- of which there
are probably many more -- are not visible.
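The isolation works in the other direction, too. Because unshare gave the container a private copy of the mount table (with private propagation, the util-linux default), mounts made on the host after the container has started don't appear inside it -- even mounts made underneath the container directory itself. For example, on the host (the path is illustrative):
# mount -t tmpfs tmpfs container/mnt
Inside the container, mount shows no trace of this, and /mnt remains empty; with a plain chroot and no mount namespace, the tmpfs would have been visible at /mnt.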
Some things to watch out for
With a combination of unshare
and chroot
we
can get much of what is required from a container -- a private, sandboxed
filesystem, and container-specific namespaces. It's possible, however,
to overestimate the level of isolation provided by these techniques --
quite apart from sharing network interfaces with the host, a subject we haven't even broached yet.
In the end, a container isn't a virtual machine -- it's just a bunch of processes linked by particular cgroups and namespaces. This means, for example, that you can't get container-specific measures of CPU or I/O load, even though these figures do exist for individual processes.
For a specific example, try running cat /proc/loadavg
from within the container. You'll see exactly the same values
as you would if you ran this command outside the container -- even
though the container's processes may be contributing little
or nothing to the overall load.
This observation does not result from any limitation of the method we're using to create the container -- you'll see exactly the same thing if you log into a Docker or podman container and do the same test. This is no surprise, since these frameworks are using exactly the same technologies. What's different, though, is that when we build a container from scratch, it's much clearer what the limitations of the techniques really are.
A note on rootless containers
Until now, we've been running the container as root
.
I said in an earlier article that it was necessary to be
a privileged user to run chroot
, but that's not
strictly true. We can run chroot
as an ordinary user
by making use of user namespaces.
'User' is one of the other namespace classes that
unshare
can create, and which I haven't discussed
until now. In a user namespace, user and group IDs in the
container are mapped onto different user IDs and groups in the
host. If we use the -r
switch to unshare,
this sets up a mapping from the user running the utility to
user ID 0 inside the new namespace. With an apparent user ID of 0, chroot
will succeed.
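You can see the effect of -r on its own, without the container, by starting a shell in a new user namespace (the exact id output may vary from system to system):
$ unshare -r
# id
uid=0(root) gid=0(root) groups=0(root)
# exit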
So try this:
$ unshare -Urmupf chroot container /bin/start.sh
mycontainer:/#
You'll get a bunch of error messages because, seen from inside
the new user namespace, the files in the container
directory do not belong to the user running the shell.
You can fix this by setting the ownership of all the files in the
container
directory to that of the current (unprivileged)
user.
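For example, if the account that will run unshare is 'myuser' (an assumption -- use whichever account you actually work as), a one-off chown on the host will do it. This is a set-up step, so it's fine to use sudo here, even though running the container itself won't need root:
$ sudo chown -R myuser:myuser container
Seen from inside the user namespace that -r creates, those files then appear to be owned by root, which is what the container expects.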
This provides a way to run a container with no root
privileges at all -- we don't need to become root
,
and we don't need any files to be owned by root (although they
appear to be owned by root
from inside the container).
That you're running as root
in the container is, essentially,
an illusion -- but one that can be very useful.
So we have a truly "rootless" container, of the kind that podman can create.
Unfortunately, there are some limitations to this technique.
In particular, the unshare
command-line utility
is not sophisticated enough to set up the kind of user and
group mappings that would be needed for a genuinely useful
container. The manual page describes the "-r" switch as
"merely a convenience feature", which it is. It doesn't take
much C code to set up more useful mappings, but I promised at the
outset that we would set up a container using command-line tools
only.
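For the curious, the mapping itself is visible in /proc/<pid>/uid_map (and gid_map). Each line holds three numbers -- the first ID inside the namespace, the first ID outside, and the length of the range -- and -r writes a single-line map (the host user ID 1000 below is illustrative):
$ unshare -r cat /proc/self/uid_map
         0       1000          1
The multi-line mappings that a genuinely useful container needs are what newuidmap and newgidmap (from the shadow package) set up, drawing on the ranges assigned to a user in /etc/subuid and /etc/subgid -- which is essentially what podman does for its rootless containers.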
If you run a container as root
, and without a user namespace,
then root
inside the container is root
in the host. The other private
namespaces and the chroot
sandbox limit the harm that a
rogue container could do, but there's still the option for the container
to operate directly on devices in /dev
, for example.
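To make that concrete: a root process inside the container could read -- or overwrite -- a raw disk device directly, with something like the following (the device name is illustrative, and please don't experiment with the write direction on a machine you care about):
# dd if=/dev/sda of=/dev/null bs=512 count=1
Because start.sh mounts the host's devtmpfs, the device nodes inside the container are the real ones.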
The way to prevent this kind of problem is for the container to switch
to an unprivileged user as early in the start-up process as
possible -- which is what my start.sh
script does.
Unfortunately, user namespaces are not without security problems of their own -- not because of flaws in the implementation, but because they can lead developers into a false sense of security.