“There is no such thing as a Docker container!”
– “Stabby” Kilgore Trout to Gaius Julius Caesar, March 15, 44 B.C.E.
This is the second installment in a riveting series. Be sure to have read the first part, which covers the uts, pid and mnt namespaces!
There are more namespaces than just the ones we’re looking at in this series.
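For a quick look at all of them, every namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns, and the inode number in each link target is what uniquely identifies the namespace (the values below are illustrative):
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 net -> net:[4026532008]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 pid -> pid:[4026531836]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 time -> time:[4026531834]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 user -> user:[4026531837]
lrwxrwxrwx 1 btoll btoll 0 Aug  9 17:52 uts -> uts:[4026531838]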
Namespaces
Network
Unsharing the net network namespace allows a process to have its own IPv4 and IPv6 stacks, network links, firewall rules and IP routing tables (among others).
Let’s look at the difference between sharing, or inheriting, the net namespace from the parent process and unsharing it.
Sharing
In the absence of the --net option to unshare, the bash program running in the forked process below will inherit the net namespace from its parent, and we can see this by listing out the process’s ns directory in /proc:
# On the host.
$ unshare bash
# In the container process.
$ ls -l /proc/$$/ns | ag net
lrwxrwxrwx 1 btoll btoll 0 Aug 9 17:52 net -> net:[4026532008]
Then, on the host, we demonstrate that pid 1 (systemd on my Debian bullseye distro) indeed has the same net namespace, which the container process inherited:
# On the host, where PID 1 is `systemd`.
$ sudo ls -l /proc/1/exe
lrwxrwxrwx 1 root root 0 Aug 2 15:21 /proc/1/exe -> /lib/systemd/systemd
$ sudo ls -l /proc/1/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 7 20:19 net -> net:[4026532008]
Further, back in the container process, since it inherited the same net namespace, we can show that the new process sees all of the same network interfaces as the host and accesses the same routing table:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 80:fa:5b:53:fb:82 brd ff:ff:ff:ff:ff:ff
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e4:70:b8:b4:22:a6 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.10/24 brd 192.168.1.255 scope global dynamic noprefixroute wlp3s0
valid_lft 196879sec preferred_lft 196879sec
inet6 fe80::2308:ab5:dc8:cdae/64 scope link noprefixroute
valid_lft forever preferred_lft forever
$
$ ip r
default via 192.168.1.1 dev wlp3s0 proto dhcp metric 600
169.254.0.0/16 dev wlp3s0 scope link metric 1000
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.1.0/24 dev wlp3s0 proto kernel scope link src 192.168.1.10 metric 600
Unsharing
Ok, now let’s create a process with its own net namespace. Immediately, we can see that the net namespaces are not the same, using the same steps as before:
# On the host.
$ sudo unshare --net bash
# In the container process.
# ls -l /proc/$$/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 9 17:58 net -> net:[4026533295]
# Again, on the host, where PID 1 is `systemd`.
$ sudo ls -l /proc/1/exe
lrwxrwxrwx 1 root root 0 Aug 2 15:21 /proc/1/exe -> /lib/systemd/systemd
$ sudo ls -l /proc/1/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 7 20:19 net -> net:[4026532008]
Further, back in the container process, we can show that the new process has only a loopback interface and no routing table information:
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# ip route
Error: ipv4: FIB table does not exist.
Dump terminated
Note that not giving a command to unshare will result in it opening a shell by default (determined by the value of the SHELL environment variable). We could have omitted the bash program name from all of the above examples.
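In other words, with SHELL set to /bin/bash, these two invocations are equivalent:
$ unshare --net
$ unshare --net bash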
Weeeeeeeeeeeeeeeeeeeeeeeeeeee
Connectivity
Let’s now establish network connectivity between the host and the container process by creating two virtual Ethernet (veth) interfaces.
Conceptually, we can think of this as a cable that connects the default net network namespace with the new net network namespace of the container.
Here are some characteristics of veth devices (from the manpage):
- they can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices
- they are always created in interconnected pairs, such as:
$ sudo ip link add p1-name type veth peer name p2-name
where p1-name and p2-name are the names assigned to the two connected end points
- packets transmitted on one device in the pair are immediately received on the other device
- when either device is down, the link state of the pair is down (demonstrated in the sketch after this list)
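The last point is easy to see with a throwaway pair that lives entirely in the default namespace (vethA and vethB are hypothetical names, and the output below is illustrative):
# Create and enable a test pair.
$ sudo ip link add vethA type veth peer name vethB
$ sudo ip link set vethA up
$ sudo ip link set vethB up
# Take one end down; its peer immediately loses carrier.
$ sudo ip link set vethB down
$ ip link show vethA
5: vethA@vethB: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether 32:41:12:ab:cd:ef brd ff:ff:ff:ff:ff:ff
# Deleting either end removes the whole pair.
$ sudo ip link del vethA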
Anyone with an interest in container networking should pay particular attention to these little fellas.
veth devices have a particularly interesting use case: placing one end of a veth pair in one network namespace and the other end in another network namespace allows for communication between network namespaces. Here is another example (using the netns parameter):
$ sudo ip link add p1-name netns p1-ns type veth peer p2-name netns p2-ns
We’ll start by creating the new process with its own unshared net network namespace:
$ sudo unshare --net bash
Again, we can see that the new process has its own net namespace that is distinct from the host’s. This time, we’ll show the difference by using the lsns command with custom columns:
# lsns --type net -o NS,PID,COMMAND | ag "systemd|bash"
4026532008 1 /lib/systemd/systemd --system --deserialize 18
4026532801 2561438 bash
As we can see from the column options passed to the output parameter (-o), the first column is the net namespace, the second the process ID and the third the command that created the process (notably, the first column has different values). If there were no accessible namespaces, the result would be empty.
You could also list all of the namespaces of a process by passing the pid to the -p option:
# lsns -p $$
NS TYPE NPROCS PID USER COMMAND
4026531834 time 99 1 root /sbin/init
4026531835 cgroup 99 1 root /sbin/init
4026531836 pid 99 1 root /sbin/init
4026531837 user 99 1 root /sbin/init
4026531838 uts 97 1 root /sbin/init
4026531839 ipc 99 1 root /sbin/init
4026531840 mnt 94 1 root /sbin/init
4026532160 net 2 2561438 root bash
Let’s look at the networking information. Note that there are no entries in the container’s routing table yet, and the only device is the loopback:
# ip route
Error: ipv4: FIB table does not exist.
Dump terminated
#
# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Let’s now connect the new net namespace to the default net namespace. Run the following command on the host, first getting the pid of the new process, which we’ll need in order to create its virtual network interface (we can get it inside the container by echoing the special Bash parameter $$):
# echo $$
2561438
$ sudo ip link add ve1 netns 2561438 type veth peer name ve2 netns 1
Let’s break that down like a hip beat:
- We’re adding a new virtual Ethernet interface called ve1 and binding it to the process with ID 2561438. Note that we could have called it anything, it doesn’t have to be ve1. It could have been called poop1.
- The type veth is a virtual Ethernet interface.
- The peer keyword means that we’re joining the two new interfaces together.
- We’re adding a second virtual Ethernet interface called ve2 and binding it to the process with ID 1. Again, we could have called it anything, it doesn’t have to be ve2. It could have been called poop2.
In the container, we can now see that the new virtual Ethernet device has indeed been added:
# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 26:c2:b6:f3:aa:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0
And we’ll bring it up:
# ip link set ve1 up
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
link/ether 26:c2:b6:f3:aa:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0
We’ll do the same on the host:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
...
3745: ve2@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
$
$ sudo ip link set ve2 up
$ ip a
...
3745: ve2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet6 fe80::9cea:69ff:fe72:e31e/64 scope link
valid_lft forever preferred_lft forever
Bringing an interface UP means to enable it. What does LOWER_UP mean, then? It is a physical-layer link flag: LOWER_UP indicates that a cable is plugged in and that the device is connected to the network. That is, it can send and receive information over its physical medium, be it electricity, light or radio waves.
Of course, in order to send traffic between the devices, both need to be assigned an IP address on the same network.
First, in the container:
# ip addr add 192.168.1.100/24 dev ve1
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3745: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:93:3b:e3:8f:32 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.1.100/24 scope global ve1
valid_lft forever preferred_lft forever
inet6 fe80::1c93:3bff:fee3:8f32/64 scope link
valid_lft forever preferred_lft forever
As soon as the IP address is associated, a route is added to the container’s routing table (since it’s using CIDR notation, the kernel automatically adds a routing entry):
# ip route
192.168.1.0/24 dev ve1 proto kernel scope link src 192.168.1.100
And on the host:
$ sudo ip addr add 192.168.1.200/24 dev ve2
$ ip a
...
3745: ve2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet 192.168.1.200/24 scope global ve2
valid_lft forever preferred_lft forever
inet6 fe80::9cea:69ff:fe72:e31e/64 scope link
valid_lft forever preferred_lft forever
Likewise, when the IP address was associated with the host’s virtual Ethernet device, a new route was added to its routing table (again, because CIDR notation was used):
$ ip route
default via 192.168.1.1 dev wlp3s0 proto dhcp metric 20600
...
192.168.1.0/24 dev ve2 proto kernel scope link src 192.168.1.200
192.168.1.0/24 dev wlp3s0 proto kernel scope link src 192.168.1.10 metric 600
Let’s test it!
In the container:
# ping -c4 192.168.1.200
PING 192.168.1.200 (192.168.1.200) 56(84) bytes of data.
64 bytes from 192.168.1.200: icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from 192.168.1.200: icmp_seq=2 ttl=64 time=0.095 ms
64 bytes from 192.168.1.200: icmp_seq=3 ttl=64 time=0.096 ms
64 bytes from 192.168.1.200: icmp_seq=4 ttl=64 time=0.094 ms
--- 192.168.1.200 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3067ms
rtt min/avg/max/mdev = 0.094/0.095/0.097/0.001 ms
On the host:
$ ping -c4 192.168.1.100
PING 192.168.1.100 (192.168.1.100) 56(84) bytes of data.
64 bytes from 192.168.1.100: icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from 192.168.1.100: icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from 192.168.1.100: icmp_seq=3 ttl=64 time=0.067 ms
64 bytes from 192.168.1.100: icmp_seq=4 ttl=64 time=0.068 ms
--- 192.168.1.100 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3060ms
rtt min/avg/max/mdev = 0.067/0.072/0.081/0.005 ms
Weeeeeeeeeeeeeeeeeeeeeeee
At this point, the container can only send traffic to addresses in the 192.168.1.0/24 range.
Note that the veth device and the route in the routing table will both be removed automatically from the host when the container is exited.
If the two processes can’t communicate, i.e., the pings don’t work, make sure that the new network you configured for the container process doesn’t conflict with any others.
User
The user namespace is the only one that can be created by a non-privileged user. This allows for running rootless containers, which greatly mitigates one of the best-known security implications of containers: running containers as root. Not needing to run containers as a privileged user is a good security practice, but unfortunately not all popular container runtimes allow for this.
Does this mean that a container shouldn’t ever have a root user? No, of course not. Containers need to have a privileged user to do whatever nefarious things containers do.
The critical difference is that you don’t want the root user in a container also mapping to (i.e., running as) the root user on the host. This is very bad, because if the root user breaks out of the container and the user namespace, then they are also root on the host.
This would mean, to put it mildly, that you would be fucked. Remember that the host can see everything that runs on it, including containers? The attacker could then do whatever they wanted, and most likely, do it gleefully.
Sadly, most people who run containers use Docker, and Docker was not built with security in mind. It was an afterthought. Maybe a Docker Captain can tell you about it someday.
So, what does one do? Curse the decision to promote Docker? Well, yes. Alias docker to podman? Also, yes. But, further, and critically, use the user namespace to map the root user in the container to a non-privileged account on the host. That way, if an attacker breaks out of the container, the worst they can do is delete the poems in your home directory.
In addition, the effective user on the host can have greater capabilities inside the container, where it runs as root.
Let’s see how that mapping is done.
Rootless Containers
Let’s create a process that inherits all of its parent’s namespaces and check out the user information:
$ unshare bash
$ id
uid=1000(btoll) gid=1000(btoll) groups=1000(btoll)
Ok, it’s running in the same user namespace as its parent (the PPID, or parent process ID) and has inherited the effective user that created the child process.
How about when creating the child process as a privileged user?
$ sudo unshare bash
# id
uid=0(root) gid=0(root) groups=0(root)
That’s interesting and to be expected. Let’s now create a process with an unshared user namespace:
$ unshare --user bash
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
Ok, the nobody user. Let’s confirm that the process has a user namespace distinct from the host’s:
On the host:
$ lsns -t user | ag "systemd|bash"
4026531837 user 57 1881 btoll /lib/systemd/systemd --user
4026533574 user 1 2713100 btoll bash
Why can’t we run lsns as an unprivileged user in the container, as before? Great question, Eagle Eye. Since we’re now using an unshared user namespace, the user in the container is nobody, not btoll, the effective user that created the process. So, the nobody user can only see its own user namespace, and the btoll user doesn’t exist in that namespace.
Ok, let’s now do the mapping.
The /proc/PID/uid_map and /proc/PID/gid_map files are the kernel interfaces for mapping the user and group IDs, respectively, of each process.
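For reference, a process in the initial user namespace carries the identity mapping, which covers the entire range of IDs:
# On the host, outside of any new user namespace.
$ cat /proc/self/uid_map
         0          0 4294967295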
First, we need to get the pid of the “container” (really, the bash process):
In the container:
$ echo $$
2713100
On the host:
$ sudo echo "0 1000 1" >> /proc/2713100/uid_map
These special /proc files can only be written to once. (We pipe through sudo tee because a redirection like sudo echo ... >> file would be performed by the unprivileged shell, not by sudo.)
Let’s break that down (0 1000 1):
- (0) The first number is the start of the range of user IDs that will be used in the container.
- (1000) The second number is the start of the range of user IDs on the host that they will be mapped to.
- (1) The last number is the number of user IDs (the length of the range) that will be mapped in the container. Only one user ID is given, since we only expect one user for this container. That’s a good thing.
In plain English: “Map the effective user on the host, with user ID 1000, to user ID 0 in the container, and only allocate one user ID.”
The end result is that my btoll (1000) user on the host is now seen as root in the new user namespace in the container.
After running the mapping command above, we see that the mapping has taken effect in the container:
$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
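Notice that the group still reads nogroup: we only wrote uid_map. The group mapping works the same way through /proc/PID/gid_map, with one wrinkle from user_namespaces(7): an unprivileged writer must first write “deny” to /proc/PID/setgroups (written as root from the host, as below, the deny isn’t strictly required, but it’s harmless). A sketch, reusing the same pid:
$ echo deny | sudo tee /proc/2713100/setgroups
deny
$ echo "0 1000 1" | sudo tee /proc/2713100/gid_map
0 1000 1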
The prompt may still say nobody, but this is expected, since the shell init scripts like .bash_profile haven’t been run again. Rest assured, though, the user is the privileged root user in the container.
After having gone through all of those contortions, writing to /proc/PID/uid_map after the container was created in order to set up the root user mapping, let’s now look at a very simple way to do it with a single switch to the unshare command:
$ unshare --map-root-user bash
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/$$/uid_map
0 1000 1
Of course, the --map-root-user switch implies the creation of a new user namespace:
In the container:
# lsns -t user
NS TYPE NPROCS PID USER COMMAND
4026533651 user 2 2935027 root bash
On the host:
$ sudo ls -l /proc/1/ns | ag user
lrwxrwxrwx 1 root root 0 Dec 17 16:42 user -> user:[4026531837]
Weeeeeeeeeeeeeeeeeeeee
Lastly, let’s prove to ourselves that this is indeed a rootless container. In other words, let’s show that, although the container is running as the root user, it actually maps to a non-privileged user on the host:
In the container:
$ unshare --map-root-user bash
# sleep 12345 &
[1] 2945562
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# whoami
root
# touch fooby
# ls -li fooby
6038245 -rw-rw-r-- 1 root root 0 Dec 17 17:20 fooby
On the host:
$ ps -ft5
UID PID PPID C STIME TTY TIME CMD
btoll 2690597 2294 0 Dec16 pts/5 00:00:00 -bash
btoll 2944466 2690597 0 17:14 pts/5 00:00:00 bash
btoll 2945562 2944466 0 17:16 pts/5 00:00:00 sleep 12345
$ ls -li /home/btoll/fooby
6038245 -rw-rw-r-- 1 btoll btoll 0 Dec 17 17:20 /home/btoll/fooby
Told you so.
Capabilities
Note that the capabilities may be augmented depending on the mapping. Below we see an example of a process not being able to create a mnt namespace because the effective user does not have the correct permissions:
$ unshare --mount sh
unshare: unshare failed: Operation not permitted
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
However, we can give the user in that container escalated privileges by running the now-familiar mapping command on the host, which maps the non-privileged user on the host to the root user in the container. This will allow us to do what we want. Remember, if the root user does find a way to break out of the user namespace, the damage will be limited to only what the btoll user is permitted to do on the host.
And, although dangerously handsome and charming, the btoll user won’t be able to do any effective damage on the host due to lack of permissions.
Again, this will look like the following, if the new container process has the PID 2713100:
$ echo "0 1000 1" | sudo tee /proc/2713100/uid_map
0 1000 1
Back in the container, we see that the user is now root and has the capabilities needed to unshare any other namespace:
$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
$ unshare --mount sh
\u@\h:\w$
Let’s wrap up this section by looking at the capabilities for a rootless container:
On the host:
$ capsh --print | ag "Current|uid"
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
secure-no-suid-fixup: no (unlocked)
uid=1000(btoll) euid=1000(btoll)
In the container:
# capsh --print | ag "Current|uid"
Current: =ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
secure-no-suid-fixup: no (unlocked)
uid=0(root) euid=0(root)
#
# cat /proc/$$/uid_map
0 1000 1
Note that the privileges have been escalated in the container, and the last cat command shows that root in the container does indeed map to btoll (1000) on the host.
Weeeeeeeeeeeeeeeeeeeeeeeeeee
/etc/sub{u,g}id
You may be thinking to yourself, “Hey, the format of /proc/PID/uid_map looks suspiciously like the /etc/subuid and /etc/subgid files that I’ve configured to map users when using Docker,” or something to that effect.
And you’d be right. Unfortunately, I haven’t been able to ascertain the provenance of those files.
Are they an implementation detail of an OCI spec? Do they predate the burgeoning popularity and ubiquity of containers? After all, the files are listed in the FILES section of the useradd man page, and the shadow-utils newuidmap and newgidmap commands use them as well.
Does the Shadow know?
I think it’s safe to say that, regardless of their origin, these files allow for an easier way to map users within a user namespace than what we’ve seen in the examples above.
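To make the connection concrete, here’s the shape of an /etc/subuid entry and the shadow-utils helper that consumes it (the btoll range below is just an example allocation, and the pid is the hypothetical container process from earlier):
# Format: user:first-subordinate-id:count.
$ cat /etc/subuid
btoll:100000:65536
# newuidmap(1) validates the requested ranges against /etc/subuid and
# writes them to /proc/PID/uid_map on behalf of a non-privileged user:
# map container root -> host 1000, container uids 1+ -> the range above.
$ newuidmap 2713100 0 1000 1 1 100000 65536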
nsswitch.conf
Interestingly, I stumbled across the subuid(5) man page when trying to find the provenance of the /etc/sub{u,g}id files, and it states the following:
The delegation of the subordinate uids can be configured via the subid field in /etc/nsswitch.conf file. Only one value can be set as the delegation source. Setting this field to files configures the delegation of uids to /etc/subuid.
I have not tested this, but this would be a great area to explore.
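For reference, the configuration the man page describes would be a single line (equally untested here):
# /etc/nsswitch.conf
subid: files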
Summary
Um.