“There is no such thing as a Docker container!”
– Kilgore Trout to Gaius Julius Caesar, March 15, 44 B.C.E.
This is the second installment in a riveting series. Be sure to have read the first part, which covers the `uts`, `pid` and `mnt` namespaces!
There are more namespaces than just the ones we’re looking at in this series.
Namespaces
Network
Unsharing the `net` network namespace allows the process to have its own view of network interface cards and routing tables.
Let’s look at the difference between sharing, or inheriting, the `net` namespace from the parent process and unsharing it.
Sharing
Obviously, in the absence of the `--net` option to `unshare`, the `bash` program running in the forked process will inherit the `net` namespace from its parent, and we can see this by listing out the process’s `ns` directory in `/proc`:
# On the host.
$ unshare bash
# In the container process.
$ ls -l /proc/$$/ns | ag net
lrwxrwxrwx 1 btoll btoll 0 Aug 9 17:52 net -> net:[4026532008]
Next, we demonstrate on the host that PID 1 (`systemd`) indeed has the same `net` namespace, which the container process inherited through its parent.
# On the host, where PID 1 is `systemd`.
$ sudo ls -l /proc/1/exe
lrwxrwxrwx 1 root root 0 Aug 2 15:21 /proc/1/exe -> /lib/systemd/systemd
$ sudo ls -l /proc/1/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 7 20:19 net -> net:[4026532008]
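If eyeballing inode numbers isn’t your thing, the comparison can be scripted. A minimal sketch, run on the host, where `CONTAINER_PID` is a stand-in for the PID of the forked `bash` process:

$ [ "$(sudo readlink /proc/1/ns/net)" = "$(sudo readlink /proc/$CONTAINER_PID/ns/net)" ] \
    && echo "same net namespace" \
    || echo "different net namespaces"
same net namespace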
Back in the container process, we can show that the new process sees all of the same network interfaces as the host and accesses the same routing table, since it inherited the same `net` namespace:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 80:fa:5b:53:fb:82 brd ff:ff:ff:ff:ff:ff
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e4:70:b8:b4:22:a6 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.10/24 brd 192.168.1.255 scope global dynamic noprefixroute wlp3s0
valid_lft 196879sec preferred_lft 196879sec
inet6 fe80::2308:ab5:dc8:cdae/64 scope link noprefixroute
valid_lft forever preferred_lft forever
$
$ ip route
default via 192.168.1.1 dev wlp3s0 proto dhcp metric 600
169.254.0.0/16 dev wlp3s0 scope link metric 1000
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.1.0/24 dev wlp3s0 proto kernel scope link src 192.168.1.10 metric 600
Unsharing
Now, when the process creates its own `net` namespace, we can see that the two `net` namespaces are not the same:
# On the host.
$ sudo unshare --net bash
# In the container process.
root@kilgore-trout:/home/btoll# ls -l /proc/$$/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 9 17:58 net -> net:[4026533295]
# Again, on the host, where PID 1 is `systemd`.
$ sudo ls -l /proc/1/exe
lrwxrwxrwx 1 root root 0 Aug 2 15:21 /proc/1/exe -> /lib/systemd/systemd
$ sudo ls -l /proc/1/ns | ag net
lrwxrwxrwx 1 root root 0 Aug 7 20:19 net -> net:[4026532008]
Back in the container process, we can show that the new process only has a `loopback` interface and no routing table information:
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# ip route
Error: ipv4: FIB table does not exist.
Dump terminated
Weeeeeeeeeeeeeeeeeeeeeeeeeeee
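One more observation before moving on: the `loopback` device in the new `net` namespace isn’t just lonely, it’s DOWN. If you want to talk to 127.0.0.1 inside the container, you’d have to bring it up yourself. A quick sketch:

# In the container.
# ip link set lo up
# ping -c1 127.0.0.1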
Connectivity
Let’s now establish network connectivity between the host and the container process by creating two virtual Ethernet interfaces.
Conceptually, we can think of this as a cable that connects the default `net` namespace with the new `net` namespace of the container.
We’ll start by creating the new process with its own unshared `net` namespace:
$ sudo unshare --net bash
Right away, we can see that the new process has its own `net` namespace that is distinct from the host’s:
# lsns --type net -o NS,PID,COMMAND | ag "systemd|bash"
4026532008 1 /lib/systemd/systemd --system --deserialize 18
4026532801 2561438 bash
As we can see from the column options passed to the output parameter (`-o`), the first column is the `net` namespace, the second the process ID, and the third the command that created the process.
We’ll need the PID of the new process in order to create its virtual network interface. Note that we can also get it inside the container by echoing the current process ID using a special Bash parameter:
# echo $$
2561438
The previous commands (`lsns` and `echo`) were run in the container, but they could have also been run on the host. Also, note that the `lsns` command limits the output to only the namespace (NS), process ID (PID) and command (COMMAND) columns.
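If you’d rather not eyeball the PID, you can capture it on the host. A minimal sketch, where `CPID` is a variable of my own invention (this assumes the only `bash` in a new `net` namespace is ours):

# On the host.
$ CPID=$(sudo lsns --type net -o NS,PID,COMMAND | awk '/ bash$/ {print $2; exit}')
$ echo "$CPID"
2561438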
Note that there are no entries in the container’s routing table yet, and the only device is the `loopback`:
# ip route
Error: ipv4: FIB table does not exist.
Dump terminated
#
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Ok, that’s great. Let’s now connect the new `net` namespace to the default `net` namespace:
$ sudo ip link add ve1 netns 2561438 type veth peer name ve2 netns 1
Let’s break that down like a hip beat:
- We’re adding a new virtual Ethernet interface called `ve1` and binding it to the process with ID 2561438.
  - Note that we could have called this anything; it doesn’t have to be `ve1`. It could have been called `poop1`.
- The type `veth` is a virtual Ethernet interface.
- The `peer` keyword means that we’re joining the two new interfaces together.
- We’re adding a second virtual Ethernet interface called `ve2` and binding it to the process with ID 1.
  - Note that we could have called this anything; it doesn’t have to be `ve2`. It could have been called `poop2`.
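As an aside, the same plumbing is often done with named network namespaces rather than PIDs. A rough equivalent, sketched but not run here, where demo is a made-up namespace name:

# On the host.
$ sudo ip netns add demo                         # create a named net namespace
$ sudo ip link add ve1 type veth peer name ve2   # both ends start in the default namespace
$ sudo ip link set ve1 netns demo                # move one end into the new namespace
$ sudo ip netns exec demo ip link set ve1 up     # configure it from the default namespace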
In the container, we can now see that the new virtual Ethernet device has indeed been added:
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3745: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1e:93:3b:e3:8f:32 brd ff:ff:ff:ff:ff:ff link-netnsid 0
And we’ll bring it up:
# ip link set ve1 up
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3745: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:93:3b:e3:8f:32 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::1c93:3bff:fee3:8f32/64 scope link
valid_lft forever preferred_lft forever
We’ll do the same on the host:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
...
3745: ve2@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
$
$ sudo ip link set ve2 up
$ ip a
...
3745: ve2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet6 fe80::9cea:69ff:fe72:e31e/64 scope link
valid_lft forever preferred_lft forever
Of course, in order to send traffic between the two devices, both need to be assigned an IP address on the same network.
First, in the container:
# ip addr add 192.168.1.100/24 dev ve1
root@kilgore-trout:/home/btoll# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3745: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 1e:93:3b:e3:8f:32 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.1.100/24 scope global ve1
valid_lft forever preferred_lft forever
inet6 fe80::1c93:3bff:fee3:8f32/64 scope link
valid_lft forever preferred_lft forever
As soon as the IP address is assigned, a route is added to the container’s routing table:
# ip route
192.168.1.0/24 dev ve1 proto kernel scope link src 192.168.1.100
And on the host:
$ sudo ip addr add 192.168.1.200/24 dev ve2
$ ip a
...
3745: ve2@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9e:ea:69:72:e3:1e brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet 192.168.1.200/24 scope global ve2
valid_lft forever preferred_lft forever
inet6 fe80::9cea:69ff:fe72:e31e/64 scope link
valid_lft forever preferred_lft forever
Likewise, when the IP address was assigned to the host’s virtual Ethernet device, a new route was added to its routing table:
$ ip route
default via 192.168.1.1 dev wlp3s0 proto dhcp metric 20600
...
192.168.1.0/24 dev ve2 proto kernel scope link src 192.168.1.200
192.168.1.0/24 dev wlp3s0 proto kernel scope link src 192.168.1.10 metric 600
Let’s test it!
In the container:
# ping -c4 192.168.1.200
PING 192.168.1.200 (192.168.1.200) 56(84) bytes of data.
64 bytes from 192.168.1.200: icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from 192.168.1.200: icmp_seq=2 ttl=64 time=0.095 ms
64 bytes from 192.168.1.200: icmp_seq=3 ttl=64 time=0.096 ms
64 bytes from 192.168.1.200: icmp_seq=4 ttl=64 time=0.094 ms
--- 192.168.1.200 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3067ms
rtt min/avg/max/mdev = 0.094/0.095/0.097/0.001 ms
On the host:
$ ping -c4 192.168.1.100
PING 192.168.1.100 (192.168.1.100) 56(84) bytes of data.
64 bytes from 192.168.1.100: icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from 192.168.1.100: icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from 192.168.1.100: icmp_seq=3 ttl=64 time=0.067 ms
64 bytes from 192.168.1.100: icmp_seq=4 ttl=64 time=0.068 ms
--- 192.168.1.100 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3060ms
rtt min/avg/max/mdev = 0.067/0.072/0.081/0.005 ms
Weeeeeeeeeeeeeeeeeeeeeeee
At this point, the container can only send traffic to addresses in the 192.168.1.0/24 range.
Note that the `veth` device and the route in the routing table will both be removed automatically from the host when the container exits.
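If you’d rather tear it down by hand, deleting either end of the pair removes both. A minimal sketch:

# On the host.
$ sudo ip link del ve2   # ve1 disappears from the container, too, along with the kernel route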
User
The `user` namespace is the only one that can be created by a non-privileged user. This allows for running rootless containers, which greatly mitigates one of the best-known security implications of containers: running containers as root. Not needing to run containers as a privileged user is a good security practice, but unfortunately not all popular container runtimes allow for this.
Does this mean that a container shouldn’t ever have a `root` user? No, of course not. Containers need to have a privileged user to do whatever nefarious things containers do.
The critical difference is that you don’t want the `root` user in a container also mapping to (i.e., running as) the `root` user on the host. This is very bad, because if the `root` user breaks out of the container and the `user` namespace, then they are also `root` on the host.
This would mean, to put it mildly, that you would be fucked. Remember that the host can see everything that runs on it, including containers? The attacker could then do whatever they wanted, and most likely, do it gleefully.
Sadly, most people who run containers use Docker, and Docker was not built with security in mind. It was an afterthought. Maybe a Docker Captain can tell you about it someday.
So, what does one do? Curse the decision to promote Docker? Well, yes. But also, critically, use the `user` namespace to map the `root` user in the container to a non-privileged account on the host. That way, if an attacker breaks out of the container, the worst they can do is delete the poems in your home directory.
In addition, the effective user on the host can have greater capabilities inside the container, where it runs as `root`.
Let’s see how that mapping is done.
Rootless Containers
Let’s create a process that inherits all of its parent’s namespaces and check out the user information:
$ unshare bash
$ id
uid=1000(btoll) gid=1000(btoll) groups=1000(btoll)
Ok, it’s running in the same `user` namespace as its parent process and has inherited the effective user that created the child process.
How about when creating the child process as a privileged user?
$ sudo unshare bash
# id
uid=0(root) gid=0(root) groups=0(root)
That’s interesting and to be expected. Let’s now create a process with an unshared `user` namespace:
$ unshare --user bash
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
Ok, the `nobody` user. Let’s confirm that the process has a `user` namespace distinct from the default.
In the container:
$ echo $$
2713100
On the host:
$ lsns -t user | ag "systemd|bash"
4026531837 user 57 1881 btoll /lib/systemd/systemd --user
4026533574 user 1 2768612 btoll bash
Why can’t we run `lsns` as an unprivileged user in the container, as before?
Great question, Eagle Eye. Since we’re now using an unshared `user` namespace, the user in the container is `nobody`, not `btoll`, the effective user that created the process.
So, the `nobody` user can only see its own `user` namespace, and the `btoll` user doesn’t exist in that namespace.
Ok, let’s now do the mapping.
The `/proc/PID/uid_map` and `/proc/PID/gid_map` files are the kernel interfaces for mapping the user and group IDs, respectively, of each process.
On the host:
$ sudo echo "0 1000 1" >> /proc/2713100/uid_map
Let’s break that down (`0 1000 1`):
- (0) The first number is the start of the range of the user IDs that will be used in the container.
- (1000) The second number is the start of the range of the user IDs that will be mapped to on the host.
- (1) The last number states the number of user IDs (the length of the range) that will be mapped in the container. Only one user ID is given, since we only expect one user for this container. That’s a good thing.
This example is doing the following, in plain English: “Map user ID 1000 on the host to user ID 0 (`root`) in the container, and only allocate one user ID.”
The end result is that my `btoll` (1000) user on the host is now seen as `root` in the new `user` namespace in the container.
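For contrast, a container that needs a full complement of users would map a longer range. A hypothetical uid_map entry, not taken from this article’s examples (100000 is a common starting offset for subordinate IDs):

0 100000 65536

This would map user IDs 0 through 65535 in the container to user IDs 100000 through 165535 on the host.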
These special `/proc` files can only be written to once.
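To convince yourself, try the write a second time. The kernel should refuse, although I’d hedge on the exact error message:

$ echo "0 1000 1" >> /proc/2713100/uid_map
bash: echo: write error: Operation not permitted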
After running the mapping command above, we see that the mapping has taken effect in the container:
$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
The user may still say `nobody` in the prompt, but this is expected, since the shell init scripts like `.bash_profile` haven’t been run again. Rest assured, though, the user is the privileged `root` user in the container.
After having gone through all of those contortions to write to `/proc/PID/uid_map` after the container was created to set up the `root` user mapping, let’s now look at a much simpler way to do it, via a switch to the `unshare` command:
$ unshare --map-root-user bash
root@kilgore-trout:~/projects/benjamintoll.com# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
root@kilgore-trout:~/projects/benjamintoll.com# cat /proc/$$/uid_map
0 1000 1
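Notice that this time gid=0(root), too. `--map-root-user` writes the group mapping on our behalf, and doing that part by hand has a wrinkle: an unprivileged writer must first deny the setgroups(2) system call. A sketch, run on the host, with 2713100 again standing in for the container’s PID:

# On the host.
$ echo deny | sudo tee /proc/2713100/setgroups   # required for unprivileged writers; harmless here
$ echo "0 1000 1" | sudo tee /proc/2713100/gid_map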
Of course, the `--map-root-user` switch implies the creation of a new `user` namespace:
In the container:
# lsns -t user
NS TYPE NPROCS PID USER COMMAND
4026533651 user 2 2935027 root bash
On the host:
$ sudo ls -l /proc/1/ns | ag user
lrwxrwxrwx 1 root root 0 Dec 17 16:42 user -> user:[4026531837]
Weeeeeeeeeeeeeeeeeeeee
Lastly, let’s prove to ourselves that this is indeed a rootless container. In other words, let’s show that, although the container is running as the `root` user, it actually maps to a non-privileged user on the host.
In the container:
$ unshare --map-root-user bash
# sleep 12345 &
[1] 2945562
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# whoami
root
# touch fooby
# ls -li fooby
6038245 -rw-rw-r-- 1 root root 0 Dec 17 17:20 fooby
On the host:
$ ps -ft5
UID PID PPID C STIME TTY TIME CMD
btoll 2690597 2294 0 Dec16 pts/5 00:00:00 -bash
btoll 2944466 2690597 0 17:14 pts/5 00:00:00 bash
btoll 2945562 2944466 0 17:16 pts/5 00:00:00 sleep 12345
$ ls -li /home/btoll/fooby
6038245 -rw-rw-r-- 1 btoll btoll 0 Dec 17 17:20 /home/btoll/fooby
Told you so.
Capabilities
Note that a process’s capabilities may be augmented depending on the mapping. Below we see an example of a process that is unable to create a `mnt` namespace because its effective user does not have the needed capabilities:
$ unshare --mount sh
unshare: unshare failed: Operation not permitted
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
However, we can give the user in that container escalated privileges by running the now-familiar mapping command on the host, mapping the non-privileged host user to the `root` user in the container. This will allow us to do what we want. Remember, if the `root` user does find a way to break out of the `user` namespace, the damage will be limited to only what the `btoll` user is permitted to do on the host.
Again, this will look like the following, assuming the new container process has PID 2713100:
$ sudo echo "0 1000 1" >> /proc/2713100/uid_map
Back in the container, we see that the user is now `root` and has the capabilities needed to unshare any other namespace:
$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
$ unshare --mount sh
\u@\h:\w$
Let’s wrap up this section by looking at the capabilities for a rootless container:
On the host:
$ capsh --print | ag "Current|uid"
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
secure-no-suid-fixup: no (unlocked)
uid=1000(btoll) euid=1000(btoll)
In the container:
# capsh --print | ag "Current|uid"
Current: =ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
secure-no-suid-fixup: no (unlocked)
uid=0(root) euid=0(root)
#
# cat /proc/$$/uid_map
0 1000 1
Note that the privileges have been escalated in the container, and the last `cat` command shows that `root` in the container does indeed map to `btoll` (1000) on the host.
Weeeeeeeeeeeeeeeeeeeeeeeeeee
/etc/sub{u,g}id
You may be thinking to yourself, “Hey, the format of `/proc/PID/uid_map` looks suspiciously like the files `/etc/subuid` and `/etc/subgid` that I’ve configured to map users when using Docker,” or something to that effect.
And you’d be right. Unfortunately, I haven’t been able to ascertain the provenance of those files.
Are they an implementation detail of an OCI spec? Do they predate the burgeoning popularity and ubiquity of containers? After all, the files are listed in the `FILES` section of the `useradd` man page, and tools like `newuidmap` and `newgidmap` also use them.
Does the Shadow know?
I think it’s safe to say that, regardless of their origin, these files allow for an easier way to map users within a `user` namespace than what we’ve seen in the examples above.
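For reference, the format is one colon-delimited entry per user: the name, the start of the subordinate ID range, and the length of the range. A hypothetical `/etc/subuid` line for the `btoll` user:

btoll:100000:65536

That is, `btoll` may delegate the 65536 host user IDs starting at 100000 to its namespaces, via tools like newuidmap(1).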
nsswitch.conf
Interestingly, I stumbled across the subuid(5) man page when trying to find the provenance of the `/etc/sub{u,g}id` files, and it states the following:
The delegation of the subordinate uids can be configured via the subid field in /etc/nsswitch.conf file. Only one value can be set as the delegation source. Setting this field to files configures the delegation of uids to /etc/subuid.
I have not tested this, but this would be a great area to explore.
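If you do explore it, the man page suggests it comes down to a single line. Untested, and taken straight from subuid(5):

# /etc/nsswitch.conf
subid: files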
Summary
Um.