RootlessKit: the gate to the rootless world
RootlessKit is a Linux-native "fake root" utility, made mainly for running Docker and Kubernetes as an unprivileged user, so as to protect the real root on the host from potential container-breakout attacks.
What it actually does
RootlessKit creates user_namespaces(7) and mount_namespaces(7), and executes newuidmap(1)/newgidmap(1) along with subuid(5) and subgid(5).
RootlessKit also supports isolating network_namespaces(7) with userspace NAT using "slirp".
Kernel NAT using SUID-enabled lxc-user-nic(1) is also experimentally supported.
Projects using RootlessKit
- Docker/Moby
- Usernetes: Docker & Kubernetes, installable under a non-root user's $HOME.
- k3s: Lightweight Kubernetes
- BuildKit: Next-generation docker build backend
Setup
$ go get github.com/rootless-containers/rootlesskit/cmd/rootlesskit
$ go get github.com/rootless-containers/rootlesskit/cmd/rootlessctl
or just run make to build the binaries under the ./bin directory.
Requirements
- newuidmap and newgidmap need to be installed on the host. These commands are provided by the uidmap package on most distributions.
- /etc/subuid and /etc/subgid should contain at least 65536 sub-IDs, e.g. penguin:231072:65536. These files are automatically configured on most distributions.
$ id -u
1001
$ whoami
penguin
$ grep "^$(whoami):" /etc/subuid
penguin:231072:65536
$ grep "^$(whoami):" /etc/subgid
penguin:231072:65536
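The manual check above can also be scripted. The following is a minimal sketch and not part of RootlessKit: the function names subid_total and subid_ok are hypothetical, and the code assumes the name:start:count line format shown above, summing the counts when a user has several ranges.

```shell
# subid_total USER FILE
# Print the total number of sub-IDs allocated to USER in a subuid(5)/subgid(5)
# style file (lines of the form name:start:count).
subid_total() {
    awk -F: -v user="$1" '$1 == user { total += $3 } END { print total + 0 }' "$2"
}

# subid_ok USER FILE
# Succeed when USER has at least 65536 sub-IDs in FILE.
subid_ok() {
    [ "$(subid_total "$1" "$2")" -ge 65536 ]
}
```

A possible invocation is subid_ok "$(whoami)" /etc/subuid && subid_ok "$(whoami)" /etc/subgid, which succeeds only when both files carry enough IDs for the current user.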
Distribution-specific hints
Debian (excluding Ubuntu) and Arch Linux:
- sudo sh -c "echo 1 > /proc/sys/kernel/unprivileged_userns_clone" is required
RHEL/CentOS 7 (excluding RHEL/CentOS 8):
- sudo sh -c "echo 28633 > /proc/sys/user/max_user_namespaces" is required
To persist sysctl configurations, edit /etc/sysctl.conf or add a file under /etc/sysctl.d.
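As an illustration, a drop-in file could carry the two settings from the hints above. This is only a sketch: the function name userns_sysctl_conf and the file name 99-rootless.conf are arbitrary choices, and only the line that applies to your distribution is needed.

```shell
# userns_sysctl_conf
# Print a sysctl.d fragment that persists the settings from the
# distribution-specific hints. Keep only the line relevant to your system.
userns_sysctl_conf() {
    cat <<'EOF'
# Debian (excluding Ubuntu) and Arch Linux
kernel.unprivileged_userns_clone=1
# RHEL/CentOS 7
user.max_user_namespaces=28633
EOF
}

# Usage (requires root; the file name is an illustrative choice):
#   userns_sysctl_conf | sudo tee /etc/sysctl.d/99-rootless.conf
#   sudo sysctl --system
```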
Usage
Inside rootlesskit, your UID is mapped to 0, but it is not the real root:
$ rootlesskit bash
rootlesskit$ id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
rootlesskit$ ls -l /etc/shadow
-rw-r----- 1 nobody nogroup 1050 Aug 21 19:02 /etc/shadow
rootlesskit$ cat /etc/shadow
cat: /etc/shadow: Permission denied
Environment variables are kept untouched:
$ rootlesskit bash
rootlesskit$ echo $USER
penguin
rootlesskit$ echo $HOME
/home/penguin
rootlesskit$ echo $XDG_RUNTIME_DIR
/run/user/1001
Filesystems can be isolated from the host with --copy-up:
$ rootlesskit --copy-up=/etc bash
rootlesskit$ rm /etc/resolv.conf
rootlesskit$ vi /etc/resolv.conf
You can even create network namespaces with Slirp:
$ rootlesskit --copy-up=/etc --copy-up=/run --net=slirp4netns --disable-host-loopback bash
rootlesskit$ ip netns add foo
...
Proc filesystem view:
$ rootlesskit bash
rootlesskit$ cat /proc/self/uid_map
0 1001 1
1 231072 65536
rootlesskit$ cat /proc/self/gid_map
0 1001 1
1 231072 65536
rootlesskit$ cat /proc/self/setgroups
allow
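Each line of uid_map and gid_map reads "ID inside the namespace, ID on the host, length of the range". The sketch below shows how such a map translates a namespaced ID to a host ID. The function name map_id is hypothetical and the code is an illustration of the file format, not something RootlessKit ships.

```shell
# map_id ID MAPFILE
# Translate an ID inside the namespace to the corresponding host ID, using a
# map in the /proc/PID/uid_map format: "inside-start outside-start count".
# Prints nothing and returns non-zero if ID is not covered by any range.
map_id() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    ' "$2"
}
```

With the map shown above, map_id 0 /proc/self/uid_map yields 1001 (the invoking user), while IDs from 1 upward fall into the sub-ID range that starts at 231072.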
Full CLI options:
NAME:
rootlesskit - the gate to the rootless world
USAGE:
rootlesskit [global options] command [command options] [arguments...]
VERSION:
0.7.0+dev
COMMANDS:
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
--debug debug mode
--state-dir value state directory
--net value network driver [host, slirp4netns, vpnkit, lxc-user-nic(experimental), vdeplug_slirp(deprecated)] (default: "host")
--slirp4netns-binary value path of slirp4netns binary for --net=slirp4netns (default: "slirp4netns")
--slirp4netns-sandbox value enable slirp4netns sandbox (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
--slirp4netns-seccomp value enable slirp4netns seccomp (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
--vpnkit-binary value path of VPNKit binary for --net=vpnkit (default: "vpnkit")
--lxc-user-nic-binary value path of lxc-user-nic binary for --net=lxc-user-nic (default: "/usr/lib/x86_64-linux-gnu/lxc/lxc-user-nic")
--lxc-user-nic-bridge value lxc-user-nic bridge name (default: "lxcbr0")
--mtu value MTU for non-host network (default: 65520 for slirp4netns, 1500 for others) (default: 0)
--cidr value CIDR for slirp4netns network (default: 10.0.2.0/24, requires slirp4netns v0.3.0+ for custom CIDR)
--disable-host-loopback prohibit connecting to 127.0.0.1:* on the host namespace
--copy-up value mount a filesystem and copy-up the contents. e.g. "--copy-up=/etc" (typically required for non-host network)
--copy-up-mode value copy-up mode [tmpfs+symlink] (default: "tmpfs+symlink")
--port-driver value port driver for non-host network. [none, builtin, socat(deprecated), slirp4netns(deprecated)] (default: "none")
--publish value, -p value publish ports. e.g. "127.0.0.1:8080:80/tcp"
--pidns create a PID namespace
--help, -h show help
--version, -v print the version
State directory
The following files will be created in the state directory, which can be specified with --state-dir:
- lock: lock file
- child_pid: decimal PID text that can be used for nsenter(1).
- api.sock: REST API socket for rootlessctl. See Port Drivers section.
If --state-dir is not specified, RootlessKit creates a temporary state directory on /tmp and removes it on exit.
Undocumented files are subject to change.
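To show how child_pid can feed nsenter(1), here is a small sketch. The helper name rk_nsenter_cmd is hypothetical; it only prints the command line rather than running it, and it uses the same -U -n -t flags that appear in the socat example later in this document.

```shell
# rk_nsenter_cmd STATE_DIR [COMMAND...]
# Print (do not run) an nsenter(1) command line that joins the user and
# network namespaces of the RootlessKit child recorded in STATE_DIR/child_pid.
rk_nsenter_cmd() {
    rk_state_dir=$1
    shift
    rk_pid=$(cat "$rk_state_dir/child_pid") || return 1
    # Append the optional command only when one was given.
    echo "nsenter -U -n -t $rk_pid${*:+ $*}"
}
```

For example, rk_nsenter_cmd /run/user/1001/rootlesskit/foo ip a prints the command to inspect the interfaces inside that instance; pass the output to sh or run it by hand.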
Environment variables
The following environment variables will be set for the child process:
- ROOTLESSKIT_STATE_DIR (since v0.3.0): absolute path to the state dir
Undocumented environment variables are subject to change.
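A script running as the child process can use this variable to locate the API socket without hard-coding the state directory. The helper below is a sketch; rk_api_socket is a hypothetical name, and it relies only on the api.sock file documented in the State directory section.

```shell
# rk_api_socket
# Print the path of the rootlessctl API socket for the current RootlessKit
# instance. Returns non-zero when ROOTLESSKIT_STATE_DIR is unset or empty,
# which suggests the shell is not a RootlessKit child.
rk_api_socket() {
    [ -n "${ROOTLESSKIT_STATE_DIR:-}" ] || return 1
    echo "$ROOTLESSKIT_STATE_DIR/api.sock"
}
```

Inside rootlesskit this could be used as rootlessctl --socket="$(rk_api_socket)" list-ports.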
PID Namespace
When --pidns (since v0.5.0) is specified, RootlessKit executes the child process in a new PID namespace.
The RootlessKit child process becomes the init (PID=1).
When RootlessKit terminates, all the processes in the namespace are killed with SIGKILL.
See also pid_namespaces(7).
Network Drivers
RootlessKit provides several drivers for providing network connectivity:
- --net=host: use host network namespace (default)
- --net=slirp4netns: use slirp4netns (recommended)
- --net=vpnkit: use VPNKit
- --net=lxc-user-nic: use lxc-user-nic (experimental)
- --net=vdeplug_slirp: use vdeplug_slirp (deprecated)
Benchmark (Aug 28, 2018):
| Implementation | MTU=1500 | MTU=4000 | MTU=16384 | MTU=65520 |
| --- | --- | --- | --- | --- |
| (rootful veth) | (52.1 Gbps) | (45.4 Gbps) | (43.6 Gbps) | (51.5 Gbps) |
| rootlesskit --net=slirp4netns | 1.07 Gbps | 2.78 Gbps | 4.55 Gbps | 9.21 Gbps |
| rootlesskit --net=vpnkit | 514 Mbps | 526 Mbps | 540 Mbps | (Unsupported) |
| rootlesskit --net=vdeplug_slirp | 763 Mbps | (Unsupported) | (Unsupported) | (Unsupported) |
--net=lxc-user-nic is as fast as rootful veth.
--net=host (default)
--net=host does not isolate the network namespace from the host.
Pros:
- No performance overhead
- Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured
Cons:
- No permission for network-namespaced operations, e.g. creating iptables rules, running tcpdump
To route ICMP Echo packets (ping), you need to write the range of GIDs to net.ipv4.ping_group_range.
$ sudo sh -c "echo 0 2147483647 > /proc/sys/net/ipv4/ping_group_range"
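Whether a given group is already covered can be checked before changing the setting. The following sketch is an illustration only: gid_can_ping is a hypothetical helper, it treats the range as inclusive, and it tests a single GID rather than every group the process belongs to.

```shell
# gid_can_ping GID [RANGE_FILE]
# Succeed when GID falls inside the inclusive "low high" range stored in
# RANGE_FILE (default: /proc/sys/net/ipv4/ping_group_range).
# A range such as "1 0" (low greater than high) matches no group at all.
gid_can_ping() {
    awk -v gid="$1" '{ ok = (gid >= $1 && gid <= $2) } END { exit !ok }' \
        "${2:-/proc/sys/net/ipv4/ping_group_range}"
}
```

For example, gid_can_ping "$(id -g)" && echo "ping should work" reports whether the current primary group is inside the configured range.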
--net=slirp4netns (recommended)
--net=slirp4netns isolates the network namespace from the host and launches slirp4netns for providing usermode networking.
Pros:
- Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
- Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured
- Supports hardening using mount namespace and seccomp (--slirp4netns-sandbox=auto, --slirp4netns-seccomp=auto, since RootlessKit v0.7.0, slirp4netns v0.4.0)
Cons:
- Extra performance overhead (but still faster than --net=vpnkit)
- Supports only TCP, UDP, and ICMP Echo packets
To use --net=slirp4netns, you need to install slirp4netns.
v0.3.0 or later is recommended.
$ sudo dnf install slirp4netns
or
$ sudo apt-get install slirp4netns
If a binary package is not available for your distribution, install from source:
$ git clone https://github.com/rootless-containers/slirp4netns
$ cd slirp4netns
$ ./autogen.sh && ./configure && make
$ cp slirp4netns ~/bin
The network is configured as follows by default:
- IP: 10.0.2.100/24
- Gateway: 10.0.2.2
- DNS: 10.0.2.3
The network configuration can be changed by specifying a custom CIDR, e.g. --cidr=10.0.3.0/24 (requires slirp4netns v0.3.0+).
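The sketch below derives the addresses to expect for a custom CIDR. It rests on an assumption this document does not state: that the offsets of the default layout (+100 for the IP, +2 for the gateway, +3 for DNS) are kept relative to the network address. The function name slirp_addrs is hypothetical, and the arithmetic only touches the last octet, so it is meant for networks like the /24 examples above.

```shell
# slirp_addrs CIDR
# Print the IP, gateway and DNS addresses expected for CIDR, ASSUMING the
# same offsets as the default 10.0.2.0/24 layout (+100, +2, +3).
# Only the last octet is adjusted; no overflow handling is attempted.
slirp_addrs() {
    sa_net=${1%/*}        # e.g. 10.0.3.0
    sa_prefix=${1#*/}     # e.g. 24
    sa_base=${sa_net%.*}  # e.g. 10.0.3
    sa_last=${sa_net##*.} # e.g. 0
    echo "ip=$sa_base.$((sa_last + 100))/$sa_prefix"
    echo "gateway=$sa_base.$((sa_last + 2))"
    echo "dns=$sa_base.$((sa_last + 3))"
}
```

Running slirp_addrs 10.0.2.0/24 reproduces the default values listed above, which is a quick sanity check of the assumed offsets; verify the result with ip a inside the namespace before relying on it.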
Specifying --copy-up=/etc is highly recommended unless /etc/resolv.conf on the host is statically configured. Otherwise /etc/resolv.conf in RootlessKit's mount namespace will be unmounted when /etc/resolv.conf on the host is recreated, typically by NetworkManager or systemd-resolved.
It is also highly recommended to specify --disable-host-loopback. Otherwise ports listening on 127.0.0.1 in the host are accessible as 10.0.2.2 in RootlessKit's network namespace.
Example session:
$ rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback bash
rootlesskit$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
link/ether 46:dc:8d:09:fd:f2 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.100/24 scope global tap0
valid_lft forever preferred_lft forever
inet6 fe80::44dc:8dff:fe09:fdf2/64 scope link
valid_lft forever preferred_lft forever
rootlesskit$ ip r
default via 10.0.2.2 dev tap0
10.0.2.0/24 dev tap0 proto kernel scope link src 10.0.2.100
rootlesskit$ cat /etc/resolv.conf
nameserver 10.0.2.3
rootlesskit$ curl https://www.google.com
<!doctype html><html ...>...</html>
Starting with RootlessKit v0.7.0 + slirp4netns v0.4.0, --slirp4netns-sandbox=auto/true/false (enables mount namespace) and --slirp4netns-seccomp=auto/true/false (enables seccomp rules) can be used to harden the slirp4netns process.
--net=vpnkit
--net=vpnkit isolates the network namespace from the host and launches VPNKit for providing usermode networking.
Pros:
- Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
Cons:
- Extra performance overhead
- Supports only TCP and UDP packets. No support for ICMP Echo (ping), unlike --net=slirp4netns, even if /proc/sys/net/ipv4/ping_group_range is configured.
To use --net=vpnkit, you need to install VPNKit.
$ git clone https://github.com/moby/vpnkit.git
$ cd vpnkit
$ make
$ cp vpnkit.exe ~/bin/vpnkit
The network is configured as follows by default:
- IP: 192.168.65.3/24
- Gateway: 192.168.65.1
- DNS: 192.168.65.1
As in --net=slirp4netns, specifying --copy-up=/etc and --disable-host-loopback is highly recommended.
If --disable-host-loopback is not specified, ports listening on 127.0.0.1 in the host are accessible as 192.168.65.2 in RootlessKit's network namespace.
--net=lxc-user-nic (experimental)
--net=lxc-user-nic isolates the network namespace from the host and launches the lxc-user-nic(1) SUID binary for providing kernel-mode NAT.
Pros:
- No performance overhead
- Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
- Supports ICMP Echo (ping) without /proc/sys/net/ipv4/ping_group_range configuration
Cons:
- Less secure
- Needs /etc/lxc/lxc-usernet configuration
To use lxc-user-nic, you need to install the liblxc-common package:
$ sudo apt-get install liblxc-common
You also need to set up /etc/lxc/lxc-usernet:
# USERNAME TYPE BRIDGE COUNT
penguin veth lxcbr0 1
The COUNT value needs to be increased to run multiple RootlessKit instances with --net=lxc-user-nic simultaneously.
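To see how many instances the current configuration allows, the COUNT column can be read back with a small helper. This is a sketch under stated assumptions: lxc_usernet_quota is a hypothetical name, it only understands plain user names in the four-column layout shown above, and adding up several matching lines is a simplification rather than documented lxc behaviour.

```shell
# lxc_usernet_quota USER BRIDGE [FILE]
# Print the number of veth devices USER may attach to BRIDGE according to a
# file in the "USERNAME TYPE BRIDGE COUNT" layout (default:
# /etc/lxc/lxc-usernet). Matching lines are summed; comment lines are skipped.
lxc_usernet_quota() {
    awk -v user="$1" -v bridge="$2" '
        $1 ~ /^#/ { next }
        $1 == user && $2 == "veth" && $3 == bridge { n += $4 }
        END { print n + 0 }
    ' "${3:-/etc/lxc/lxc-usernet}"
}
```

With the example entry above, lxc_usernet_quota penguin lxcbr0 prints 1, meaning a second simultaneous instance would need a larger COUNT.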
It may take a few seconds to configure the interface using DHCP.
If you start and stop RootlessKit too frequently, you might use up all available DHCP addresses.
You might need to reset /var/lib/misc/dnsmasq.lxcbr0.leases and restart the lxc-net service.
Currently, the MAC address is always set to a random address.
Port Drivers
To expose the ports in the network namespace to the host network namespace, --port-driver needs to be specified.
- --port-driver=none: do not expose ports (default)
- --port-driver=builtin: use built-in port driver (recommended)
- --port-driver=socat: use socat binary (deprecated)
- --port-driver=slirp4netns: use slirp4netns API (deprecated)
Benchmark (October 13, 2019):
| --port-driver | Throughput |
| --- | --- |
| builtin | 27.3 Gbps |
| slirp4netns | 8.3 Gbps |
| socat | 5.2 Gbps |
For example, to expose 80 in the child as 8080 in the parent:
$ rootlesskit --state-dir=/run/user/1001/rootlesskit/foo --net=slirp4netns --disable-host-loopback --copy-up=/etc --port-driver=builtin bash
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock add-ports 0.0.0.0:8080:80/tcp
1
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock list-ports
ID PROTO PARENTIP PARENTPORT CHILDPORT
1 tcp 0.0.0.0 8080 80
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock remove-ports 1
1
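The port specification passed to add-ports (and to --publish) has the shape PARENTIP:PARENTPORT:CHILDPORT/PROTO. The sketch below splits such a string into the columns that list-ports displays. parse_port_spec is a hypothetical helper for scripts that generate or validate these strings; it handles only the IPv4 form shown in this document.

```shell
# parse_port_spec SPEC
# Split a spec such as "127.0.0.1:8080:80/tcp" and print
# "PROTO PARENTIP PARENTPORT CHILDPORT" on one line.
parse_port_spec() {
    pp_proto=${1##*/}               # text after the slash, e.g. tcp
    pp_rest=${1%/*}                 # e.g. 127.0.0.1:8080:80
    pp_child_port=${pp_rest##*:}    # e.g. 80
    pp_rest=${pp_rest%:*}           # e.g. 127.0.0.1:8080
    pp_parent_port=${pp_rest##*:}   # e.g. 8080
    pp_parent_ip=${pp_rest%:*}      # e.g. 127.0.0.1
    echo "$pp_proto $pp_parent_ip $pp_parent_port $pp_child_port"
}
```

For the spec used in the session above, parse_port_spec 0.0.0.0:8080:80/tcp prints the same four values as the list-ports row.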
You can also expose ports using socat and nsenter instead of RootlessKit's port drivers.
$ pid=$(cat /run/user/1001/rootlesskit/foo/child_pid)
$ socat -t -- TCP-LISTEN:8080,reuseaddr,fork EXEC:"nsenter -U -n -t $pid socat -t -- STDIN TCP4\:127.0.0.1\:80"