Preface
I’ve been using Docker containers for many years, but I’ve always treated them as magical black boxes. I knew that Docker uses container runtimes like runc (by default) to create isolated environments to run code, but I didn’t know what “isolated” really means. To unveil the black box, I decided to implement a toy container runtime from scratch in Rust.
Luckily, there are a ton of tutorials and resources online that I can learn from. My implementation is largely based on two blogs in particular: Linux Containers in 500 Lines of Code & Writing a Container in Rust. As someone who knew very little about Linux, I found the experience of building a container extremely eye-opening and rewarding.
Here is a summary of what we will build:
- root filesystem isolation with mount namespace
- resource restriction with cgroups
- limit syscalls with seccomp
- isolate user IDs and group IDs with user namespace and uid mapping
- privilege control with capabilities
In this blog series, I will cover the theory behind and the implementation of a container from the perspective of someone new to Linux. I will also provide as many demos as possible to demonstrate how the Linux primitives that make up a container work.
The full source code is available here.
What exactly is a Container?
The concept of containers is rooted in Linux. Check out this Red Hat blog about the history of containers. When people talk about containers, they are more or less talking about Linux containers.
However, the Linux Kernel doesn’t have a native object that represents a “container”. From the perspective of the kernel, containers are just processes. But what makes these processes special?
The best way to look at the properties of a process in a container is to look at some demos with the help of Docker, a tool that can create and run containers.
Filesystem Isolation
Firstly, a process in a container has an isolated view of the filesystem. In the demo below, we created a container based on the ubuntu image.
If we navigate to the root directory via cd /, we notice that the root filesystem of the process in the container is not the same as the root filesystem on the host system. Modifying the root filesystem within the container will have no impact on the host system.
docker run -it ubuntu bash
cd /
ls
# bin boot dev etc home lib media mnt opt proc
# root run sbin srv sys tmp usr var
# host system
cd /
ls
# bin dev lib mnt opt run srv tmp
# boot etc lost+found proc sbin swapfile usr
# cdrom home media root snap sys var
The new root filesystem comes from the ubuntu image. A Docker image is made up of filesystem layers stacked on top of each other; these layers form the base of a container’s root filesystem.
Pid Isolation
Processes in a container have an isolated view of other processes running on the host. In the example below, if we perform ps -a -u to list all processes in the container, we only see the process running bash and the ps -a -u process itself. However, if we perform ps -a -u on the host system, we see a lot more processes.
Furthermore, the process in the container perceives its pid as 1. From the perspective of the host system, however, the pid of the process running bash is 6098.
docker run -it ubuntu bash
ps -a -u
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# root 1 0.2 0.0 4136 3200 pts/0 Ss 07:19 0:00 bash
# root 9 0.0 0.0 6412 2432 pts/0 R+ 07:19 0:00 ps -a -u
echo $$
# 1
# host system
ps -a -u
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
root 6098 0.0 0.0 4136 3200 pts/0 Ss+ 15:19 0:00 bash
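The same pid isolation can be reproduced without Docker using the unshare utility from util-linux (a sketch; requires root):

```shell
# Create a new pid namespace. --fork is required because unshare cannot
# change the pid of the current process, and --mount-proc remounts /proc
# so that ps only sees processes in the new namespace.
sudo unshare --pid --fork --mount-proc /bin/bash
echo $$
# 1
```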
User ID Isolation
Processes in a container have an isolated view of things like user IDs and group IDs. This enables a process to run as different users inside and outside the container.
In the example below, we enable the user namespace via --userns-remap=default. The process in the container perceives its uid as 0. But if we look at the user corresponding to the process from the host system, the user is 165536.
sudo dockerd --userns-remap=default
sudo docker run -it --rm busybox /bin/sh
id
# uid=0(root) gid=0(root) groups=0(root),10(wheel)
# host system
ps -a -u
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# ...
# 165536 14154 0.0 0.0 3984 1920 pts/0 Ss+ 14:33 0:00 /bin/sh
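We can also read the uid mapping directly from /proc. Each line of uid_map has the form <uid inside the namespace> <uid on the host> <length of the range>; the pid 14154 comes from the demo above, and the range length shown is an assumption based on Docker’s default remap configuration:

```shell
# host system: uid 0 inside the container maps to uid 165536 on the host
cat /proc/14154/uid_map
#          0     165536      65536
```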
Resource Restriction
In Docker, you can constrain resources that the container can access. For example, you can limit the amount of memory the process can take, the number of CPUs the container can run on, etc. Check out Docker’s doc for the full list of resources that can be constrained.
As an example, here is how you can limit the container to a memory limit of 128 MB.
docker run -it --memory 128m ubuntu bash
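Under the hood, Docker enforces this limit with cgroups, which we will cover later. Assuming cgroup v2 and the default systemd cgroup driver, you can verify the limit from the host; the exact path is an assumption that varies by setup, and <id> stands for the full container id from docker ps --no-trunc:

```shell
# memory.max holds the limit in bytes (128 * 1024 * 1024 = 134217728)
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max
```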
Secret behind Containers
The secret behind how a container can provide the isolation properties demonstrated above boils down to the following Linux primitives:
- Namespaces
- Capabilities
- cgroups
We will cover these in greater detail throughout the blog!
API
Before we talk about the theory of and implementation behind containers, let’s first look at the API for my toy container.
At its core, the mini-container program takes two arguments: an executable program and a directory that points to a root filesystem. It creates a process, sets up the container environment for the process, and executes the program in this container.
Here are the arguments and options to execute my toy container.
mini-container [OPTIONS] <PATH_TO_EXECUTABLE> <ROOT_FILESYSTEM_PATH>
Arguments:
<PATH_TO_EXECUTABLE>
Command to execute
<ROOT_FILESYSTEM_PATH>
Absolute path to the new root filesystem
Options:
-p, --pid <PID>
Set the pid for child process
-m, --memory <MEMORY>
Memory limit (megabytes)
--nproc <NPROC>
Max pids allowed
-u, --user <USER>
Set the User ID for child process
--cap-add <CAP_ADD>
Add Linux capabilities to the container environment
--cap-drop <CAP_DROP>
Drop Linux capabilities from the container environment. Specify “ALL” to drop all
-h, --help
Print help
Examples
Running an interactive shell
To run an interactive shell in the container environment, you first need to set up a directory that will serve as the root filesystem for the container. This is equivalent to an image in Docker, which contains a minimal OS. For all my demos, I will be using Alpine’s Mini Root Filesystem image.
First, we download the image and extract it into the alpine directory.
cd /home/brianshih
# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine
Next, we can launch the container and execute /bin/ash. Note that the alpine directory will become the new root filesystem.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
Here is the rough equivalent command in Docker:
docker run -it alpine /bin/ash
Limiting resources in the container
You can run a container with limited memory and limited process capacity via the --nproc and --memory options.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
    --nproc 5 --memory 1048
Here is the rough equivalent command in Docker - though unlike my implementation, nproc in Docker sets the maximum number of processes available to a user, not to a container.
docker run --memory="1048m" --ulimit nproc=5 IMAGE
Dropping and Adding Linux Capabilities
Here is how you can drop all Linux capabilities and then add back the NET_BIND_SERVICE capability. Note that my toy implementation only supports 3 capabilities (so far). Adding more is trivial, but my goal isn’t to build a production-level container, so I stopped once I felt I understood how they work.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
    --cap-drop ALL --cap-add NET_BIND_SERVICE
Here is the rough equivalent command in docker:
docker run --cap-drop all --cap-add NET_BIND_SERVICE alpine
Setting the User ID
Here is how you can set the user ID for the process.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine --user 0
Here is the rough equivalent command in docker:
docker run --rm --user $UID:$GID alpine ash
Project Overview
Here is the higher-level setup for this project:
- parse the command line arguments
- create the child process
- set up the namespaces, capabilities, and syscall restrictions
- execute the program
Parse the command line argument
To parse the command line arguments, we use the clap crate. Here is the struct representation of the parsed arguments:
#[derive(Parser)]
struct Cli {
    /// Command to execute
    command: String,
    /// Absolute path to new root filesystem
    root_filesystem_path: String,
    /// Optional pid for child process
    #[arg(short, long)]
    pid: Option<u32>,
    /// Memory limit (megabytes)
    #[arg(short, long)]
    memory: Option<i64>,
    /// Max pids allowed
    #[arg(long)]
    nproc: Option<i64>,
    /// User ID for the child process
    #[arg(short, long)]
    user: Option<u32>,
    // Add capabilities to the bounding set
    #[clap(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_add: Option<Vec<String>>,
    // Remove capabilities from the bounding set, or all if the String provided is "ALL"
    #[clap(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_drop: Option<Vec<String>>,
}
The entry point of the project is the run method. All we have to do is call Cli::parse() to parse the arguments.
fn main() {
    if let Err(_) = run() {
        cleanup();
        exit(-1);
    }
}

fn run() -> ContainerResult {
    let cli = Cli::parse();
    ...
}
Create the child process
Since a container is just a process, we need to create the child process for the container. The create_child_process function is responsible for that.
fn run() -> ContainerResult {
    let cli = Cli::parse();
    ...
    let child_pid = create_child_process(&config)?;
    if let Err(e) = waitpid(child_pid, None) {
        return Err(ContainerError::WaitPid);
    };
    Ok(())
}
After creating the child process, we need to make sure the parent process doesn't terminate until the child process completes. We use the waitpid call to make sure of that.
Here is the implementation of create_child_process:
// Creates a child process with clone and runs the executable file
// with execve in the child process.
fn create_child_process(config: &ChildConfig) -> Result<Pid, ContainerError> {
    let mut flags = CloneFlags::empty();
    flags.insert(CloneFlags::CLONE_NEWNS);
    flags.insert(CloneFlags::CLONE_NEWCGROUP);
    flags.insert(CloneFlags::CLONE_NEWPID);
    flags.insert(CloneFlags::CLONE_NEWIPC);
    flags.insert(CloneFlags::CLONE_NEWNET);
    flags.insert(CloneFlags::CLONE_NEWUTS);
    let mut stack = [0; STACK_SIZE];
    let clone_res = unsafe {
        clone(
            Box::new(|| match child(config) {
                Ok(_) => 0,
                Err(_) => -1,
            }),
            &mut stack,
            flags,
            // If the signal SIGCHLD is ignored, waitpid will hang until the
            // child exits and then fail with code ECHILD.
            Some(Signal::SIGCHLD as i32),
        )
    };
    match clone_res {
        Ok(pid) => {
            println!("Child pid: {:?}", pid);
            Ok(pid)
        }
        Err(_) => Err(ContainerError::Clone),
    }
}
It uses clone to create the child process, passing flags such as CLONE_NEWNS, CLONE_NEWPID, etc., in order to create the different namespaces (mount, pid, network, etc.) necessary for isolation. We will cover these namespaces in more detail later.
The Linux clone call takes a function argument. When the function returns, the child process terminates. The function we pass to clone is the child method, whose responsibility is to set up the container environment and execute the user-provided program.
Setup the namespaces, capabilities, and syscalls restrictions & Executing the program
Here is the implementation of child:
// Set up the namespaces, capabilities, and syscall restrictions
// before running the executable
fn child(config: &ChildConfig) -> ContainerResult {
    set_hostname(config)?;
    isolate_filesystem(config)?;
    user_ns(config)?;
    capabilities(config)?;
    syscalls()?;
    match execve::<CString, CString>(&config.exec_path, &config.args, &[]) {
        Ok(_) => Ok(()),
        Err(e) => {
            println!("Failed to execute!: {:?}", e);
            Err(ContainerError::Execve)
        }
    }
}
Before using execve to execute the user-provided program, we set up the container environment by isolating the filesystem, setting up the user namespace, granting and taking away capabilities, and restricting syscalls.
Summary
To summarize, the project contains these core methods:
- run: parses the command line arguments, creates the child process, and waits until the child process terminates
- create_child_process: uses clone to create the child process, passing child as the function argument to clone
- child: sets up the container environment before executing the user-provided program with execve
For the rest of this blog, we will focus on learning how we can set up the container environment for the process. For each component of the container environment, we will break it down into:
- Goal
- Theory
- Demo
- Implementation
- Testing the Implementation
Isolate Filesystem
Goal
We want to provide a process with an isolated view of the filesystem. In other words, we want to ensure the process cannot touch any files and directories from the host’s filesystem.
Theory
A filesystem is an organized collection of files and directories. Each directory can be backed by a different filesystem. This is the power of the UNIX filesystem abstraction - all directories and files from all filesystems reside under a single directory tree.
To attach a filesystem to a directory, we use the mount command. The directory that we mount onto is also known as the mount point.
$ mount device directory
To isolate the filesystem, we need to ensure that the process cannot have access to or modify any mounts of the host system. This is achieved with the help of the mount namespace.
Mount Namespace
According to Linux’s doc, “mount namespaces provide isolation of the list of mounts seen by the processes in each namespace instance”. Each mount namespace has its own set of mount points, and modifications to the mount points in one namespace do not affect other namespaces.
A new mount namespace can be created using either clone or unshare with the CLONE_NEWNS flag. There are a few things to keep in mind: if the namespace is created with clone, the parent process’s mount namespace is copied to the child namespace; if it is created with unshare, the caller’s previous mount namespace is copied to the new namespace. This means that modifying files or directories in a newly created mount namespace can still affect the host system.
To achieve isolation, we could use umount to tear down the root mount. This would not affect the mount list seen on the host system, because modifications to the mount list (via mount and umount) do not propagate to other mount namespaces.
However, unmounting the root filesystem is usually not allowed because any files open in the root filesystem would prevent the unmount. But even if we manage to unmount the root filesystem, the system would be unusable as the process won’t be able to load any executables or access any devices.
Instead, what we want is to swap out the root filesystem for a new filesystem that contains the minimal required system files and libraries. This is where pivot_root comes in.
pivot_root
pivot_root is a system call that allows us to change the root mount in the mount namespace of the calling process. It takes two directories as arguments, new_root and put_old, and it “moves the root mount to the directory put_old and makes new_root the new root mount.” The put_old directory must be at or underneath new_root.
$ pivot_root new_root put_old
Here are the steps to use pivot_root to achieve filesystem isolation for the container:
- Create the new_root directory that will become the new root filesystem. An empty root filesystem is useless, so we need to put any files necessary to run the application into the new_root directory. But how do you determine the “necessary files”? This is where Docker images become useful - a Docker image can be thought of as an archive of a root filesystem. We can download an image (like alpine) and extract it into the new_root directory. An image like alpine doesn’t contain an entire OS, just an essential set of files.
- Create a put_old directory inside the new_root directory.
- Create a new mount namespace with unshare.
- Bind mount new_root onto itself, since Linux requires new_root to be a mount point before the pivot.
- Use pivot_root to make new_root the new root filesystem. The put_old directory now points to the original root filesystem.
- Unmount the put_old filesystem and remove the put_old directory.
After those steps, we have an isolated filesystem. Don’t worry if this seems a bit abstract, I will walk through this in detail in Demo 2 below.
Demo
Demo 1: mount namespace
First, let’s demonstrate that within a mount namespace, mounting or unmounting a filesystem wouldn’t affect other namespaces.
mkdir /tmp/ex
mkdir /tmp/ex/one
sudo unshare -m /bin/bash
mount -t tmpfs tmpfs /tmp/ex
ls /tmp/ex
# empty
mkdir /tmp/ex/foo
ls /tmp/ex
# foo
# From the host system
ls /tmp/ex
# one
In the example above, we created a directory /tmp/ex and a directory one under it.
Next, we created a new mount namespace with unshare and mounted a tmpfs filesystem onto /tmp/ex.
At this point, /tmp/ex is backed by a new filesystem. We confirm that it’s no longer related to /tmp/ex on the host system by using ls to list the contents of /tmp/ex and observing that the one directory we created earlier is gone.
To show that modifications to the mounted filesystem have no impact on the host system, we created a directory foo under /tmp/ex and performed ls /tmp/ex to confirm that foo is inside the directory.
Now, when we check what’s inside /tmp/ex from the host system, we only see the original one directory and not the foo directory. This confirms that mounting a filesystem won’t affect other namespaces.
As a side note, if you ever want to see which processes are inside which mount namespace, you can use the ps command or look at /proc/self/ns/mnt like below:
sudo unshare -m /bin/bash
echo $$
# 6766
ps -o pid,mntns,args
# PID MNTNS COMMAND
# 6765 4026531841 sudo unshare -m /bin/bash
# 6766 4026532469 /bin/bash
# 6772 4026532469 ps -o pid,mntns,args
ls -l /proc/self/ns/mnt
# lrwxrwxrwx 1 root root 0 Dec 19 16:13 /proc/self/ns/mnt -> 'mnt:[4026532469]'
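Another option is the lsns utility from util-linux, which lists every namespace visible to the caller along with the lowest pid inside it (run as root to see namespaces belonging to other users):

```shell
# List all mount namespaces, their process counts, and the first process
sudo lsns -t mnt
```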
Demo 2: isolate filesystem with pivot_root
Earlier, we outlined the steps to use pivot_root to achieve filesystem isolation. Let’s put that into practice. We will be using Alpine’s mini root filesystem image as the new root filesystem. Here are the commands:
# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine
cd alpine
echo > I_AM_ALPINE.txt
# create the mount namespace
sudo unshare -m
# make the new_root directory a mount point
mount --bind alpine alpine
# create the put_old directory
mkdir alpine/oldrootfs
cd alpine
# swap out the root filesystem
pivot_root . oldrootfs
cd /
ls
# I_AM_ALPINE.txt bin dev etc home lib media mnt
# oldrootfs opt proc root run sbin srv sys tmp usr var
ls /oldrootfs/
# bin boot cdrom dev etc home lib lost+found media mnt
# opt proc root run sbin snap srv swapfile sys tmp usr var
umount -l oldrootfs/
rmdir oldrootfs/
We first download the alpine image and extract it into a newly created alpine directory that will serve as the new_root in pivot_root. Next, we bind mount the alpine directory onto itself to turn it into a mount point.
We then create the put_old directory for pivot_root under the alpine directory, at alpine/oldrootfs. Finally, we use pivot_root to swap out the root filesystem.
If we navigate to the root directory via cd /, we can verify that the root directory is indeed the Alpine filesystem (it contains I_AM_ALPINE.txt). However, we can still see the oldrootfs directory, which points to the original root filesystem. Therefore, we unmount it and remove the directory to be fully isolated from the host’s original filesystem.
We can also verify the mount points in the host system as follows:
# host system. 10920 is the pid of the process with the isolated filesystem
cat /proc/10920/mounts
# /dev/vda2 / ext4 rw,relatime,errors=remount-ro 0 0
cat /proc/10920/mountinfo
# 1066 985 252:2 /home/brianshih/alpine / rw,relatime - ext4 /dev/vda2 rw,errors=remount-ro
We can see that the process with the isolated filesystem only has one mount point, whose root is /home/brianshih/alpine.
Implementation
The implementation for my toy container is more-or-less just Demo 2 in the form of Rust code.
Here is a helper function that wraps the mount system call. Note that according to the Linux doc, if only the directory is provided, mount modifies an existing mount point.
// Wrapper around the mount syscall
fn mount_filesystem(
    filesystem_path: Option<&PathBuf>,
    target_directory: &PathBuf,
    flags: Vec<MsFlags>,
) -> ContainerResult {
    let mut mountflags = MsFlags::empty();
    for flag in flags {
        mountflags.insert(flag);
    }
    match mount::<PathBuf, PathBuf, PathBuf, PathBuf>(
        filesystem_path,
        target_directory,
        None,
        mountflags,
        None,
    ) {
        Ok(_) => Ok(()),
        Err(err) => return Err(ContainerError::MountSysCall),
    }
}
Here is the code that isolates the filesystem.
fn isolate_filesystem(config: &ChildConfig) -> ContainerResult {
    // Recursively make all mount points under / private so that no
    // mount events propagate to other mount namespaces
    mount_filesystem(
        None,
        &PathBuf::from("/"),
        vec![MsFlags::MS_REC, MsFlags::MS_PRIVATE],
    )?;
    // Bind mount the new root onto itself so that it becomes a mount point
    let filesystem_path = PathBuf::from("/home/brianshih/alpine");
    mount_filesystem(
        Some(&filesystem_path),
        &filesystem_path,
        vec![MsFlags::MS_BIND, MsFlags::MS_PRIVATE],
    )?;
    // Create the put_old directory inside the new root
    let root_filesystem_path = &config.root_filesystem_directory;
    let old_root_path = "oldrootfs";
    let old_root_absolute_path = PathBuf::from(format!("{root_filesystem_path}/{old_root_path}"));
    if let Err(e) = create_dir_all(&old_root_absolute_path) {
        return Err(ContainerError::CreateDir);
    }
    // Swap out the root filesystem
    if let Err(e) = pivot_root(&filesystem_path, &PathBuf::from(old_root_absolute_path)) {
        return Err(ContainerError::PivotRoot);
    };
    // Lazily unmount the old root and remove its directory
    if let Err(e) = umount2(
        &PathBuf::from(format!("/{old_root_path}")),
        MntFlags::MNT_DETACH,
    ) {
        return Err(ContainerError::Umount);
    }
    if let Err(e) = remove_dir(&PathBuf::from(format!("/{old_root_path}"))) {
        return Err(ContainerError::RemoveDir);
    };
    // Change the directory to the root directory
    if let Err(e) = chdir(&PathBuf::from("/")) {
        return Err(ContainerError::ChangeDir);
    };
    Ok(())
}
Something we didn’t cover is the propagation type of a mount point. Each mount point has one of four propagation types: MS_SHARED, MS_PRIVATE, MS_SLAVE, and MS_UNBINDABLE. Mount points of type MS_SHARED propagate mount and unmount events to other mounts in the same peer group (learn more about peer groups here). Mount points of type MS_PRIVATE do not propagate events to their peers.
In my code snippet, we recursively set all mount points in the root filesystem to MS_PRIVATE to make sure that no events are propagated to other mount namespaces.
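The recursive MS_PRIVATE remount described above has a direct shell equivalent (a sketch; requires root):

```shell
# Recursively mark every mount under / as private so that mount and
# unmount events no longer propagate to other mount namespaces
sudo mount --make-rprivate /
```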
Apart from that, the code is fairly straightforward and reproduces what we did in Demo 2.
Testing the Implementation
To test whether the process has an isolated filesystem, we first create the alpine directory, which will serve as the directory for the new root.
Next, we create the container environment and run /bin/ash via sudo target/debug/mini-container /bin/ash /home/brianshih/alpine, where /home/brianshih/alpine is the path to the new root filesystem for our container.
After that, we navigate to the root directory and confirm that the root filesystem is the one created from the alpine directory.
# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine
cd alpine
echo > I_AM_ALPINE.txt
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
cd /
ls
# I_AM_ALPINE.txt lib root tmp
# bin media run usr
# dev mnt sbin var
# etc opt srv
# home proc sys
Additional Resources
- Docker Security Blog - Mount Namespace
- Building a container from scratch - part 2
- Blog about mount namespaces and shared subtrees
Limit Syscalls
Goal
Restrict which system calls the running process can make in order to protect the host system.
Theory
Certain system calls may pose security risks or impact the host system. Seccomp (Secure Computing Mode) is a Linux feature that allows developers to filter system calls to the kernel. Seccomp operates in two modes:
- Strict: a minimal set of syscalls is allowed
- Filter: allows developers to define custom policies for which syscalls are permitted
Seccomp filters are expressed as Berkeley Packet Filters (BPF) programs. These filters can be used to allow or deny system calls, as well as conditionally filter on system call arguments.
For this project, we will be using the following libseccomp library functions:
- seccomp_init: initializes the seccomp filter
- seccomp_rule_add: add a new filter rule to the current seccomp filter
- seccomp_load: load the current seccomp filter into the kernel
Each rule in a seccomp filter returns an action, which can be one of:
- SCMP_ACT_KILL: kill the thread
- SCMP_ACT_KILL_PROCESS: kill the process
- SCMP_ACT_TRAP: throw a SIGSYS signal
- SCMP_ACT_ERRNO: fail the syscall with the specified error code
- SCMP_ACT_TRACE: notify the tracer
- SCMP_ACT_LOG: log the syscall, but allow it
- SCMP_ACT_ALLOW: allow the syscall
- SCMP_ACT_NOTIFY: notify the monitoring process
In short, Seccomp allows us to set rules that determine what happens when certain system calls are invoked. Seccomp is a powerful tool. But knowing which system calls to filter out is the tricky part. In this blog, I will focus only on the mechanism of filtering system calls and not discuss which system calls are dangerous. For an explanation of that, I suggest Lizzie’s blog or Docker’s documentation.
Demo
For this project, we will be using the syscallz crate, a seccomp library for Rust.
In the following example, we will try to limit the getpid system call. In the library, Context::init_with_action, ctx.set_action_for_syscall, and ctx.load() are wrappers around seccomp_init, seccomp_rule_add, and seccomp_load, respectively.
use libc::getpid;
use syscallz::{Action, Context, Syscall};

fn main() {
    println!("pid (first attempt):, {}", unsafe { getpid() });
    match Context::init_with_action(Action::Allow) {
        Ok(mut ctx) => {
            ctx.set_action_for_syscall(Action::Errno(100), Syscall::getpid)
                .unwrap();
            ctx.load().unwrap();
        }
        Err(e) => {
            println!("Failed to init with action: {:?}", e);
        }
    }
    println!("pid (second attempt):, {}", unsafe { getpid() });
}
Compiling and executing the code above yields the following output, where -100 corresponds to the error code 100 that we configured:
pid (first attempt):, 6613
pid (second attempt):, -100
Implementation
For my project, I disabled the same set of syscalls that Lizzie’s container implementation disables. Here is the implementation:
const DISABLED_SYSCALLS: [Syscall; 9] = [
    Syscall::keyctl,
    Syscall::add_key,
    Syscall::request_key,
    Syscall::ptrace,
    Syscall::mbind,
    Syscall::migrate_pages,
    Syscall::set_mempolicy,
    Syscall::userfaultfd,
    Syscall::perf_event_open,
];

fn syscalls() -> ContainerResult {
    let s_isuid: u64 = Mode::S_ISUID.bits().into();
    let s_isgid: u64 = Mode::S_ISGID.bits().into();
    let clone_newuser = CloneFlags::CLONE_NEWUSER.bits() as u64;
    // Each tuple: (Syscall, argument_idx, value). 0 would be the first argument index.
    let conditional_syscalls = [
        (Syscall::fchmod, 1, s_isuid),
        (Syscall::fchmod, 1, s_isgid),
        (Syscall::fchmodat, 2, s_isuid),
        (Syscall::fchmodat, 2, s_isgid),
        (Syscall::unshare, 0, clone_newuser),
        (Syscall::clone, 0, clone_newuser),
        // TODO: ioctl causes an error when running /bin/ash somehow...
        // (Syscall::ioctl, 1, TIOCSTI),
    ];
    match Context::init_with_action(Action::Allow) {
        Ok(mut ctx) => {
            for syscall in DISABLED_SYSCALLS {
                if let Err(err) = ctx.set_action_for_syscall(Action::Errno(0), syscall) {
                    return Err(ContainerError::DisableSyscall);
                };
            }
            for (syscall, arg_idx, bit) in conditional_syscalls {
                if let Err(err) = ctx.set_rule_for_syscall(
                    Action::Errno(1000),
                    syscall,
                    &[Comparator::new(arg_idx, Cmp::MaskedEq, bit, Some(bit))],
                ) {
                    return Err(ContainerError::DisableSyscall);
                }
            }
            if let Err(err) = ctx.load() {
                return Err(ContainerError::DisableSyscall);
            };
        }
        Err(err) => {
            return Err(ContainerError::DisableSyscall);
        }
    }
    Ok(())
}
seccomp_rule_add_array allows developers to filter a syscall based on specific argument values by providing comparators. Here is the code I used to perform the conditional filters:
ctx.set_rule_for_syscall(
    Action::Errno(1000),
    syscall,
    &[Comparator::new(arg_idx, Cmp::MaskedEq, bit, Some(bit))],
)
For example, to error out when unshare is invoked with the CLONE_NEWUSER bit set, we can provide a Comparator to set_rule_for_syscall like this:
let clone_newuser = CloneFlags::CLONE_NEWUSER.bits() as u64;
ctx.set_rule_for_syscall(
    Action::Errno(1000),
    Syscall::unshare,
    &[Comparator::new(0, Cmp::MaskedEq, clone_newuser, Some(clone_newuser))],
);
Testing the Implementation
Now, let’s test whether our implementation works. In this test, we will confirm that performing unshare works without the CLONE_NEWUSER flag but fails with the CLONE_NEWUSER flag.
First, let’s confirm that unshare works when no flags are set. Here is the unshare_test program:
use nix::sched::{unshare, CloneFlags};

fn main() {
    match unshare(CloneFlags::empty()) {
        Ok(_) => println!("Unshared success!"),
        Err(e) => println!("Error: {:?}", e),
    }
}
After compiling the binary for unshare_test, we need to copy the executable into the alpine directory before running the program in the container.
# inside the unshare_test repo
RUSTFLAGS="-C target-feature=+crt-static" cargo build --target="aarch64-unknown-linux-gnu"
cp target/aarch64-unknown-linux-gnu/debug/unshare_test /home/brianshih/alpine
# navigate to mini-container repo
sudo target/debug/mini-container /unshare_test /home/brianshih/alpine
# Unshared success!
Based on the output of running the executable in the container environment, we’ve confirmed that unshare works when no flags are set.
Now, let’s see what happens when we perform unshare with the CLONE_NEWUSER flag, using the following code:
use nix::sched::{unshare, CloneFlags};

fn main() {
    match unshare(CloneFlags::CLONE_NEWUSER) {
        Ok(_) => println!("Unshared success!"),
        Err(e) => println!("Error: {:?}", e),
    }
}
After compiling and copying the executable to the target root filesystem, I ran the executable in the container environment:
sudo target/debug/mini-container /unshare_test /home/brianshih/alpine
# Error: UnknownErrno
Based on the output, we’ve confirmed that the seccomp filter blocks unshare when the CLONE_NEWUSER flag is set.
To check which Seccomp mode a process is in and how many seccomp filters it has, you can run grep Seccomp /proc/{pid}/status as follows:
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
# ...
# Child pid: Pid(6381)
# Host system
grep Seccomp /proc/6381/status
# Seccomp: 2
# Seccomp_filters: 1
Here, we can see that Seccomp is in filter mode (mode 2) and that there is one filter, since our code only initializes and loads one filter.
Additional Resources
Intro to Seccomp and Seccomp-bpf
Capabilities
Goal
We want to granularly control and limit the privileges of processes within a container.
Theory
Traditionally, processes run either with the full set of privileges of the root user or with the limited privileges of the process’s user and groups. However, sometimes a program run by an unprivileged user needs to make privileged calls. One way to allow that is to set the setuid bit on the file, which causes the file to be executed with the privileges of the file’s owner (often root). This makes the program susceptible to privilege-escalation attacks.
Linux Capabilities are introduced as a mechanism that allows a process to perform privileged operations without being granted superuser access. Rather than a single privilege, the superuser privilege is divided into distinct units known as capabilities.
Rules of Capabilities
In Linux, both processes and files (executables) can have capabilities. So what capabilities are granted when a file is executed by a process? For that, we first need to introduce the concept of capability sets.
Each process stores 5 different sets of capabilities (based on the “Thread capability sets” section in the Linux doc):
- Effective: The kernel will run permission checks against effective capabilities. If the capability for a privileged operation is not set, a permission error will be thrown.
- Permitted: superset for the effective capabilities. The process can transition it to the effective set dynamically.
- Inheritable: capabilities inside the inheritable set will be added to the permitted set when a program is executed via the execve syscall
- Bounding: the superset of all the capabilities. If a capability is not inside the bounding set, it is not allowed
- Ambient: a set of capabilities preserved across an execve call of a non-privileged program. No capability can be ambient if it is not both permitted and inheritable.
Here are the rules from the Linux doc for how the capability sets transform across execve calls (P is a process set before execve, P' the set after, and F a file set):

P'(ambient) = (file is privileged) ? 0 : P(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable)
If a user wants to execute a file that needs capability X
, the user needs X to be inside P'(effective)
. In the 2 demos below, we will demonstrate how we can achieve that for different types of files.
Demo
Demo 1: Gaining Capabilities from Executables
One of the Linux Capabilities is CAP_NET_BIND_SERVICE
, which determines whether a process can bind a socket to an Internet domain privileged port (port number less than 1024).
To start, I’ve created a Rust project with the following code. All this code snippet does is that it tries to create a TcpListener
and bind it to a privileged address (80).
use std::net::TcpListener;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:80").unwrap();
    println!("TcpListener bound to 127.0.0.1:80. Accepting incoming connection");
    let _ = listener.accept();
}
When we run this code, we will get this error:
Error: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
This is because normal processes have 0 capabilities. To verify this, we can look at /proc/$$/status
to see that the CAP_NET_BIND_SERVICE
bit is not in CapEff
.
grep Cap /proc/$$/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=...cap_net_bind_service,cap_net_broadcast...
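The values printed by grep Cap are hex bitmasks in which bit n corresponds to capability number n from linux/capability.h (CAP_NET_BIND_SERVICE is capability 10). Here is a tiny sketch of what capsh --decode does under the hood (the function name decode_cap_mask is hypothetical):

```rust
// Decode a /proc/<pid>/status capability mask (hex string) into the set
// bit positions. Each bit position is a capability number from
// <linux/capability.h>; e.g. CAP_NET_BIND_SERVICE is capability 10.
fn decode_cap_mask(hex: &str) -> Vec<u32> {
    let mask = u64::from_str_radix(hex, 16).unwrap_or(0);
    (0u32..64).filter(|bit| mask & (1u64 << bit) != 0).collect()
}

fn main() {
    // 0x400 has only bit 10 set: cap_net_bind_service.
    println!("{:?}", decode_cap_mask("0000000000000400")); // [10]
    // The full bounding set above has bits 0..=40 set (41 capabilities).
    println!("{}", decode_cap_mask("000001ffffffffff").len()); // 41
}
```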
Now, let’s think about how we can grant the capability to the process running the file.
Firstly, the file is clearly not capability-aware. Capability-aware programs are programs that understand and manipulate capabilities through calls to the libcap library.
Therefore, in order for the CAP_NET_BIND_SERVICE
capability to be inside the thread’s effective capability set after the execve
call, one way is to add the capability to the file’s effective set and permitted set.
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
If F(effective)
is valid, we can perform the following algebra:
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
⇒ P'(effective) = P'(permitted)
⇒ P'(effective) = (F(permitted) & cap_bset)
Since the capability is inside F(effective)
and F(permitted)
, it will also be inside P'(effective)
.
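The transformation rules above can be modeled as plain bit operations. Here is a toy sketch of my own (not part of the container implementation) that ignores the root/privileged special cases and applies just the two equations we used:

```rust
// A toy model of how capability sets transform across execve, following
// the two rules quoted from capabilities(7). Sets are u64 bitmasks; the
// root/privileged special cases are intentionally ignored.
#[derive(Debug, PartialEq)]
struct Caps {
    permitted: u64,
    effective: u64,
}

fn execve_caps(
    p_inheritable: u64,
    p_ambient: u64,
    bounding: u64,
    f_permitted: u64,
    f_inheritable: u64,
    f_effective: bool,
) -> Caps {
    // P'(permitted) = (P(inheritable) & F(inheritable))
    //               | (F(permitted) & cap_bset) | P'(ambient)
    let permitted = (p_inheritable & f_inheritable) | (f_permitted & bounding) | p_ambient;
    // P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
    let effective = if f_effective { permitted } else { p_ambient };
    Caps { permitted, effective }
}

fn main() {
    let net_bind = 1u64 << 10; // CAP_NET_BIND_SERVICE is capability 10
    // Demo 1: the capability is granted through F(permitted) + F(effective).
    println!("via file caps: {:?}", execve_caps(0, 0, !0, net_bind, 0, true));
    // Demo 2: the capability flows in through the ambient set instead.
    println!("via ambient: {:?}", execve_caps(net_bind, net_bind, !0, 0, 0, false));
}
```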
Now let’s try setting the CAP_NET_BIND_SERVICE
to the file and re-run it.
sudo setcap 'cap_net_bind_service=+ep' target/debug/hello_world
getcap target/debug/hello_world
# target/debug/hello_world cap_net_bind_service=ep
target/debug/hello_world
# TcpListener bound to 127.0.0.1:80. Accepting incoming connection
To grant a capability, we use the setcap command. To verify that the capability is set, we use the getcap command. After setting the capability, we can bind the TcpListener to port 80.
Demo 2: Capability-aware files
Ideally, we would like to create an environment that doesn’t require giving the process root user privileges or granting the file capabilities.
Let’s look at this equation again:
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
If we don’t set the F(effective)
bit, then we need to ensure that P'(ambient)
contains the capability bit. To do that, we need to create a capability-aware file. Capability-aware files can use the prctl calls to add capabilities to capability sets.
For example, prctl
with arguments of PR_CAP_AMBIENT
PR_CAP_AMBIENT_RAISE
can add capabilities to the ambient set. According to prctl’s Linux doc, PR_CAP_AMBIENT_RAISE
adds the capability specified in arg3 to the ambient set, and “the specified capability must already be present in both the permitted and the inheritable sets of the process”.
As a result, we need to add the capability to the inheritable set of the thread before adding it to the ambient set of the thread. We will add the capability to F(permitted) via setcap, since a process cannot raise new capabilities into its own permitted set with prctl (the permitted set only gains capabilities across an execve call).
Here is the set-ambient
program (inspired by this blog) to do that:
use std::{env, ffi::CString};
use nix::unistd::execve;

fn set_ambient() {
    caps::raise(
        None,
        caps::CapSet::Inheritable,
        caps::Capability::CAP_NET_BIND_SERVICE,
    )
    .unwrap();
    caps::raise(
        None,
        caps::CapSet::Ambient,
        caps::Capability::CAP_NET_BIND_SERVICE,
    )
    .unwrap();
}

fn main() {
    let args: Vec<String> = env::args().collect();
    set_ambient();
    println!("CAP_NET_BIND_SERVICE is in ambient capabilities. Executing file.");
    if let Err(e) = execve::<CString, CString>(&CString::new(args[1].clone()).unwrap(), &[], &[]) {
        println!("Failed to execve: {:?}", e);
    }
}
We use the caps crate to set the capabilities. The call caps::raise(None, Ambient, CAP_NET_BIND_SERVICE)
is a wrapper around the Linux call prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, NET_BIND_SERVICE)
.
As specified earlier, the capability must be present in both the permitted and the inheritable sets of the process. Therefore, we use sudo setcap
to add the capability to the permitted set of the file.
After setting the capability bit for NET_BIND_SERVICE
to the permission capability set of the file, let’s run /bin/bash
with the set-ambient
program. We can check the capability sets of the process via grep Cap /proc/$$/status
and see that the effective bits for the process are 0000000000000400
. Finally, we can use capsh --decode
to confirm that cap_net_bind_service
is in the process’s effective set.
sudo setcap cap_net_bind_service+p target/debug/set-ambient
target/debug/set-ambient /bin/bash
# CAP_NET_BIND_SERVICE is in ambient capabilities. Executing file.
grep Cap /proc/$$/status
# CapInh: 0000000000000400
# CapPrm: 0000000000000400
# CapEff: 0000000000000400
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000400
capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service
Finally, we can run the TcpListener program again, and this time we can bind the listener to port 80.
target/debug/set-ambient ../tcp_example/target/debug/tcp_example
# TcpListener bound to 127.0.0.1:80. Accepting incoming connection
Implementation
My implementation takes in a list of capabilities to add and a list of capabilities to drop. If ALL
is specified in cap-drop
, then all capabilities are dropped.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
--cap-drop ALL
--cap-add NET_BIND_SERVICE CAP_SETUID
Here is the pseudocode for the implementation:
- For each capability to drop, drop it. If the capability specified is ALL, loop through every capability in the bounding set and drop it unless it’s inside the capabilities to add.
- Loop through the capabilities to add and add each one to the inheritable set and the ambient set.
Here is the actual code:
static CAPABILITIES: phf::Map<&'static str, Capability> = phf_map! {
    "NET_BIND_SERVICE" => caps::Capability::CAP_NET_BIND_SERVICE,
    "SETUID" => caps::Capability::CAP_SETUID,
    "CAP_SYS_TIME" => caps::Capability::CAP_SYS_TIME,
};

fn capabilities(config: &ChildConfig) -> ContainerResult {
    // compute the list of capabilities to add
    let caps_add: Vec<Capability> = match &config.cap_add {
        Some(cap_add) => {
            let mut res = vec![];
            for c in cap_add.iter() {
                match CAPABILITIES.get(c) {
                    Some(c) => {
                        res.push(c.clone());
                    }
                    None => {
                        return Err(ContainerError::CapabilityAdd);
                    }
                }
            }
            res
        }
        None => vec![],
    };
    // if ALL is inside the capabilities to drop, then drop all capabilities except
    // for the ones inside capabilities to add
    if let Some(caps) = &config.cap_drop {
        if caps.contains(&String::from("ALL")) {
            let bounding_caps = caps::read(None, caps::CapSet::Bounding).unwrap();
            for cap in bounding_caps.iter() {
                if !caps_add.contains(cap) {
                    if let Err(e) = caps::drop(None, caps::CapSet::Bounding, *cap) {
                        return Err(ContainerError::CapabilityDrop);
                    }
                }
            }
        } else {
            for c in caps.iter() {
                match CAPABILITIES.get(c) {
                    Some(c) => {
                        if let Err(e) = caps::drop(None, caps::CapSet::Bounding, *c) {
                            return Err(ContainerError::CapabilityDrop);
                        }
                        if let Err(e) = caps::drop(None, caps::CapSet::Inheritable, *c) {
                            return Err(ContainerError::CapabilityDrop);
                        }
                    }
                    None => {
                        return Err(ContainerError::CapabilityDrop);
                    }
                }
            }
        }
    }
    for cap in caps_add.iter() {
        if let Err(e) = caps::raise(None, caps::CapSet::Inheritable, *cap) {
            return Err(ContainerError::CapabilityAdd);
        }
        if let Err(e) = caps::raise(None, caps::CapSet::Ambient, *cap) {
            return Err(ContainerError::CapabilityAdd);
        }
    }
    Ok(())
}
Testing the Implementation
Let’s first confirm that dropping all capabilities and adding NET_BIND_SERVICE
works.
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
--cap-drop ALL
--cap-add NET_BIND_SERVICE
# Child pid: Pid(6517)
# ...
# host system
grep Cap /proc/6517/status
# CapInh: 0000000000000400
# CapPrm: 0000000000000400
# CapEff: 0000000000000400
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000400
capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service
Next, I built a Rust program with this code. All it does is print out the capability sets of the process and call setresuid, which is only permitted if the SETUID capability is set.
use nix::unistd::{setresuid, Uid};

fn main() {
    println!("Effective {:?}", caps::read(None, caps::CapSet::Effective));
    println!("Bounding {:?}", caps::read(None, caps::CapSet::Bounding));
    println!(
        "Inherited {:?}",
        caps::read(None, caps::CapSet::Inheritable)
    );
    println!("Permitted {:?}", caps::read(None, caps::CapSet::Permitted));
    println!("Ambient {:?}", caps::read(None, caps::CapSet::Ambient));
    if let Err(e) = setresuid(Uid::from_raw(10), Uid::from_raw(10), Uid::from_raw(10)) {
        println!("Failed to setuid: {:?}", e);
    }
    println!("Finished");
}
Next, let’s compile it and copy it to the alpine directory. Then we run the program in the container. We get an EPERM
error. If we look at the logged lines, we can see that CAP_SETUID
is not in the effective set of the process.
# compile it
RUSTFLAGS="-C target-feature=+crt-static" cargo build --target="aarch64-unknown-linux-gnu"
# copy it to the alpine directory
cp target/aarch64-unknown-linux-gnu/debug/setuid_example /home/brianshih/alpine
sudo target/debug/mini-container /setuid_example /home/brianshih/alpine
# Effective Ok({})
# Bounding Ok({CAP_SETGID, CAP_AUDIT_WRITE, CAP_SYS_RESOURCE, CAP_SETFCAP, CAP_BLOCK_SUSPEND, CAP_SYS_TTY_CONFIG, CAP_AUDIT_CONTROL, CAP_SYS_NICE, CAP_CHOWN, CAP_LEASE, CAP_MAC_OVERRIDE, CAP_FOWNER, CAP_BPF, CAP_SYS_BOOT, CAP_WAKE_ALARM, CAP_NET_BIND_SERVICE, CAP_IPC_OWNER, CAP_NET_BROADCAST, CAP_PERFMON, CAP_FSETID, CAP_SYS_ADMIN, CAP_SYSLOG, CAP_LINUX_IMMUTABLE, CAP_KILL, CAP_NET_ADMIN, CAP_DAC_READ_SEARCH, CAP_SYS_CHROOT, CAP_SYS_PACCT, CAP_SYS_RAWIO, CAP_SETUID, CAP_NET_RAW, CAP_AUDIT_READ, CAP_CHECKPOINT_RESTORE, CAP_SYS_TIME, CAP_MKNOD, CAP_SYS_PTRACE, CAP_MAC_ADMIN, CAP_DAC_OVERRIDE, CAP_IPC_LOCK, CAP_SETPCAP, CAP_SYS_MODULE})
# Inherited Ok({})
# Permitted Ok({})
# Ambient Ok({})
# Failed to setuid: EPERM
However, if we rerun the program with --cap-add SETUID
, the program runs without error. If we look at the logged lines, we can see that CAP_SETUID
is in the effective capability set of the process.
sudo target/debug/mini-container /setuid_example /home/brianshih/alpine
--cap-add SETUID
# Effective Ok({CAP_SETUID})
# Bounding Ok({CAP_SETFCAP, CAP_BPF, CAP_MKNOD, CAP_CHOWN, CAP_SETUID, CAP_SYS_TIME, CAP_FSETID, CAP_NET_ADMIN, CAP_SYS_CHROOT, CAP_LINUX_IMMUTABLE, CAP_IPC_LOCK, CAP_SYS_NICE, CAP_SYS_RAWIO, CAP_SETGID, CAP_KILL, CAP_DAC_OVERRIDE, CAP_CHECKPOINT_RESTORE, CAP_SYS_PACCT, CAP_SYS_PTRACE, CAP_MAC_ADMIN, CAP_WAKE_ALARM, CAP_AUDIT_WRITE, CAP_MAC_OVERRIDE, CAP_LEASE, CAP_SYS_RESOURCE, CAP_IPC_OWNER, CAP_FOWNER, CAP_SYS_MODULE, CAP_BLOCK_SUSPEND, CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_PERFMON, CAP_SYSLOG, CAP_NET_RAW, CAP_SYS_ADMIN, CAP_NET_BROADCAST, CAP_SYS_TTY_CONFIG, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_DAC_READ_SEARCH, CAP_SYS_BOOT})
# Inherited Ok({CAP_SETUID})
# Permitted Ok({CAP_SETUID})
# Ambient Ok({CAP_SETUID})
Additional Resources
- Linux capabilities - why they exist and how they work
- Linux capabilities in practice
- Redhat blog - Linux Capabilities
User Namespace
Goal
The best way to prevent privilege-escalation attacks from within a container is to run the container’s executable as an unprivileged user. However, some applications require the process to run as a root
user. Therefore, our goal is to set up an environment such that the user is privileged within the container but unprivileged to the host system.
Theory
User Namespaces isolate security-related identifiers. According to Linux’s doc, “a process’s user and group IDs can be different inside and outside a namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace”.
The user namespace is what enables a container to run as a root
user within a container but have unprivileged access outside the container, which prevents privilege-escalation attacks.
An important property of user namespaces is that they are nested. Apart from the root namespace, each namespace has a parent namespace, which is the user namespace of the process that created it via a call to unshare or clone with the CLONE_NEWUSER flag.
User mappings
User mappings are what allow a process's user IDs to be different inside and outside a namespace.
When a user namespace is created, it starts without a mapping of user IDs to the parent user namespace. The /proc/pid/uid_map file, which is written from the parent user namespace, maps user IDs inside the child user namespace to user IDs inside the parent user namespace.
Each line in the uid_map
takes the form:
ID-in-child-ns ID-in-parent-ns length
ID-in-child-ns
, ID-in-parent-ns
, and length
specify that a range of user IDs of length
starting from ID-in-child-ns
are mapped to a range of user IDs of length
in the parent user namespace starting with ID-in-parent-ns
.
For example, a line of 0 1000 1 means that the user with user ID 0 in the child user namespace maps to the user with user ID 1000 in the parent user namespace.
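The mapping rule can be sketched as a small lookup function (a hypothetical helper of mine, not part of the container code):

```rust
// Translate a user ID in the child namespace to the parent namespace,
// given uid_map entries of the form (id_in_child, id_in_parent, length).
fn map_uid(uid_map: &[(u32, u32, u32)], child_uid: u32) -> Option<u32> {
    for &(child_start, parent_start, length) in uid_map {
        if child_uid >= child_start && child_uid < child_start + length {
            return Some(parent_start + (child_uid - child_start));
        }
    }
    None // unmapped IDs appear as the overflow user (65534) in practice
}

fn main() {
    // The single-line mapping "0 1000 1" from the example above.
    let uid_map = [(0, 1000, 1)];
    println!("{:?}", map_uid(&uid_map, 0)); // Some(1000)
    println!("{:?}", map_uid(&uid_map, 5)); // None (unmapped)
}
```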
Demo
In the demo below, we first create a user namespace with the -U
flag. According to the Linux doc, an unmapped User ID is converted to the overflow user ID which is 65534
. This is why the uid=65534
. However, when we check the User ID
for the process via ps -o 'pid uid user command' -a
, we can see that the UID is 1000
, the same User as the parent process’s user.
After retrieving the pid
of the child process via echo $$
, we write 0 1000 1
into the uid_map
of the parent user namespace. We then check the user ID
of the child process and now see 0
.
Even though the User ID of the child process is 0
, if we check the uid
from the parent user namespace, it’s still 1000. We have successfully mapped the original user ID of 1000
to 0
in the new user namespace.
id
# uid=1000(brianshih) gid=1000(brianshih) groups=1000(brianshih) ...
unshare -U
id
# uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
echo $$
# 10087
# host system
ps -o 'pid uid user command' -a
# PID UID USER COMMAND
# 10087 1000 briansh+ -bash
echo '0 1000 1' > /proc/10087/uid_map
# in the child process
id
# uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
# host system
ps -o 'pid uid user command' -a
# PID UID USER COMMAND
# 10087 1000 briansh+ -bash
Implementation
The implementation is split into two portions:
- user_ns: creating a new user namespace in the child process
- handle_child_uid_map: updating the uid_map in the parent process
We can see in the run method below that the handle_child_uid_map method runs after create_child_process. This is because, to write to the uid_map, the parent process needs the newly created process’s pid.
fn run() -> ContainerResult {
    ...
    let child_pid = create_child_process(&config)?;
    handle_child_uid_map(child_pid, parent_socket.as_raw_fd(), config.user_id.clone())?;
    ...
}
Here is the code to create a new user namespace with unshare
. After creating the new user namespace, the child process notifies the parent process that the child process has created a new user namespace. Next, the child process waits until the parent updates the uid_map
before using setresuid
to set the new user_id
, which is likely 0
.
fn user_ns(config: &ChildConfig) -> ContainerResult {
    if let Err(e) = unshare(CloneFlags::CLONE_NEWUSER) {
        println!("Failed to unshare with new user namespace: {:?}", e);
        return Err(ContainerError::UnshareNewUser);
    }
    // Notifies the parent process that the child process has created a new user namespace
    socket_send(config.socket_fd)?;
    // Wait for the parent process to update the uid_map before setting the uid
    socket_recv(config.socket_fd)?;
    if let Some(user_id) = config.user_id {
        println!("Setting UID to: {:?}", config.user_id);
        if let Err(e) = setresuid(
            Uid::from_raw(user_id),
            Uid::from_raw(user_id),
            Uid::from_raw(user_id),
        ) {
            println!("Failed to set uid. Error: {:?}", e);
            return Err(ContainerError::SetResuid);
        };
    }
    Ok(())
}
Here is the code for how the parent updates the uid_map. It first waits for the child to create a user namespace via socket_recv. It then writes to the uid_map file and the gid_map file. Finally, it uses socket_send to notify the child that the uid_map is updated.
fn handle_child_uid_map(pid: Pid, fd: i32, user_id: Option<u32>) -> ContainerResult {
    // Wait for the child to create a user namespace
    socket_recv(fd)?;
    let user_id = match user_id {
        Some(id) => id,
        None => 0, // default to run as root if no user ID is provided
    };
    println!("Updating uid_map");
    match File::create(format!("/proc/{}/{}", pid.as_raw(), "uid_map")) {
        Ok(mut uid_map) => {
            if let Err(e) = uid_map.write_all(format!("0 {} {}", 1000, 1000).as_bytes()) {
                println!("Failed to write to uid_map. Error: {:?}", e);
                return Err(ContainerError::UidMap);
            }
        }
        Err(e) => {
            println!("Failed to create uid_map. Error: {:?}", e);
            return Err(ContainerError::UidMap);
        }
    }
    match File::create(format!("/proc/{}/{}", pid.as_raw(), "gid_map")) {
        Ok(mut gid_map) => {
            if let Err(e) = gid_map.write_all(format!("0 {} {}", 1000, 1000).as_bytes()) {
                println!("Failed to write to gid_map. Error: {:?}", e);
                return Err(ContainerError::UidMap);
            }
        }
        Err(e) => {
            println!("Failed to create gid_map. Error: {:?}", e);
            return Err(ContainerError::UidMap);
        }
    }
    println!("Finished updating uid_map. Notifying child process");
    // Notify the child that the uid_map is updated
    socket_send(fd)?;
    Ok(())
}
Testing the Implementation
We create the container environment and set the new user ID to 0 via --user 0
. We then confirm that id
inside the container is 0
.
In the host system, we confirm that the process used to run the executable, /bin/ash
has a UID of 1000
. This confirms that the uid_mapping
worked.
id
# uid=1000(brianshih) ...
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine --user 0
id
# uid=0(root)
# host system
ps -o 'pid uid user command' -a
# PID UID USER COMMAND
# 10074 0 root target/debug/mini-container /bin/ash /home/brianshih/alpi
# 10075 1000 briansh+ /bin/ash
Additional Resources
- Blog about docker security - user namespace
- Namespaces in operation, part 5: User namespaces
- Docker blog - Isolate containers with a user namespace
- Demo of how user namespace works with Docker
Resource Restrictions
Goal
We want to limit and isolate resource usage such as CPU, memory, disk I/O, network, etc in a container.
Theory
Cgroups is a Linux kernel feature that allows developers to control how much of a given key resource (CPU, memory, etc) a process or a set of processes can access.
According to the Linux doc, the grouping of processes is provided through a pseudo-filesystem called cgroupfs
. A cgroup is a collection of processes bound to a set of limits defined via the cgroup filesystem.
Each cgroup has a kernel component called a subsystem, also known as a resource controller.
Different subsystems limit different resources, such as the CPU time and memory available to a cgroup. To create a cgroup, you create a directory inside the cgroup
filesystem:
mkdir /sys/fs/cgroup/cg1
Each file inside the cgroup
directory corresponds to a different resource that can be limited. For example, the cgroup
below contains files such as memory.max
which limits the memory a cgroup can access.
ls /sys/fs/cgroup/cg1
# cgroup.controllers cpuset.cpus.partition memory.max
# cgroup.events cpuset.mems memory.min
# cgroup.freeze cpuset.mems.effective memory.numa_stat
# cgroup.kill cpu.stat memory.oom.group
# cgroup.max.depth cpu.uclamp.max memory.peak
# cgroup.max.descendants cpu.uclamp.min memory.pressure
# cgroup.pressure cpu.weight memory.reclaim
# ... many more
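Configuring a cgroup amounts to writing values into these files. Here is a minimal sketch (the helper name set_cgroup_limit is hypothetical; on a real system cgroup_dir would be a directory under /sys/fs/cgroup and writing would require root, so this demo writes into a temp directory instead):

```rust
use std::fs;
use std::path::Path;

// Writing a limit into a cgroup is just writing a value into the
// corresponding file inside the cgroup directory.
fn set_cgroup_limit(cgroup_dir: &Path, file: &str, value: &str) -> std::io::Result<()> {
    fs::write(cgroup_dir.join(file), value)
}

fn main() -> std::io::Result<()> {
    // Use a temp directory so the sketch runs without root or cgroupfs.
    let dir = std::env::temp_dir().join("cg_demo");
    fs::create_dir_all(&dir)?;
    set_cgroup_limit(&dir, "pids.max", "5")?;
    set_cgroup_limit(&dir, "memory.max", "1073741824")?; // 1 GiB
    println!("pids.max = {}", fs::read_to_string(dir.join("pids.max"))?);
    Ok(())
}
```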
Demo
In this demo (inspired by Michael Kerrisk’s tech talk), we will create a cgroup and set pids.max
to 5 and confirm that the process can only run 5 tasks at max.
sudo bash
cd /sys/fs/cgroup/
# we create a cgroup called foo
mkdir foo
# add the current process to the created cgroup
echo $$ > foo/cgroup.procs
# confirm that the current process belongs to the foo cgroup
cat /proc/$$/cgroup
# 0::/foo
# set the maximum number of tasks at once
echo 5 > /sys/fs/cgroup/foo/pids.max
for i in {1..5}; do sleep 1 & done
# [1] 8379
# [2] 8380
# [3] 8381
# [4] 8382
# bash: fork: retry: Resource temporarily unavailable
After creating a new cgroup called foo and adding the current process to it, we set pids.max to 5. Next, we execute for i in {1..5}; do sleep 1 & done and see that forking the 5th sleep 1 fails: the shell itself counts toward the limit, so only 4 additional processes can be created.
Implementation
There are many resources that we can choose to limit. For my toy container implementation, I will only limit the memory
and max_pids
. In the implementation, we will use the cgroup-rs crate, a Rust library for managing cgroups.
Note that limiting the resources is performed by the parent process after the child process is created. This is because we need the child process’s pid
so that we can add it to the cgroup
.
fn run() -> ContainerResult {
    ...
    let child_pid = create_child_process(&config)?;
    resources(&config, child_pid)?;
    ...
}
The code for limiting resources is simple. We create a new cgroup
with the config.hostname
as its name. We then write to the corresponding resource’s file before adding the pid
to the created cgroup
.
fn resources(config: &ChildConfig, pid: Pid) -> ContainerResult {
    println!("Restricting resource!");
    let mut cg_builder = CgroupBuilder::new(&config.hostname);
    if let Some(memory_limit) = config.memory {
        println!("Setting memory limit to: {:?}", memory_limit);
        cg_builder = cg_builder.memory().memory_hard_limit(memory_limit).done();
    }
    if let Some(max_pids) = config.max_pids {
        cg_builder = cg_builder
            .pid()
            .maximum_number_of_processes(cgroups_rs::MaxValue::Value(max_pids))
            .done();
    }
    let cg = cg_builder.build(Box::new(V2::new()));
    let pid: u64 = pid.as_raw() as u64;
    if let Err(e) = cg.add_task(CgroupPid::from(pid)) {
        println!("Failed to add task to cgroup. Error: {:?}", e);
        return Err(ContainerError::CgroupPidErr);
    };
    Ok(())
}
Testing the Implementation
This is the code snippet we will use to test whether limiting the number of pids in a cgroup works. This is basically a Rust implementation of our demo earlier: for i in {1..5}; do sleep 1 & done
.
use std::thread;
use std::time::Duration;

fn main() {
    for i in 1..=5 {
        thread::spawn(move || {
            println!("Thread {} started", i);
            thread::sleep(Duration::from_secs(1));
            println!("Thread {} completed", i);
        });
    }
    // Sleep for a while to allow threads to finish.
    thread::sleep(Duration::from_secs(2));
}
When we run the executable, we get a Resource temporarily unavailable error. If we note the container’s hostname and check /sys/fs/cgroup/mini-JoYUGNc/pids.max, we can see that it’s 5. We can also check which cgroup the child process is in to verify that it was added to the cgroup correctly.
sudo target/debug/mini-container /sleep_test /home/brianshih/alpine
--nproc 5
# thread 'main' panicked at 'failed to spawn thread: Os
# { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
hostname
# hostname of child process: mini-JoYUGNc
# host system
cat /sys/fs/cgroup/mini-JoYUGNc/pids.max
# 5
# pid of child process is 8428
cat /proc/8428/cgroup
# 0::/mini-OhMDCDW
Next, we run the same command without the --nproc 5
option:
sudo target/debug/mini-container /sleep_test /home/brianshih/alpine
This time, it ran successfully, confirming that our cgroup implementation worked.
Additional Resources
Blog: What are Namespaces and cgroups?
Blog: Deep into Containers (Namespace & CGroups)
Future Work
There are a lot of container features I would love to explore in the future, such as:
- networking: I would love to learn how systems like Docker enable containers to communicate with each other and outside the world.
- image layers: I would love to learn how Docker images work under the hood and how the Docker engine's cache ensures the efficient creation of Docker images.
- hacking a container: I would love to learn how to attack a container to understand container guarantees better.