Preface

I’ve been using Docker containers for many years, but I’ve always treated them as magical black boxes. I know that Docker uses container runtimes like runc (by default) to create isolated environments to run code. However, I didn’t know what “isolated” really means. To unveil the black box, I decided to implement a toy container runtime from scratch in Rust.

Luckily, there are a ton of tutorials and resources online to learn from. My implementation is largely based on two blogs in particular: Linux Containers in 500 Lines of Code & Writing a Container in Rust. As someone who knew very little about Linux, I found the experience of building a container extremely eye-opening and rewarding.

Here is a summary of what we will build:

  • root filesystem isolation with mount namespace
  • resource restriction with cgroups
  • limit syscalls with seccomp
  • isolate user IDs and group IDs with user namespace and uid mapping
  • privilege control with capabilities

In this blog series, I will cover the theory behind and the implementation of a container from the perspective of someone new to Linux. I will also provide as many demos as possible to demonstrate how the Linux primitives that make up a container work.

The full source code is available here.

What exactly is a Container?

The concept of containers is rooted in Linux. Check out this Red Hat blog about the history of containers. When people talk about containers, they are more or less talking about Linux containers.

However, the Linux Kernel doesn’t have a native object that represents a “container”. From the perspective of the kernel, containers are just processes. But what makes these processes special?

The best way to look at the properties of a process in a container is to look at some demos with the help of Docker, a tool that can create and run containers.

Filesystem Isolation

Firstly, a process in a container has an isolated view of the filesystem. In the demo below, we created a container based on the ubuntu image.

If we navigate to the root directory via cd /, we notice that the root filesystem of the process in a container is not the same one as the root filesystem on the host system. Modifying the root filesystem within the container will have no impact on the host system.

docker run -it ubuntu bash
cd /
ls
# bin  boot  dev  etc  home  lib  media  mnt  opt  proc
#  root  run  sbin  srv  sys  tmp  usr  var

# host system
cd /
ls
# bin    dev   lib         mnt   opt   run   srv       tmp
# boot   etc   lost+found  proc  sbin  swapfile  usr
# cdrom  home  media       root  snap  sys       var

The new root filesystem comes from the ubuntu image. A Docker image is made up of filesystem layers stacked on top of each other; these layers form the base for a container’s root filesystem.

Pid Isolation

Processes in a container have an isolated view of other processes running on the host. In the example below, if we perform ps -a -u to list all processes in the container, we only see the process running bash and ps -a -u. However, if we perform ps -a -u on the host system, we see a lot more processes.

Furthermore, in the example below the process perceives its pid as 1. However, from the perspective of the host system, the pid of the process running bash is 6098.

docker run -it ubuntu bash
ps -a -u
# USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
# root           1  0.2  0.0   4136  3200 pts/0    Ss   07:19   0:00 bash
# root           9  0.0  0.0   6412  2432 pts/0    R+   07:19   0:00 ps -a -u
echo $$
# 1

# host system
ps -a -u
# USER     PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
# ...
# root    6098  0.0  0.0   4136  3200 pts/0    Ss+  15:19   0:00 bash

User ID Isolation

Processes in a container have an isolated view of things like user IDs and group IDs. This enables a process to run as different users inside and outside the container.

In the example below, we enable the user namespace via --userns-remap=default. The process in the container perceives its uid as 0. But if we look at the user corresponding to the process from the host system, the user is 165536.

sudo dockerd --userns-remap=default
sudo docker run -it --rm busybox /bin/sh
id
# uid=0(root) gid=0(root) groups=0(root),10(wheel)

# host system
ps -a -u
# USER        PID   %CPU %MEM    VSZ   RSS TTY     STAT  START TIME  COMMAND
# ...
# 165536     14154  0.0  0.0   3984  1920 pts/0    Ss+  14:33   0:00 /bin/sh

Resource Restriction

In Docker, you can constrain resources that the container can access. For example, you can limit the amount of memory the process can take, the number of CPUs the container can run on, etc. Check out Docker’s doc for the full list of resources that can be constrained.

As an example, here is how you can limit the container to a memory limit of 128 MB.

docker run -it --memory 128m ubuntu bash

Secret behind Containers

The secret behind how a container can provide the isolation properties demonstrated above boils down to the following Linux primitives:

  • Namespaces
  • Capabilities
  • cgroups

We will cover these in greater detail throughout the blog!

API

Before we talk about the theory and implementation behind containers, let’s first look at the API for my toy container.

At the core, the mini-container program takes two arguments: an executable program and a directory that points to a root filesystem. It creates a process, sets up the container environment for the process, and executes the executable program in this container.

Here are the arguments and options to execute my toy container.

mini-container [OPTIONS] <COMMAND> <ROOT_FILESYSTEM_PATH>

Arguments:

  • <COMMAND> Command to execute
  • <ROOT_FILESYSTEM_PATH> Absolute path to the new root filesystem

Options:

-p, --pid <PID> Set the pid for child process

-m, --memory <MEMORY> Memory limit (megabytes)

--nproc <NPROC> Max pids allowed

-u, --user <USER> Set the User ID for child process

--cap-add <CAP_ADD> Add Linux capabilities to the container environment

--cap-drop <CAP_DROP> Drop Linux capabilities from the container environment. Specify “ALL” to drop all

-h, --help

Examples

Running an interactive bash shell

To run an interactive bash shell in the container environment, you first need to set up a directory that will serve as the root filesystem for the container. This is equivalent to an image in Docker, which contains a minimal OS. For all my demos, I will be using Alpine’s Mini Root Filesystem image.

First, we download the image and extract it into the alpine directory.

cd /home/brianshih
# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine

Next, we can launch the container and execute /bin/ash. Note that the alpine directory will become the new root filesystem.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine

Here is the rough equivalent command in docker:

docker run -it alpine /bin/ash

Limiting resources in the container

You can run a container with limited memory and limited process capacity via the --nproc and --memory options.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
	--nproc 5 --memory 1048

Here is the rough equivalent command in docker - though unlike my implementation, nproc in Docker sets the maximum number of processes available to a user, not to a container.

docker run --memory="1048m" --ulimit nproc=5 IMAGE

Dropping and Adding Linux Capabilities

Here is how you can drop all the Linux capabilities and add the NET_BIND_SERVICE capability. Note that for my toy implementation, I only support 3 capabilities (so far). It’s extremely trivial to add them but my goal isn’t to build a production-level container so I stopped whenever I felt like I understood how they work.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
	--cap-drop ALL \
	--cap-add NET_BIND_SERVICE

Here is the rough equivalent command in docker:

docker run --cap-drop all --cap-add NET_BIND_SERVICE alpine

Setting the User ID

Here is how you can set the user ID for the process.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine --user 0

Here is the rough equivalent command in docker:

docker run --rm --user $UID:$GID alpine ash

Project Overview

Here is the higher-level setup for this project:

  • parse the command line arguments
  • create the child process
  • set up the namespaces, capabilities, and syscall restrictions
  • execute the program

Parse the command line arguments

To parse the command line arguments, we use the clap crate. Here is the struct representation of the parsed arguments:

#[derive(Parser)]
struct Cli {
    /// Command to execute
    command: String,

    /// Absolute path to new root filesystem
    root_filesystem_path: String,

    /// Optional pid for child process
    #[arg(short, long)]
    pid: Option<u32>,

    /// Memory limit (megabytes)
    #[arg(short, long)]
    memory: Option<i64>,

    /// Max pids allowed
    #[arg(long)]
    nproc: Option<i64>,

    /// Optional user ID for child process
    #[arg(short, long)]
    user: Option<u32>,

    /// Add capabilities to the bounding set
    #[clap(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_add: Option<Vec<String>>,

    /// Remove capabilities from the bounding set, or all if the String provided is "ALL"
    #[clap(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_drop: Option<Vec<String>>,
}

The entry point of the project is the run method. All we have to do is call Cli::parse() to parse the arguments:

fn main() {
    if let Err(_) = run() {
        cleanup();
        exit(-1);
    }
}

fn run() -> ContainerResult {
    let cli = Cli::parse();
    ...
}

Create the child process

Since a container is just a process, we need to create the child process for the container. The create_child_process function is responsible for that.

fn run() -> ContainerResult {
    let cli = Cli::parse();

    ...
    let child_pid = create_child_process(&config)?;
    if let Err(e) = waitpid(child_pid, None) {
        return Err(ContainerError::WaitPid);
    };
    Ok(())
}

After creating the child process, we need to make sure the parent process doesn't terminate until the child process completes. We use the waitpid call to make sure of that.

Here is the implementation for create_child_process:

// Creates a child process with clone and runs the executable file
// with execve in the child process.
fn create_child_process(config: &ChildConfig) -> Result<Pid, ContainerError> {
    let mut flags = CloneFlags::empty();
    flags.insert(CloneFlags::CLONE_NEWNS);
    flags.insert(CloneFlags::CLONE_NEWCGROUP);
    flags.insert(CloneFlags::CLONE_NEWPID);
    flags.insert(CloneFlags::CLONE_NEWIPC);
    flags.insert(CloneFlags::CLONE_NEWNET);
    flags.insert(CloneFlags::CLONE_NEWUTS);
    let mut stack = [0; STACK_SIZE];
    let clone_res = unsafe {
        clone(
            Box::new(|| match child(config) {
                Ok(_) => 0,
                Err(_) => -1,
            }),
            &mut stack,
            flags,
            Some(Signal::SIGCHLD as i32),
            // If the signal SIGCHLD is ignored, waitpid will hang until the
            // child exits and then fail with code ECHILD.
        )
    };

    match clone_res {
        Ok(pid) => {
            println!("Child pid: {:?}", pid);
            Ok(pid)
        }
        Err(_) => Err(ContainerError::Clone),
    }
}

It uses clone to create the child process, passing flags such as CLONE_NEWNS and CLONE_NEWPID in order to create the different namespaces (mount, pid, network, etc.) necessary for isolation. We will cover these namespaces in more detail later.

The Linux clone method takes a function argument. When the function returns, the child process terminates. The function we pass to clone is the child method whose responsibility is to set up the container environment and execute the user-provided program.

Set up the namespaces, capabilities, and syscall restrictions & Execute the program

Here is the implementation of child:

// setup the namespaces, capabilities, syscall restrictions before running the executable
fn child(config: &ChildConfig) -> ContainerResult {
    set_hostname(config)?;
    isolate_filesystem(config)?;
    user_ns(config)?;
    capabilities(config)?;
    syscalls()?;
    match execve::<CString, CString>(&config.exec_path, &config.args, &[]) {
        Ok(_) => Ok(()),
        Err(e) => {
            println!("Failed to execute!: {:?}", e);
            Err(ContainerError::Execve)
        }
    }
}

Before using execve to execute the user-provided program, we set up the container environment for the execution by isolating the filesystem, setting up the user namespace, granting and taking away capabilities, and restricting syscalls.

Summary

To summarize, the project contains these core methods:

  • run: parses the command line arguments. Creates the child process and waits until the child process terminates
  • create_child_process: uses clone to create the child process, passing child as the function argument to clone
  • child: sets up the container environment before executing the user-provided program with execve

For the rest of this blog, we will focus on learning how we can set up the container environment for the process. For each component of the container environment, we will break it down into:

  • Goal
  • Theory
  • Demo
  • Implementation
  • Testing the Implementation

Isolate Filesystem

Goal

We want to provide a process with an isolated view of the filesystem. In other words, we want to ensure the process cannot touch any files and directories from the host’s filesystem.

Theory

A filesystem is an organized collection of files and directories. Each directory can be backed by a different filesystem. This is the power of the UNIX filesystem abstraction - all directories and files from all filesystems reside under a single directory tree.

To attach a filesystem to a directory, we use the mount command. The directory that we mount to is also known as the mount point.

$ mount device directory

To isolate the filesystem, we need to ensure that the process cannot have access to or modify any mounts of the host system. This is achieved with the help of the mount namespace.

Mount Namespace

According to Linux’s doc, “mount namespaces provide isolation of the list of mounts seen by the processes in each namespace instance. All of the processes that reside in the same mount namespace will see the same view in these files”. Each mount namespace has its own set of mount points, and modifications to the mount points in one namespace do not affect other namespaces.

A new mount namespace can be created using either clone or unshare with the CLONE_NEWNS flag. There are a few things to keep in mind - if the namespace is created from clone, the parent process’s mount namespace will be copied to the child namespace; if the namespace is created from unshare, the caller’s previous mount namespace will be copied to the child namespace. This means that modifying files or directories in a newly created mount namespace can affect the host system.

To achieve isolation, we can use umount to tear down the root mount. This will not affect the mount list seen on the host system, because modifications to the mount list (via mount and umount) do not propagate to other mount namespaces.

However, unmounting the root filesystem is usually not allowed because open files in the root filesystem would prevent the unmount. And even if we managed to unmount it, the process would be left unusable, since it could no longer load executables or access devices.

Instead, what we want is to swap out the root filesystem with a new filesystem that contains the minimal required system files and libraries. This is where pivot_root comes in.

pivot_root

pivot_root is a system call that allows us to change the root mount in the mount namespace of the calling process. It takes two directories as arguments - new_root and put_old and it “moves the root mount to the directory put_old and makes new_root the new root mount.” The put_old directory must be at or underneath new_root.

$ pivot_root new_root put_old

Here are the steps to use pivot_root to achieve filesystem isolation for the container:

  • create the new_root directory that will become the new root filesystem. An empty root filesystem is useless, so we need to put any necessary files to run the application into the new_root directory.
    • But how do you determine the “necessary files” to run an application? This is where Docker images become useful - Docker images can be thought of as an archive of root filesystems. We can download an image (like alpine) and extract it into the new_root directory. An image like alpine does not contain an entire OS, just an essential set of Alpine’s files.
  • create a put_old directory inside the new_root directory.
  • create a new mount namespace with unshare
  • bind mount new_root onto itself, since Linux requires new_root to be a mount point before the root filesystem can be changed
  • use pivot_root to make new_root the new root filesystem. The put_old directory now points to the original root filesystem.
  • unmount the put_old filesystem and remove the put_old directory.

After those steps, we have an isolated filesystem. Don’t worry if this seems a bit abstract, I will walk through this in detail in Demo 2 below.

Demo

Demo 1: mount namespace

First, let’s demonstrate that within a mount namespace, mounting or unmounting a filesystem wouldn’t affect other namespaces.

mkdir /tmp/ex
mkdir /tmp/ex/one
sudo unshare -m /bin/bash
mount -t tmpfs tmpfs /tmp/ex
ls /tmp/ex
# empty
mkdir /tmp/ex/foo
ls /tmp/ex
# foo

# From the host system
ls /tmp/ex
# one

In the example above, we created a directory /tmp/ex and a directory one under it.

Next, we created a new mount namespace with unshare and mounted a tmpfs filesystem onto /tmp/ex.

At this point, /tmp/ex is backed by a new filesystem. We can confirm that it’s no longer related to the host’s /tmp/ex: listing the directory with ls no longer shows the one directory we created earlier.

To show that modifications to the mounted filesystem have no impact on the host system, we created a directory foo under /tmp/ex and ran ls /tmp/ex to confirm that foo is inside the directory.

Now when we check what’s inside /tmp/ex from the host system, we only see the original one directory and not the foo directory. This confirms that mounting a filesystem won’t affect other namespaces.

As a side note, if you ever want to see which processes are inside which mount namespace, you can use the ps command or inspect /proc/self/ns/mnt as below:

sudo unshare -m /bin/bash

echo $$
# 6766
ps -o pid,mntns,args

# PID    MNTNS     COMMAND
# 6765 4026531841 sudo unshare -m /bin/bash
# 6766 4026532469 /bin/bash
# 6772 4026532469 ps -o pid,mntns,args

ls -l /proc/self/ns/mnt
# lrwxrwxrwx 1 root root 0 Dec 19 16:13 /proc/self/ns/mnt -> 'mnt:[4026532469]'

Demo 2: isolate filesystem with pivot_root

Earlier, we outlined the steps to use pivot_root to achieve filesystem isolation. Let’s put that into practice. We will be using Alpine’s mini root filesystem image as the new root filesystem. Here are the commands:

# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine
cd alpine
echo > I_AM_ALPINE.txt

# create the mount namespace
sudo unshare -m
# make the new_root directory a mount point
mount --bind alpine alpine
# create the put_old directory
mkdir alpine/oldrootfs
cd alpine
# swap out the root filesystem
pivot_root . oldrootfs

cd /
ls
# I_AM_ALPINE.txt  bin       etc       lib       mnt       opt       root      sbin      sys       usr
# dev              home      media     oldrootfs proc      run       srv       tmp       var

ls /oldrootfs/
# bin         cdrom       etc         lib         media       old         opt         root        sbin        srv         sys         usr
# boot        dev         home        lost+found  mnt         old2        proc        run         snap        swapfile    tmp         var

umount -l oldrootfs/
rmdir oldrootfs/

We first download the alpine image and extract the alpine image into a newly created alpine directory that will serve as the new_root in pivot_root. Next, we create a mount point from the alpine directory.

Next, we need to create the put_old directory for pivot_root under the alpine directory, which is alpine/oldrootfs. Finally, we use pivot_root to swap out the root filesystem.

If we navigate to the root directory via cd /, we can verify that the root directory is indeed the Alpine filesystem (it contains I_AM_ALPINE.txt). However, we can still see the oldrootfs directory, which points to the original root filesystem. Therefore, we unmount it and remove the directory so that the container is fully isolated from the host’s original filesystem.

We can also verify the mount points in the host system as follows:

# host system. 10920 is the pid of the process with the isolated filesystem
cat /proc/10920/mounts
# /dev/vda2 / ext4 rw,relatime,errors=remount-ro 0 0

cat /proc/10920/mountinfo
# 1066 985 252:2 /home/brianshih/alpine / rw,relatime - ext4 /dev/vda2 rw,errors=remount-ro

We can see that the process with the isolated filesystem only has one mount point whose root is /home/brianshih/alpine.

Implementation

The implementation for my toy container is more-or-less just Demo 2 in the form of Rust code.

Here is a helper function that wraps the mount system call. Note that, according to mount(2), if only the target directory is provided, mount modifies an existing mount point.

// Wrapper around the mount syscall
fn mount_filesystem(
    filesystem_path: Option<&PathBuf>,
    target_directory: &PathBuf,
    flags: Vec<MsFlags>,
) -> ContainerResult {
    let mut mountflags = MsFlags::empty();
    for flag in flags {
        mountflags.insert(flag);
    }
    match mount::<PathBuf, PathBuf, PathBuf, PathBuf>(
        filesystem_path,
        target_directory,
        None,
        mountflags,
        None,
    ) {
        Ok(_) => Ok(()),
        Err(err) => {
            return Err(ContainerError::MountSysCall);
        }
    }
}

Here is the code that isolates the filesystem.

fn isolate_filesystem(config: &ChildConfig) -> ContainerResult {
    mount_filesystem(
        None,
        &PathBuf::from("/"),
        vec![MsFlags::MS_REC, MsFlags::MS_PRIVATE],
    )?;
    let root_filesystem_path = &config.root_filesystem_directory;
    let filesystem_path = PathBuf::from(root_filesystem_path);
    mount_filesystem(
        Some(&filesystem_path),
        &filesystem_path,
        vec![MsFlags::MS_BIND, MsFlags::MS_PRIVATE],
    )?;
    let old_root_path = "oldrootfs";
    let old_root_absolute_path = PathBuf::from(format!("{root_filesystem_path}/{old_root_path}"));
    if let Err(e) = create_dir_all(&old_root_absolute_path) {
        return Err(ContainerError::CreateDir);
    }

    if let Err(e) = pivot_root(&filesystem_path, &PathBuf::from(old_root_absolute_path)) {
        return Err(ContainerError::PivotRoot);
    };
    if let Err(e) = umount2(
        &PathBuf::from(format!("/{old_root_path}")),
        MntFlags::MNT_DETACH,
    ) {
        return Err(ContainerError::Umount);
    }
    if let Err(e) = remove_dir(&PathBuf::from(format!("/{old_root_path}"))) {
        return Err(ContainerError::RemoveDir);
    };
    // Change the directory to the root directory
    if let Err(e) = chdir(&PathBuf::from("/")) {
        return Err(ContainerError::ChangeDir);
    };
    Ok(())
}

Something we didn’t cover is the propagation type of a mount point. Each mount point is one of four types: MS_SHARED, MS_PRIVATE, MS_SLAVE, and MS_UNBINDABLE. Mount points of type MS_SHARED are shared across different mounts of the same peer group (learn more about peer groups here). Mount points of type MS_PRIVATE do not propagate events to their peers.

In my code snippet, we recursively set all mount points in the root filesystem to MS_PRIVATE to make sure that no events are propagated to other mount namespaces.

Apart from that, the code is fairly straightforward and reproduces what we did in Demo 2.

Testing the Implementation

To test whether the process has an isolated filesystem, we first create the alpine directory which will serve as the directory for the new root.

Next, we create the container environment and run /bin/ash via sudo target/debug/mini-container /bin/ash /home/brianshih/alpine. /home/brianshih/alpine is the path to the new root filesystem for our container.

After that, we navigate to the root directory and confirm that the root filesystem is the one created from the alpine directory.

# download the alpine image
wget https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/aarch64/alpine-minirootfs-3.19.0-aarch64.tar.gz
# create the new_root directory
mkdir alpine
# extract the alpine image into the new_root directory
tar -xvf alpine-minirootfs-3.19.0-aarch64.tar.gz -C alpine
cd alpine
echo > I_AM_ALPINE.txt

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
cd /
ls
# I_AM_ALPINE.txt  lib              root             tmp
# bin              media            run              usr
# dev              mnt              sbin             var
# etc              opt              srv
# home             proc             sys

Limit Syscalls

Goal

Restrict the set of system calls that the running process can make, to protect the host system.

Theory

Certain system calls may pose security risks or impact the host system. Seccomp (Secure Computing Mode) is a Linux feature that allows developers to filter system calls to the kernel. Seccomp operates in two modes:

  • Strict: a minimal set of syscalls is allowed
  • Filter: allows developers to define custom policies for which syscalls are permitted

Seccomp filters are expressed as Berkeley Packet Filters (BPF) programs. These filters can be used to allow or deny system calls, as well as conditionally filter on system call arguments.

For this project, we will use seccomp in filter mode, via a Rust wrapper around the libseccomp API (seccomp_init, seccomp_rule_add, seccomp_load).

Each rule in a seccomp filter maps to an action, which can be one of:

  • SCMP_ACT_KILL: kill the thread
  • SCMP_ACT_KILL_PROCESS: kill the process
  • SCMP_ACT_TRAP: send a SIGSYS signal to the thread
  • SCMP_ACT_ERRNO: return the specified error code
  • SCMP_ACT_TRACE: notify an attached tracer
  • SCMP_ACT_LOG: log the syscall, then allow it
  • SCMP_ACT_ALLOW: allow the syscall
  • SCMP_ACT_NOTIFY: notify a monitoring process

In short, Seccomp allows us to set rules that determine what happens when certain system calls are invoked. Seccomp is a powerful tool. But knowing which system calls to filter out is the tricky part. In this blog, I will focus only on the mechanism of filtering system calls and not discuss which system calls are dangerous. For an explanation of that, I suggest Lizzie’s blog or Docker’s documentation.

Demo

For this project, we will be using the syscallz crate, a seccomp library for Rust.

In the following example, we block the getpid system call. In the library, Context::init_with_action, ctx.set_action_for_syscall, and ctx.load() are wrappers around seccomp_init, seccomp_rule_add, and seccomp_load, respectively.

use libc::getpid;
use syscallz::{Action, Context, Syscall};

fn main() {
    println!("pid (first attempt):, {}", unsafe { getpid() });

    match Context::init_with_action(Action::Allow) {
        Ok(mut ctx) => {
            ctx.set_action_for_syscall(Action::Errno(100), Syscall::getpid)
                .unwrap();
            ctx.load().unwrap();
        }
        Err(e) => {
            println!("Failed to init with action: {:?}", e);
        }
    }

    println!("pid (second attempt):, {}", unsafe { getpid() });
}

Compiling and executing the code above yields the following output, where -100 is the error code we configured.

pid (first attempt):, 6613
pid (second attempt):, -100

Implementation

For my project, I disabled the same set of syscalls that Lizzie’s implementation of container disables. Here is the implementation:

const DISABLED_SYSCALLS: [Syscall; 9] = [
    Syscall::keyctl,
    Syscall::add_key,
    Syscall::request_key,
    Syscall::ptrace,
    Syscall::mbind,
    Syscall::migrate_pages,
    Syscall::set_mempolicy,
    Syscall::userfaultfd,
    Syscall::perf_event_open,
];

fn syscalls() -> ContainerResult {
    let s_isuid: u64 = Mode::S_ISUID.bits().into();
    let s_isgid: u64 = Mode::S_ISGID.bits().into();
    let clone_newuser = CloneFlags::CLONE_NEWUSER.bits() as u64;

    // Each tuple: (SysCall, argument_idx, value). 0 would be the first argument index.
    let conditional_syscalls = [
        (Syscall::fchmod, 1, s_isuid),
        (Syscall::fchmod, 1, s_isgid),
        (Syscall::fchmodat, 2, s_isuid),
        (Syscall::fchmodat, 2, s_isgid),
        (Syscall::unshare, 0, clone_newuser),
        (Syscall::clone, 0, clone_newuser),
        // TODO: ioctl causes an error when running /bin/ash somehow...
        // (Syscall::ioctl, 1, TIOCSTI),
    ];
    match Context::init_with_action(Action::Allow) {
        Ok(mut ctx) => {
            for syscall in DISABLED_SYSCALLS {
                if let Err(err) = ctx.set_action_for_syscall(Action::Errno(0), syscall) {
                    return Err(ContainerError::DisableSyscall);
                };
            }

            for (syscall, arg_idx, bit) in conditional_syscalls {
                if let Err(err) = ctx.set_rule_for_syscall(
                    Action::Errno(1000),
                    syscall,
                    &[Comparator::new(arg_idx, Cmp::MaskedEq, bit, Some(bit))],
                ) {
                    return Err(ContainerError::DisableSyscall);
                }
            }

            if let Err(err) = ctx.load() {
                return Err(ContainerError::DisableSyscall);
            };
        }
        Err(err) => {
            return Err(ContainerError::DisableSyscall);
        }
    }
    Ok(())
}

seccomp_rule_add_array allows developers to filter a syscall based on specific argument values by providing a comparator. Here is the code I used to perform conditional filters:

ctx.set_rule_for_syscall(
    Action::Errno(1000),
    syscall,
    &[Comparator::new(arg_idx, Cmp::MaskedEq, bit, Some(bit))],
)

For example, to error out when unshare is invoked with the clone_newuser bit set, we can provide a Comparator to set_rule_for_syscall like this:

let clone_newuser = CloneFlags::CLONE_NEWUSER.bits() as u64;
ctx.set_rule_for_syscall(
    Action::Errno(1000),
    Syscall::unshare,
    &[Comparator::new(0, Cmp::MaskedEq, clone_newuser, Some(clone_newuser))],
);

Testing the Implementation

Now, let’s test whether our implementation works. In this test, we will confirm that performing unshare works without the CLONE_NEWUSER flag but fails with the CLONE_NEWUSER flag.

First, let’s confirm that unshare works when there are no flags set. Here is the unshare_test program:

use nix::sched::{unshare, CloneFlags};

fn main() {
    match unshare(CloneFlags::empty()) {
        Ok(_) => println!("Unshared success!"),
        Err(e) => println!("Error: {:?}", e),
    }
}

After compiling the binary for unshare_test, we need to copy the executable into the alpine directory before running the program in the container.

# inside the unshare_test repo
RUSTFLAGS="-C target-feature=+crt-static" cargo build --target="aarch64-unknown-linux-gnu"
cp target/aarch64-unknown-linux-gnu/debug/unshare_test /home/brianshih/alpine

# navigate to mini-container repo
sudo target/debug/mini-container /unshare_test /home/brianshih/alpine
# Unshared success!

Based on the output of running the executable in the container environment, we’ve confirmed that unshare works when there are no flags set.

Now, let’s see what happens when we perform unshare with the CLONE_NEWUSER flag, using the following code:

use nix::sched::{unshare, CloneFlags};

fn main() {
    match unshare(CloneFlags::CLONE_NEWUSER) {
        Ok(_) => println!("Unshared success!"),
        Err(e) => println!("Error: {:?}", e),
    }
}

After compiling and copying the executable to the target root filesystem, I ran the executable in the container environment:

sudo target/debug/mini-container /unshare_test /home/brianshih/alpine
# Error: UnknownErrno

Based on the output, we have confirmed that the filter works: unshare with the CLONE_NEWUSER flag now fails. The error shows up as UnknownErrno because our filter makes the syscall fail with errno 1000, which doesn’t correspond to any named errno.

To check a process’s Seccomp mode and the number of attached seccomp filters, you can run grep Seccomp /proc/{pid}/status as follows:

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine
# ...
# Child pid: Pid(6381)

# Host system
grep Seccomp /proc/6381/status
# Seccomp:	2
# Seccomp_filters:	1

Here, we can see that Seccomp is in filter mode (mode 2) and that there is one filter, since our code initializes and loads only one filter.
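For completeness, here is a small sketch of extracting those two fields from the contents of /proc/{pid}/status (the status_field helper is hypothetical, written just for this illustration):

```rust
// Extract the value of a "Key:\tvalue" line from /proc/<pid>/status content.
fn status_field<'a>(status: &'a str, key: &str) -> Option<&'a str> {
    status
        .lines()
        .find(|&l| l.starts_with(key) && l[key.len()..].starts_with(':'))
        .map(|l| l[key.len() + 1..].trim())
}

fn main() {
    // Abbreviated /proc/<pid>/status content for a seccomp-filtered process.
    let status = "Name:\tash\nSeccomp:\t2\nSeccomp_filters:\t1\n";
    assert_eq!(status_field(status, "Seccomp"), Some("2")); // mode 2 = filter mode
    assert_eq!(status_field(status, "Seccomp_filters"), Some("1"));
    println!("ok");
}
```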

Additional Resources

Intro to Seccomp and Seccomp-bpf

Mozilla wiki - Seccomp

Capabilities

Goal

We want to granularly control and limit the privileges of processes within a container.

Theory

Traditionally, processes run with either a full set of privileges granted by the root user or with a limited set of privileges granted by the process’s user and groups. However, sometimes a program needs to be run by an unprivileged user but make privileged calls. One way to allow that is to set the suid bit on the file, which will cause the file to be executed by the user who owns the file. This makes the program susceptible to privilege escalation attacks.
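As an aside, the set-user-ID bit is the 0o4000 bit of a file’s mode (S_ISUID in <sys/stat.h>); a minimal sketch of checking it (the is_setuid helper is hypothetical):

```rust
// The set-user-ID bit in a Unix file mode (st_mode in struct stat).
const S_ISUID: u32 = 0o4000;

// Returns true if a file with this mode runs as the file's owner.
fn is_setuid(mode: u32) -> bool {
    mode & S_ISUID != 0
}

fn main() {
    // A setuid-root binary such as passwd typically has mode 0o104755.
    assert!(is_setuid(0o104755));
    // An ordinary executable (0o100755) does not have the bit set.
    assert!(!is_setuid(0o100755));
    println!("ok");
}
```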

Linux Capabilities are introduced as a mechanism that allows a process to perform privileged operations without being granted superuser access. Rather than a single privilege, the superuser privilege is divided into distinct units known as capabilities.

Rules of Capabilities

In Linux, both processes and files (executables) can have capabilities. So what capabilities are granted when a file is executed by a process? For that, we need to first introduce the concept of capabilities set.

Each process stores 5 different sets of capabilities (based on the “Thread capability sets” section in the Linux doc):

  • Effective: the set the kernel checks when a process attempts a privileged operation. If the capability required for the operation is not in this set, a permission error is thrown.
  • Permitted: a superset of the effective capabilities. The process can dynamically move capabilities from this set into the effective set.
  • Inheritable: capabilities that may be preserved across an execve call: a capability in this set is added to the permitted set after execve if it is also in the file’s inheritable set.
  • Bounding: a limiting superset of all the capability sets. If a capability is not inside the bounding set, it can never be acquired.
  • Ambient: a set of capabilities preserved across an execve of an unprivileged program. No capability can be ambient if it is not both permitted and inheritable.

Here are the rules from the Linux doc describing how the capability sets transform across execve calls (P denotes the process’s sets before execve, P' the sets after, and F the file’s sets):

P'(ambient)     = (file is privileged) ? 0 : P(ambient)
P'(permitted)   = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable)

If a user wants to execute a file that needs capability X, the user needs X to be inside P'(effective). In the 2 demos below, we will demonstrate how we can achieve that for different types of files.

Demo

Demo 1: Gaining Capabilities from Executables

One of the Linux Capabilities is CAP_NET_BIND_SERVICE, which determines whether a process can bind a socket to an Internet domain privileged port (port number less than 1024).

To start, I’ve created a Rust project with the following code. All this code snippet does is try to create a TcpListener and bind it to a privileged port (80).

use std::net::TcpListener;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:80").unwrap();
    println!("TcpListener bound to 127.0.0.1:80. Accepting incoming connection");
    listener.accept().unwrap();
}

When we run this code, we will get this error:

Error: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }

This is because processes run by an unprivileged user start with no capabilities. To verify this, we can look at /proc/$$/status and see that the CAP_NET_BIND_SERVICE bit is not in CapEff.

grep Cap /proc/$$/status
# CapInh:	0000000000000000
# CapPrm:	0000000000000000
# CapEff:	0000000000000000
# CapBnd:	000001ffffffffff
# CapAmb:	0000000000000000
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=...cap_net_bind_service,cap_net_broadcast...
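capsh simply decodes the hex bitmask: bit n of the value corresponds to capability number n, and CAP_NET_BIND_SERVICE is capability 10 (so 0x400 = 1 << 10). A minimal sketch of the same check in plain Rust:

```rust
// Capability numbers from <linux/capability.h>: CAP_NET_BIND_SERVICE is 10.
const CAP_NET_BIND_SERVICE: u32 = 10;

// Returns true if capability number `cap` is set in a /proc status bitmask
// such as the hex value on the CapEff line.
fn has_cap(bitmask_hex: &str, cap: u32) -> bool {
    let mask = u64::from_str_radix(bitmask_hex, 16).unwrap();
    mask & (1u64 << cap) != 0
}

fn main() {
    // CapEff of an ordinary unprivileged process: no capabilities.
    assert!(!has_cap("0000000000000000", CAP_NET_BIND_SERVICE));
    // The full bounding set shown above includes cap_net_bind_service.
    assert!(has_cap("000001ffffffffff", CAP_NET_BIND_SERVICE));
    println!("ok");
}
```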

Now, let’s think about how we can grant capability to the process running the file.

Firstly, the file is clearly not capability-aware. Capability-aware programs are programs that understand and manipulate capabilities through libcap calls (which wrap the underlying capget and capset syscalls).

Therefore, in order for the CAP_NET_BIND_SERVICE capability to be inside the thread’s effective capability set after the execve call, one way is to add the capability to the file’s effective set and permitted set.

P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)

If F(effective) is set (and, as in our case, the inheritable and ambient sets are empty), we can perform the following algebra:

P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

P'(effective) = P'(permitted)

P'(effective) = (F(permitted) & cap_bset)

Since the capability is inside F(effective) and F(permitted), it will also be inside P'(effective).
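To make the algebra concrete, here is a bitmask sketch of the two transformation rules (illustrative only; it assumes empty inheritable and ambient sets and a full bounding set, which matches this demo):

```rust
// Capability bits, as in the CapEff/CapPrm bitmasks: CAP_NET_BIND_SERVICE is bit 10.
const CAP_NET_BIND_SERVICE: u64 = 1 << 10;

// P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
fn new_permitted(p_inh: u64, f_inh: u64, f_prm: u64, cap_bset: u64, new_amb: u64) -> u64 {
    (p_inh & f_inh) | (f_prm & cap_bset) | new_amb
}

// P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
fn new_effective(f_eff: bool, new_prm: u64, new_amb: u64) -> u64 {
    if f_eff { new_prm } else { new_amb }
}

fn main() {
    let cap_bset = u64::MAX; // treat the bounding set as full for this illustration
    // setcap 'cap_net_bind_service=+ep': the capability is in F(permitted), F(effective) is set.
    let p = new_permitted(0, 0, CAP_NET_BIND_SERVICE, cap_bset, 0);
    let e = new_effective(true, p, 0);
    assert_eq!(e & CAP_NET_BIND_SERVICE, CAP_NET_BIND_SERVICE);
    // Without the F(effective) bit, the capability is permitted but not effective.
    assert_eq!(new_effective(false, p, 0), 0);
    println!("ok");
}
```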

Now let’s try setting the CAP_NET_BIND_SERVICE to the file and re-run it.

sudo setcap 'cap_net_bind_service=+ep' target/debug/hello_world
getcap target/debug/hello_world
# target/debug/hello_world cap_net_bind_service=ep
target/debug/hello_world
# TcpListener bound to 127.0.0.1:80. Accepting incoming connection

To grant a capability, we use the setcap command. To verify that the capability is set, we use the getcap command. After setting the capability, we can bind the TcpListener to port 80.

Demo 2: Capability-aware files

Ideally, we would like to create an environment that doesn’t require giving the process root user privileges or granting the file capabilities.

Let’s look at this equation again:

P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

If we don’t set the F(effective) bit, then we need to ensure that P'(ambient) contains the capability bit. To do that, we need to create a capability-aware program. Capability-aware programs can use prctl calls to add capabilities to capability sets.

For example, prctl with the arguments PR_CAP_AMBIENT and PR_CAP_AMBIENT_RAISE can add capabilities to the ambient set. According to prctl’s Linux doc, PR_CAP_AMBIENT_RAISE adds the capability specified in arg3 to the ambient set, and “the specified capability must already be present in both the permitted and the inheritable sets of the process”.

As a result, we need to add the capability to the inheritable set of the thread before adding it to the ambient set of the thread. We will add the capability to F(permitted) manually since I can’t seem to add it with prctl directly (I’m still going through the docs to find out why this is happening!).
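The precondition quoted above can be modeled as a tiny predicate: raising a capability into the ambient set succeeds only if the capability is already in both the permitted and inheritable sets (a simplified model that ignores securebits):

```rust
// Simplified model of the check prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap)
// performs: the capability must already be permitted AND inheritable.
fn can_raise_ambient(permitted: u64, inheritable: u64, cap_bit: u64) -> bool {
    (permitted & cap_bit != 0) && (inheritable & cap_bit != 0)
}

fn main() {
    let cap_net_bind_service = 1u64 << 10;
    // setcap cap_net_bind_service+p alone is not enough: inheritable is still empty.
    assert!(!can_raise_ambient(cap_net_bind_service, 0, cap_net_bind_service));
    // After caps::raise(None, Inheritable, ...), the ambient raise succeeds.
    assert!(can_raise_ambient(
        cap_net_bind_service,
        cap_net_bind_service,
        cap_net_bind_service
    ));
    println!("ok");
}
```

This is exactly why the set-ambient program below raises the capability in the inheritable set before raising it in the ambient set.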

Here is the set-ambient program (inspired by this blog) to do that:

use std::{env, ffi::CString};

use nix::unistd::execve;

fn set_ambient() {
    caps::raise(
        None,
        caps::CapSet::Inheritable,
        caps::Capability::CAP_NET_BIND_SERVICE,
    )
    .unwrap();

    caps::raise(
        None,
        caps::CapSet::Ambient,
        caps::Capability::CAP_NET_BIND_SERVICE,
    )
    .unwrap();
}

fn main() {
    let args: Vec<String> = env::args().collect();
    set_ambient();

    println!("CAP_NET_BIND_SERVICE is in ambient capabilities. Executing file.");
    if let Err(e) = execve::<CString, CString>(&CString::new(args[1].clone()).unwrap(), &[], &[]) {
        println!("Failed to execve: {:?}", e);
    }
}

We use the caps crate to set the capabilities. The call caps::raise(None, Ambient, CAP_NET_BIND_SERVICE) is a wrapper around the Linux call prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, NET_BIND_SERVICE).

As specified earlier, the capability must be present in both the permitted and the inheritable sets of the process. Therefore, we use sudo setcap to add the capability to the permitted set of the file.

After setting the capability bit for NET_BIND_SERVICE in the permitted capability set of the file, let’s run /bin/bash with the set-ambient program. We can check the capability sets of the process via grep Cap /proc/$$/status and see that the effective bits for the process are 0000000000000400. Finally, we can use capsh --decode to confirm that cap_net_bind_service is in the process’s effective set.

sudo setcap cap_net_bind_service+p target/debug/set-ambient
target/debug/set-ambient /bin/bash
# CAP_NET_BIND_SERVICE is in ambient capabilities. Executing file.
grep Cap /proc/$$/status
# CapInh:	0000000000000400
# CapPrm:	0000000000000400
# CapEff:	0000000000000400
# CapBnd:	000001ffffffffff
# CapAmb:	0000000000000400
capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service

Finally, we can run the file with the TcpListener again, and this time we can bind the listener to port 80.

target/debug/set-ambient ../tcp_example/target/debug/tcp_example
# TcpListener bound to 127.0.0.1:80. Accepting incoming connection

Implementation

My implementation takes in a list of capabilities to add and a list of capabilities to drop. If ALL is specified in --cap-drop, then all capabilities except the ones explicitly added are dropped.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
	--cap-drop ALL \
	--cap-add NET_BIND_SERVICE SETUID

Here is the pseudocode for the implementation:

  • For each capability in the drop list, drop it from the bounding and inheritable sets; if ALL is specified, instead drop every capability in the bounding set that is not in the list of capabilities to add.
  • Loop through the capabilities to add and raise each one in the inheritable set and the ambient set.

Here is the actual code:

#![allow(unused)]
fn main() {
static CAPABILITIES: phf::Map<&'static str, Capability> = phf_map! {
    "NET_BIND_SERVICE" => caps::Capability::CAP_NET_BIND_SERVICE,
    "SETUID" => caps::Capability::CAP_SETUID,
    "CAP_SYS_TIME" => caps::Capability::CAP_SYS_TIME,
};

fn capabilities(config: &ChildConfig) -> ContainerResult {
    // compute the list of capabilities to add
    let caps_add: Vec<Capability> = match &config.cap_add {
        Some(cap_add) => {
            let mut res = vec![];
            for c in cap_add.iter() {
                match CAPABILITIES.get(c) {
                    Some(c) => {
                        res.push(c.clone());
                    }
                    None => {
                        return Err(ContainerError::CapabilityAdd);
                    }
                }
            }
            res
        }
        None => vec![],
    };

    // if ALL is inside the capabilities to drop, then drop all capabilities except
    // for the ones inside capabilities to add
    if let Some(caps) = &config.cap_drop {
        if caps.contains(&String::from("ALL")) {
            let bounding_caps = caps::read(None, caps::CapSet::Bounding).unwrap();
            for cap in bounding_caps.iter() {
                if !caps_add.contains(cap) {
                    if let Err(e) = caps::drop(None, caps::CapSet::Bounding, *cap) {
                        return Err(ContainerError::CapabilityDrop);
                    }
                }
            }
        } else {
            for c in caps.iter() {
                match CAPABILITIES.get(c) {
                    Some(c) => {
                        if let Err(e) = caps::drop(None, caps::CapSet::Bounding, *c) {
                            return Err(ContainerError::CapabilityDrop);
                        }
                        if let Err(e) = caps::drop(None, caps::CapSet::Inheritable, *c) {
                            return Err(ContainerError::CapabilityDrop);
                        }
                    }
                    None => {
                        return Err(ContainerError::CapabilityDrop);
                    }
                }
            }
        }
    }

    for cap in caps_add.iter() {
        if let Err(e) = caps::raise(None, caps::CapSet::Inheritable, *cap) {
            return Err(ContainerError::CapabilityAdd);
        }
        if let Err(e) = caps::raise(None, caps::CapSet::Ambient, *cap) {
            return Err(ContainerError::CapabilityAdd);
        }
    }
    Ok(())
}
}

Testing the Implementation

Let’s first confirm that dropping all capabilities and adding NET_BIND_SERVICE works.

sudo target/debug/mini-container /bin/ash /home/brianshih/alpine \
	--cap-drop ALL \
	--cap-add NET_BIND_SERVICE
# Child pid: Pid(6517)
# ...

# host system
grep Cap /proc/6517/status
# CapInh:	0000000000000400
# CapPrm:	0000000000000400
# CapEff:	0000000000000400
# CapBnd:	000001ffffffffff
# CapAmb:	0000000000000400
capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service

Next, I built a Rust program with this code. All it does is print out the capability sets of the process and call setresuid, which requires the CAP_SETUID capability when changing to a different user ID.

use nix::unistd::{setresuid, Uid};

fn main() {
    println!("Effective {:?}", caps::read(None, caps::CapSet::Effective));
    println!("Bounding {:?}", caps::read(None, caps::CapSet::Bounding));
    println!(
        "Inherited {:?}",
        caps::read(None, caps::CapSet::Inheritable)
    );
    println!("Permitted {:?}", caps::read(None, caps::CapSet::Permitted));
    println!("Ambient {:?}", caps::read(None, caps::CapSet::Ambient));

    if let Err(e) = setresuid(Uid::from_raw(10), Uid::from_raw(10), Uid::from_raw(10)) {
        println!("Failed to setuid: {:?}", e);
    }
    println!("Finished");
}

Next, let’s compile it and copy it to the alpine directory. Then we run the program in the container. We get an EPERM error. If we look at the logged lines, we can see that CAP_SETUID is not in the effective set of the process.

# compile it
RUSTFLAGS="-C target-feature=+crt-static" cargo build --target="aarch64-unknown-linux-gnu"
# copy it to the alpine directory
cp target/aarch64-unknown-linux-gnu/debug/setuid_example /home/brianshih/alpine
sudo target/debug/mini-container /setuid_example /home/brianshih/alpine
# Effective Ok({})
# Bounding Ok({CAP_SETGID, CAP_AUDIT_WRITE, CAP_SYS_RESOURCE, CAP_SETFCAP, CAP_BLOCK_SUSPEND, CAP_SYS_TTY_CONFIG, CAP_AUDIT_CONTROL, CAP_SYS_NICE, CAP_CHOWN, CAP_LEASE, CAP_MAC_OVERRIDE, CAP_FOWNER, CAP_BPF, CAP_SYS_BOOT, CAP_WAKE_ALARM, CAP_NET_BIND_SERVICE, CAP_IPC_OWNER, CAP_NET_BROADCAST, CAP_PERFMON, CAP_FSETID, CAP_SYS_ADMIN, CAP_SYSLOG, CAP_LINUX_IMMUTABLE, CAP_KILL, CAP_NET_ADMIN, CAP_DAC_READ_SEARCH, CAP_SYS_CHROOT, CAP_SYS_PACCT, CAP_SYS_RAWIO, CAP_SETUID, CAP_NET_RAW, CAP_AUDIT_READ, CAP_CHECKPOINT_RESTORE, CAP_SYS_TIME, CAP_MKNOD, CAP_SYS_PTRACE, CAP_MAC_ADMIN, CAP_DAC_OVERRIDE, CAP_IPC_LOCK, CAP_SETPCAP, CAP_SYS_MODULE})
# Inherited Ok({})
# Permitted Ok({})
# Ambient Ok({})
# Failed to setuid: EPERM
 

However, if we rerun the program with --cap-add SETUID, the program runs without error. If we look at the logged lines, we can see that CAP_SETUID is in the effective capability set of the process.

sudo target/debug/mini-container /setuid_example /home/brianshih/alpine \
		--cap-add SETUID
# Effective Ok({CAP_SETUID})
# Bounding Ok({CAP_SETFCAP, CAP_BPF, CAP_MKNOD, CAP_CHOWN, CAP_SETUID, CAP_SYS_TIME, CAP_FSETID, CAP_NET_ADMIN, CAP_SYS_CHROOT, CAP_LINUX_IMMUTABLE, CAP_IPC_LOCK, CAP_SYS_NICE, CAP_SYS_RAWIO, CAP_SETGID, CAP_KILL, CAP_DAC_OVERRIDE, CAP_CHECKPOINT_RESTORE, CAP_SYS_PACCT, CAP_SYS_PTRACE, CAP_MAC_ADMIN, CAP_WAKE_ALARM, CAP_AUDIT_WRITE, CAP_MAC_OVERRIDE, CAP_LEASE, CAP_SYS_RESOURCE, CAP_IPC_OWNER, CAP_FOWNER, CAP_SYS_MODULE, CAP_BLOCK_SUSPEND, CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_PERFMON, CAP_SYSLOG, CAP_NET_RAW, CAP_SYS_ADMIN, CAP_NET_BROADCAST, CAP_SYS_TTY_CONFIG, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_DAC_READ_SEARCH, CAP_SYS_BOOT})
# Inherited Ok({CAP_SETUID})
# Permitted Ok({CAP_SETUID})
# Ambient Ok({CAP_SETUID})

Additional Resources

User Namespace

Goal

The best way to prevent privilege-escalation attacks from within a container is to run the container’s executable as an unprivileged user. However, some applications require the process to run as a root user. Therefore, our goal is to set up an environment such that the user is privileged within the container but unprivileged to the host system.

Theory

User Namespaces isolate security-related identifiers. According to Linux’s doc, “a process’s user and group IDs can be different inside and outside a namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace”.

The user namespace is what enables a container to run as a root user within a container but have unprivileged access outside the container, which prevents privilege-escalation attacks.

An important property of user namespaces is that they are nested. Apart from the root namespace, each user namespace has a parent: the user namespace of the process that created it via a call to unshare or clone with the CLONE_NEWUSER flag.

User mappings

User mappings are what allow a process's user IDs to be different inside and outside a namespace.

When a user namespace is created, it starts without a mapping of user IDs to the parent user namespace. The mapping is defined by writing to /proc/pid/uid_map, which maps user IDs inside the child user namespace to user IDs inside the parent user namespace (and is typically written from the parent user namespace).

Each line in the uid_map takes the form:

ID-in-child-ns   ID-in-parent-ns   length

ID-in-child-ns, ID-in-parent-ns, and length specify that a range of length consecutive user IDs starting at ID-in-child-ns in the child user namespace maps to a range of length consecutive user IDs starting at ID-in-parent-ns in the parent user namespace.

For example, a line of 0 1000 1 means that the user with User ID 0 in the child user namespace maps to the user with User ID 1000 in the parent user namespace.
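The translation a uid_map line describes can be sketched as a range lookup (the UidMapEntry type and to_parent_uid helper are hypothetical, for illustration only):

```rust
// One parsed line of /proc/<pid>/uid_map.
struct UidMapEntry {
    id_in_child: u32,
    id_in_parent: u32,
    length: u32,
}

// Translate a child-namespace UID to the parent-namespace UID, if mapped.
fn to_parent_uid(map: &[UidMapEntry], child_uid: u32) -> Option<u32> {
    map.iter().find_map(|e| {
        if child_uid >= e.id_in_child && child_uid < e.id_in_child + e.length {
            Some(e.id_in_parent + (child_uid - e.id_in_child))
        } else {
            None
        }
    })
}

fn main() {
    // The line "0 1000 1": child UID 0 maps to parent UID 1000.
    let map = [UidMapEntry { id_in_child: 0, id_in_parent: 1000, length: 1 }];
    assert_eq!(to_parent_uid(&map, 0), Some(1000));
    // UID 1 is unmapped; the kernel shows such IDs as the overflow UID (65534).
    assert_eq!(to_parent_uid(&map, 1), None);
    println!("ok");
}
```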

Demo

In the demo below, we first create a user namespace with the -U flag. According to the Linux doc, an unmapped User ID is converted to the overflow user ID which is 65534. This is why the uid=65534. However, when we check the User ID for the process via ps -o 'pid uid user command' -a, we can see that the UID is 1000, the same User as the parent process’s user.

After retrieving the pid of the child process via echo $$, we write 0 1000 1 into the child process’s uid_map from the parent user namespace. We then check the user ID of the child process and now see 0.

Even though the User ID of the child process is 0, if we check the uid from the parent user namespace, it’s still 1000. We have successfully mapped the original user ID of 1000 to 0 in the new user namespace.

id
# uid=1000(brianshih) gid=1000(brianshih) groups=1000(brianshih) ...
unshare -U
id
# uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
echo $$
# 10087

# host system
ps -o 'pid uid user command' -a
# PID   UID USER     COMMAND
# 10087  1000 briansh+ -bash
echo '0 1000 1' > /proc/10087/uid_map

# in the child process
id
# uid=0(root) gid=65534(nogroup) groups=65534(nogroup)

# host system
ps -o 'pid uid user command' -a
# PID    UID  USER     COMMAND
# 10087  1000 briansh+ -bash

Implementation

The implementation is split into two portions:

  • user_ns: creating a new user namespace in the child process
  • handle_child_uid_map: updating the uid_map in the parent process.

In the run method below, the handle_child_uid_map method is run after create_child_process. This is because the parent process needs the newly created process’s pid in order to write to its uid_map.

#![allow(unused)]
fn main() {
fn run() -> ContainerResult {
    ...

    let child_pid = create_child_process(&config)?;

    handle_child_uid_map(child_pid, parent_socket.as_raw_fd(), config.user_id.clone())?;
    ...
}
}

Here is the code to create a new user namespace with unshare. After creating the new user namespace, the child process notifies the parent process that the child process has created a new user namespace. Next, the child process waits until the parent updates the uid_map before using setresuid to set the new user_id, which is likely 0.

#![allow(unused)]
fn main() {
fn user_ns(config: &ChildConfig) -> ContainerResult {
    if let Err(e) = unshare(CloneFlags::CLONE_NEWUSER) {
        println!("Failed to unshare with new user namespace: {:?}", e);
        return Err(ContainerError::UnshareNewUser);
    }

    // Notifies the parent process that the child process has created a new user namespace
    socket_send(config.socket_fd)?;

    // Wait for the parent process to update the uid_map before setting the uid

    socket_recv(config.socket_fd)?;

    if let Some(user_id) = config.user_id {
        println!("Setting UID to: {:?}", config.user_id);
        if let Err(e) = setresuid(
            Uid::from_raw(user_id),
            Uid::from_raw(user_id),
            Uid::from_raw(user_id),
        ) {
            println!("Failed to set uid. Error: {:?}", e);
            return Err(ContainerError::SetResuid);
        };
    }

    Ok(())
}
}

Here is the code for how the parent updates the uid_map. It first waits for the child to create a user namespace via socket_recv. It then writes to the uid_map file and gid_map file. Finally, it uses socket_send to notify the child that the uid_map is updated.

#![allow(unused)]
fn main() {
fn handle_child_uid_map(pid: Pid, fd: i32, user_id: Option<u32>) -> ContainerResult {
    // Wait for the child process to create a new user namespace
    socket_recv(fd)?;

    let user_id = match user_id {
        Some(id) => id,
        None => 0, // default to run as root if no user ID is provided
    };

    println!("Updating uid_map");
    match File::create(format!("/proc/{}/{}", pid.as_raw(), "uid_map")) {
        Ok(mut uid_map) => {
            if let Err(e) = uid_map.write_all(format!("0 {} {}", 1000, 1000).as_bytes()) {
                println!("Failed to write to uid_map. Error: {:?}", e);
                return Err(ContainerError::UidMap);
            }
        }
        Err(e) => {
            println!("Failed to create uid_map. Error: {:?}", e);
            return Err(ContainerError::UidMap);
        }
    }

    match File::create(format!("/proc/{}/{}", pid.as_raw(), "gid_map")) {
        Ok(mut gid_map) => {
            if let Err(e) = gid_map.write_all(format!("0 {} {}", 1000, 1000).as_bytes()) {
                println!("Failed to write to gid_map. Error: {:?}", e);
                return Err(ContainerError::UidMap);
            }
        }
        Err(e) => {
            println!("Failed to create gid_map. Error: {:?}", e);
            return Err(ContainerError::UidMap);
        }
    }

    println!("Finished updating uid_map. Notifying child process");

    // Notify the child process that the uid_map is updated
    socket_send(fd)?;
    Ok(())
}
}
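The synchronization between parent and child can be sketched with a Unix socket pair, here simulated with two threads instead of two processes (socket_send and socket_recv are minimal stand-ins for the blog’s helpers):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

// Minimal stand-ins for the blog's socket_send / socket_recv helpers:
// send one byte, or block until one byte arrives.
fn socket_send(s: &mut UnixStream) {
    s.write_all(&[1]).unwrap();
}
fn socket_recv(s: &mut UnixStream) -> u8 {
    let mut b = [0u8; 1];
    s.read_exact(&mut b).unwrap();
    b[0]
}

// Runs the two-step handshake; returns true once both sides completed in order.
fn handshake() -> bool {
    let (mut parent, mut child) = UnixStream::pair().unwrap();

    let child_side = thread::spawn(move || {
        socket_send(&mut child); // child: "user namespace created"
        socket_recv(&mut child) == 1 // child: wait for "uid_map written" before setresuid
    });

    socket_recv(&mut parent); // parent: wait for the namespace to exist
    // ... the parent would write /proc/<pid>/uid_map here ...
    socket_send(&mut parent); // parent: release the child

    child_side.join().unwrap()
}

fn main() {
    assert!(handshake());
    println!("handshake complete");
}
```

The blocking reads are what guarantee the ordering: the child cannot call setresuid before the mapping exists, and the parent cannot write the mapping before the namespace exists.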

Testing the Implementation

We create the container environment and set the new user ID to 0 via --user 0. We then confirm that id inside the container is 0.

In the host system, we confirm that the process running the executable /bin/ash has a UID of 1000. This confirms that the uid mapping worked.

id
# uid=1000(brianshih) ...
sudo target/debug/mini-container /bin/ash /home/brianshih/alpine --user 0
id
# uid=0(root)

# host system
ps -o 'pid uid user command' -a

# PID    UID  USER     COMMAND
# 10074  0    root     target/debug/mini-container /bin/ash /home/brianshih/alpi
# 10075  1000 briansh+ /bin/ash

Additional Resources

Resource Restrictions

Goal

We want to limit and isolate resource usage such as CPU, memory, disk I/O, network, etc in a container.

Theory

Cgroups is a Linux kernel feature that allows developers to control how much of a given key resource (CPU, memory, etc) a process or a set of processes can access.

According to the Linux doc, the grouping of processes is provided through a pseudo-filesystem called cgroupfs. A cgroup is a collection of processes bound to a set of limits defined via the cgroup filesystem.

Resources are managed by kernel components called subsystems, also known as resource controllers.

Different subsystems limit different resources, such as the CPU time and memory available to a cgroup. To create a cgroup, you create a directory inside the cgroup filesystem:

mkdir /sys/fs/cgroup/cg1

Each file inside the cgroup directory corresponds to a different resource that can be limited. For example, the cgroup below contains files such as memory.max which limits the memory a cgroup can access.

ls /sys/fs/cgroup/cg1
# cgroup.controllers      cpuset.cpus.partition  memory.max
# cgroup.events           cpuset.mems            memory.min
# cgroup.freeze           cpuset.mems.effective  memory.numa_stat
# cgroup.kill             cpu.stat               memory.oom.group
# cgroup.max.depth        cpu.uclamp.max         memory.peak
# cgroup.max.descendants  cpu.uclamp.min         memory.pressure
# cgroup.pressure         cpu.weight             memory.reclaim
# ... many more

Demo

In this demo (inspired by Michael Kerrisk’s tech talk), we will create a cgroup and set pids.max to 5 and confirm that the process can only run 5 tasks at max.

sudo bash
cd /sys/fs/cgroup/
# we create a cgroup called foo
mkdir foo

# add the current process to the created cgroup
echo $$ > foo/cgroup.procs

# confirm that the current process belongs to the foo cgroup
cat /proc/$$/cgroup
# 0::/foo

# set the maximum number of tasks at once
echo 5 > /sys/fs/cgroup/foo/pids.max

for i in {1..5}; do sleep 1 & done
# [1] 8379
# [2] 8380
# [3] 8381
# [4] 8382
# bash: fork: retry: Resource temporarily unavailable

After creating a new cgroup called foo and adding the process into that cgroup, we set pids.max to 5. Next, we execute for i in {1..5}; do sleep 1 & done and see that the fifth sleep 1 fails: the shell itself counts as a task in the cgroup, so a fifth child process would exceed the limit.
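The admission check the pids controller performs can be modeled simply: a new task is allowed only while the cgroup’s current task count is below pids.max (a simplified sketch; the real pids.max file may also contain the literal string max, meaning no limit):

```rust
// pids.max is either "max" (unlimited) or a number.
enum PidsMax {
    Max,
    Value(u64),
}

// Would a new task be admitted into the cgroup?
fn fork_allowed(current: u64, limit: &PidsMax) -> bool {
    match limit {
        PidsMax::Max => true,
        PidsMax::Value(n) => current < *n,
    }
}

fn main() {
    let limit = PidsMax::Value(5);
    // The shell itself counts as one task, so four background sleeps fit...
    assert!(fork_allowed(4, &limit));
    // ...but the fifth fork fails with EAGAIN ("Resource temporarily unavailable").
    assert!(!fork_allowed(5, &limit));
    assert!(fork_allowed(1_000_000, &PidsMax::Max));
    println!("ok");
}
```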

Implementation

There are many resources that we can choose to limit. For my toy container implementation, I will only limit the memory and max_pids. In the implementation, we will use the cgroup-rs crate, a Rust library for managing cgroups.

Note that limiting the resources is performed by the parent process after the child process is created. This is because we need the child process’s pid so that we can add it to the cgroup.

#![allow(unused)]
fn main() {
fn run() -> ContainerResult {
    ...
    let child_pid = create_child_process(&config)?;
    resources(&config, child_pid)?;
    ...
}
}

The code for limiting resources is simple. We create a new cgroup with the config.hostname as its name. We then write to the corresponding resource’s file before adding the pid to the created cgroup.

#![allow(unused)]
fn main() {
fn resources(config: &ChildConfig, pid: Pid) -> ContainerResult {
    println!("Restricting resource!");
    let mut cg_builder = CgroupBuilder::new(&config.hostname);
    if let Some(memory_limit) = config.memory {
        println!("Setting memory limit to: {:?}", memory_limit);

        cg_builder = cg_builder.memory().memory_hard_limit(memory_limit).done();
    }
    if let Some(max_pids) = config.max_pids {
        cg_builder = cg_builder
            .pid()
            .maximum_number_of_processes(cgroups_rs::MaxValue::Value(max_pids))
            .done();
    }

    let cg = cg_builder.build(Box::new(V2::new()));

    let pid: u64 = pid.as_raw() as u64;

    if let Err(e) = cg.add_task(CgroupPid::from(pid)) {
        println!("Failed to add task to cgroup. Error: {:?}", e);
        return Err(ContainerError::CgroupPidErr);
    };

    Ok(())
}
}

Testing the Implementation

This is the code snippet we will use to test whether limiting the number of pids in a cgroup works. It is essentially a Rust version of our earlier demo, for i in {1..5}; do sleep 1 & done, since threads also count as tasks for the pids controller.

use std::thread;
use std::time::Duration;

fn main() {
    for i in 1..=5 {
        thread::spawn(move || {
            println!("Thread {} started", i);
            thread::sleep(Duration::from_secs(1));
            println!("Thread {} completed", i);
        });
    }

    // Sleep for a while to allow threads to finish.
    thread::sleep(Duration::from_secs(2));
}

When we run the executable, we get a Resource temporarily unavailable message. If we examine the hostname and check /sys/fs/cgroup/mini-JoYUGNc/pids.max, we can see that it’s 5. We can also check which cgroup the child process is in to verify that it was added to the cgroup correctly (note that the cgroup name is derived from the container’s generated hostname, so it differs between runs).

sudo target/debug/mini-container /sleep_test /home/brianshih/alpine \
		--nproc 5
# thread 'main' panicked at 'failed to spawn thread: Os 
# { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

hostname
# hostname of child process: mini-JoYUGNc

# host system
cat /sys/fs/cgroup/mini-JoYUGNc/pids.max
# 5

# pid of child process is 8428
cat /proc/8428/cgroup
# 0::/mini-OhMDCDW

Next, we run the same command without the --nproc 5 option:

sudo target/debug/mini-container /sleep_test /home/brianshih/alpine

This time, it ran successfully, confirming that our cgroup implementation worked.

Additional Resources

Blog: What are Namespaces and cgroups?

Blog: Deep into Containers (Namespace & CGroups)

Future Work

There are a lot of container features I would love to explore in the future, such as:

  • networking: I would love to learn how systems like Docker enable containers to communicate with each other and outside the world.
  • image layers: I would love to learn how Docker images work under the hood and how the Docker engine's cache ensures the efficient creation of Docker images.
  • hacking a container: I would love to learn how to attack a container to understand container guarantees better.