Project Overview

Here are the high-level steps the project performs:

Parse the command-line arguments
Create the child process
Set up the namespaces, capabilities, and syscall restrictions
Execute the program

Parse the command-line arguments

To parse the command line arguments, we use the clap crate. Here is the struct representation of the parsed arguments:

use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Command to execute
    command: String,

    /// Absolute path to new root filesystem
    root_filesystem_path: String,

    /// Optional pid for child process
    #[arg(short, long)]
    pid: Option<u32>,

    /// Memory limit (megabytes)
    #[arg(short, long)]
    memory: Option<i64>,

    /// Maximum number of processes
    #[arg(long)]
    nproc: Option<i64>,

    /// User ID for the child process
    #[arg(short, long)]
    user: Option<u32>,

    /// Add capabilities to the bounding set
    #[arg(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_add: Option<Vec<String>>,

    /// Remove capabilities from the bounding set, or all if the String provided is "ALL"
    #[arg(long, value_parser, num_args = 1.., value_delimiter = ' ')]
    cap_drop: Option<Vec<String>>,
}

The main function delegates to run, which does the real work. To parse the arguments, all run has to do is call Cli::parse():

use std::process::exit;

fn main() {
    // On any error, clean up and exit with a non-zero status.
    if run().is_err() {
        cleanup();
        exit(-1);
    }
}

fn run() -> ContainerResult {
    let cli = Cli::parse();
    ...
}
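
The snippets in this section use ContainerResult and ContainerError without showing their definitions. As a rough sketch of what they might look like: the three variants below are the ones that appear in this section, the alias is inferred from run returning Ok(()), and exit_code is a hypothetical helper added only to illustrate the mapping that main performs. The real project may define these differently.

```rust
use std::fmt;

// Assumed error type: only the variants that appear in this section.
// The real enum may carry more variants or extra data.
#[derive(Debug, PartialEq)]
enum ContainerError {
    Clone,
    WaitPid,
    Execve,
}

impl fmt::Display for ContainerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            ContainerError::Clone => "failed to clone the child process",
            ContainerError::WaitPid => "failed to wait for the child process",
            ContainerError::Execve => "failed to execute the program",
        };
        f.write_str(msg)
    }
}

impl std::error::Error for ContainerError {}

// Assumed alias, inferred from `fn run() -> ContainerResult` returning `Ok(())`.
type ContainerResult = Result<(), ContainerError>;

// Hypothetical helper: 0 on success, -1 on any error,
// mirroring the `exit(-1)` call in `main`.
fn exit_code(result: &ContainerResult) -> i32 {
    match result {
        Ok(()) => 0,
        Err(_) => -1,
    }
}
```

With definitions like these, the ? operator in run propagates any ContainerError up to main, which only needs to check whether the result is an error.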

Create the child process

Since a container is just a process, we need to create the child process for the container. The create_child_process function is responsible for that.

fn run() -> ContainerResult {
    let cli = Cli::parse();

    ...
    let child_pid = create_child_process(&config)?;
    // Block until the child exits so the parent outlives the container.
    if waitpid(child_pid, None).is_err() {
        return Err(ContainerError::WaitPid);
    }
    Ok(())
}

After creating the child process, the parent must not terminate until the child completes. The waitpid call guarantees this by blocking the parent until the child exits.
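
To see the same parent-waits-for-child pattern in isolation, here is a minimal sketch that uses only the standard library. Note that it swaps clone and waitpid for std::process::Command, whose wait method blocks the parent in a similar way. The function name run_and_wait is hypothetical, and the sketch assumes a Unix-like system with sh on the PATH.

```rust
use std::io;
use std::process::Command;

// Spawns `sh -c <script>` as a child process, blocks until it exits,
// and returns its exit code (None if the child was killed by a signal).
fn run_and_wait(script: &str) -> io::Result<Option<i32>> {
    let mut child = Command::new("sh").arg("-c").arg(script).spawn()?;
    // `wait` blocks the parent until the child terminates, playing the
    // role that `waitpid(child_pid, None)` plays in `run`.
    let status = child.wait()?;
    Ok(status.code())
}
```

For example, run_and_wait("exit 3") returns only after the shell has exited and reports its exit code, so the parent can tell whether the child succeeded.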

Here is the implementation for create_child_process:

use nix::sched::{clone, CloneFlags};
use nix::sys::signal::Signal;
use nix::unistd::Pid;

// Creates a child process with clone and runs the executable file
// with execve in the child process.
fn create_child_process(config: &ChildConfig) -> Result<Pid, ContainerError> {
    // Each flag places the child in a new namespace of that kind.
    let flags = CloneFlags::CLONE_NEWNS
        | CloneFlags::CLONE_NEWCGROUP
        | CloneFlags::CLONE_NEWPID
        | CloneFlags::CLONE_NEWIPC
        | CloneFlags::CLONE_NEWNET
        | CloneFlags::CLONE_NEWUTS;
    let mut stack = [0; STACK_SIZE];
    let clone_res = unsafe {
        clone(
            // The child's exit code is 0 on success and -1 on failure.
            Box::new(|| match child(config) {
                Ok(_) => 0,
                Err(_) => -1,
            }),
            &mut stack,
            flags,
            // If the signal SIGCHLD is ignored, waitpid will hang until the
            // child exits and then fail with code ECHILD.
            Some(Signal::SIGCHLD as i32),
        )
    };

    match clone_res {
        Ok(pid) => {
            println!("Child pid: {:?}", pid);
            Ok(pid)
        }
        Err(_) => Err(ContainerError::Clone),
    }
}

It uses clone to create the child process. The flags passed to clone, such as CLONE_NEWNS and CLONE_NEWPID, place the child in new namespaces (mount, PID, network, and so on), which provide the isolation a container needs. We will cover these namespaces in more detail later.

The clone function takes a function argument, and the child process terminates when that function returns. The function we pass to clone is child, whose responsibility is to set up the container environment and execute the user-provided program.
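
To make the flags less opaque, here is a small sketch of the bitmask they form. The constants are the values defined in the Linux header linux/sched.h. At the level of the raw clone system call, the low byte of the flags argument holds the signal sent to the parent when the child exits, which is why the sketch ORs SIGCHLD into the same word. The value 17 for SIGCHLD is an assumption that holds on x86-64 and aarch64 but not on every architecture, and both function names are hypothetical.

```rust
// Namespace flag values from the Linux header <linux/sched.h>.
const CLONE_NEWNS: u64 = 0x0002_0000; // mount namespace
const CLONE_NEWCGROUP: u64 = 0x0200_0000; // cgroup namespace
const CLONE_NEWUTS: u64 = 0x0400_0000; // hostname and domain name
const CLONE_NEWIPC: u64 = 0x0800_0000; // System V IPC, POSIX message queues
const CLONE_NEWPID: u64 = 0x2000_0000; // process IDs
const CLONE_NEWNET: u64 = 0x4000_0000; // network stack

// Assumption: SIGCHLD is 17 on x86-64 and aarch64; other architectures differ.
const SIGCHLD: u64 = 17;

// The same set of namespaces requested by create_child_process.
fn namespace_flags() -> u64 {
    CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS
}

// The raw clone system call expects the exit signal in the low byte of flags.
fn raw_clone_flags() -> u64 {
    namespace_flags() | SIGCHLD
}

// Returns true if `flags` requests the namespace identified by `ns_flag`.
fn requests_namespace(flags: u64, ns_flag: u64) -> bool {
    flags & ns_flag != 0
}
```

Because each namespace has its own bit, adding or removing one kind of isolation is a matter of setting or clearing a single flag.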

Set up the namespaces, capabilities, and syscall restrictions, and execute the program

Here is the implementation of child:

use nix::unistd::execve;
use std::ffi::CString;

// Sets up the namespaces, capabilities, and syscall restrictions
// before running the executable.
fn child(config: &ChildConfig) -> ContainerResult {
    set_hostname(config)?;
    isolate_filesystem(config)?;
    user_ns(config)?;
    capabilities(config)?;
    syscalls()?;
    // On success execve replaces the process image and never returns.
    match execve::<CString, CString>(&config.exec_path, &config.args, &[]) {
        Ok(_) => Ok(()),
        Err(e) => {
            eprintln!("Failed to execute!: {:?}", e);
            Err(ContainerError::Execve)
        }
    }
}

Before using execve to execute the user-provided program, we set up the container environment by setting the hostname, isolating the filesystem, setting up the user namespace, adding and dropping capabilities, and restricting syscalls.
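
The execve call above takes config.exec_path and config.args as C strings. The section does not show how those fields are built, so the following is only a sketch of one way to derive them from a command string. ExecSpec and build_exec_spec are hypothetical names, and splitting on whitespace without any quoting support is a simplifying assumption.

```rust
use std::ffi::{CString, NulError};

// Hypothetical stand-in for the two ChildConfig fields that `child` passes
// to execve. The real ChildConfig is not shown in this section.
#[derive(Debug)]
struct ExecSpec {
    exec_path: CString,
    args: Vec<CString>,
}

// Splits a command line on whitespace (no quoting support) and converts each
// piece to a CString. Returns Ok(None) for an empty command and Err if any
// piece contains an interior NUL byte, which a C string cannot represent.
fn build_exec_spec(command: &str) -> Result<Option<ExecSpec>, NulError> {
    let args = command
        .split_whitespace()
        .map(CString::new)
        .collect::<Result<Vec<CString>, NulError>>()?;

    // By convention argv[0] is the program itself, so the path is also
    // kept as the first argument.
    let exec_path = match args.first() {
        Some(first) => first.clone(),
        None => return Ok(None),
    };
    Ok(Some(ExecSpec { exec_path, args }))
}
```

For the command "/bin/ls -l /tmp", this yields an exec_path of /bin/ls and three arguments, the first of which repeats the program path.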

Summary

To summarize, the project contains these core functions:

run: parses the command-line arguments, creates the child process, and waits until the child process terminates.
create_child_process: uses clone to create the child process, passing child as the function argument to clone.
child: sets up the container environment before executing the user-provided program with execve.

For the rest of this blog, we will focus on how to set up the container environment for the process. We will break each component of the container environment down into:

Goal
Theory
Demo
Implementation
Testing the Implementation
