Resource Restrictions

Goal

We want to limit and isolate a container's use of resources such as CPU, memory, disk I/O, and network bandwidth.

Theory

Control groups (cgroups) are a Linux kernel feature that lets us control how much of a given resource (CPU, memory, etc.) a process or a set of processes can use.

According to the Linux documentation, the kernel exposes cgroups through a pseudo-filesystem called cgroupfs. A cgroup is a collection of processes bound to a set of limits or parameters defined via the cgroup filesystem.

A subsystem, also known as a resource controller, is a kernel component that modifies the behavior of the processes in a cgroup.

Different subsystems limit different resources, such as the CPU time and memory available to a cgroup. To create a cgroup, you create a directory inside the cgroup filesystem:

mkdir /sys/fs/cgroup/cg1

The kernel populates the new directory with interface files, each of which reports or controls some aspect of a resource. For example, the cgroup below contains memory.max, which sets a hard limit on the memory the cgroup can use.

ls /sys/fs/cgroup/cg1
# cgroup.controllers      cpuset.cpus.partition  memory.max
# cgroup.events           cpuset.mems            memory.min
# cgroup.freeze           cpuset.mems.effective  memory.numa_stat
# cgroup.kill             cpu.stat               memory.oom.group
# cgroup.max.depth        cpu.uclamp.max         memory.peak
# cgroup.max.descendants  cpu.uclamp.min         memory.pressure
# cgroup.pressure         cpu.weight             memory.reclaim
# ... many more
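
Limit files such as memory.max and pids.max hold either a number or the literal string max, which means no limit. As a minimal sketch of reading them from Rust with only the standard library, the snippet below parses that format. The Limit type and the parse_limit and read_limit helpers are hypothetical names chosen for illustration, not part of any crate used later.

```rust
/// A limit as it appears in cgroup v2 files such as `memory.max` or `pids.max`.
#[derive(Debug, PartialEq, Eq)]
enum Limit {
    /// The literal string "max": no limit is enforced.
    Unlimited,
    /// A concrete numeric limit (bytes for memory.max, tasks for pids.max).
    Value(u64),
}

/// Parse the contents of a limit file. Returns None for anything unexpected.
fn parse_limit(contents: &str) -> Option<Limit> {
    match contents.trim() {
        "max" => Some(Limit::Unlimited),
        other => other.parse::<u64>().ok().map(Limit::Value),
    }
}

/// Read and parse one limit file inside a cgroup directory (hypothetical helper).
fn read_limit(cgroup_dir: &std::path::Path, file: &str) -> std::io::Result<Option<Limit>> {
    let contents = std::fs::read_to_string(cgroup_dir.join(file))?;
    Ok(parse_limit(&contents))
}
```

For example, read_limit(Path::new("/sys/fs/cgroup/cg1"), "memory.max") should yield Limit::Unlimited on a freshly created cgroup, assuming the memory controller is enabled for it.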

Demo

In this demo (inspired by Michael Kerrisk’s tech talk), we will create a cgroup, set pids.max to 5, and confirm that the cgroup can hold at most 5 tasks at once.

sudo bash
cd /sys/fs/cgroup/
# we create a cgroup called foo
mkdir foo

# add the current process to the created cgroup
echo $$ > foo/cgroup.procs

# confirm that the current process belongs to the foo cgroup
cat /proc/$$/cgroup
# 0::/foo

# set the maximum number of tasks at once
echo 5 > /sys/fs/cgroup/foo/pids.max

for i in {1..5}; do sleep 1 & done
# [1] 8379
# [2] 8380
# [3] 8381
# [4] 8382
# bash: fork: retry: Resource temporarily unavailable

After creating a new cgroup called foo and adding the shell to it, we set pids.max to 5. Next, we run for i in {1..5}; do sleep 1 & done. The first four sleep 1 processes start, but the fifth fork fails: the shell itself counts as one task in the cgroup, so a fifth child would be the sixth task and exceed the limit.
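
The cat /proc/$$/cgroup check from the demo can also be done programmatically. The sketch below assumes cgroup v2, where the relevant line has the form 0::<path>; cgroup_v2_path and current_cgroup are hypothetical helper names.

```rust
/// Extract the cgroup v2 path from the contents of `/proc/<pid>/cgroup`.
/// On cgroup v2 the entry looks like `0::/foo`; v1 entries are skipped.
fn cgroup_v2_path(proc_cgroup: &str) -> Option<&str> {
    proc_cgroup
        .lines()
        .find_map(|line| line.strip_prefix("0::"))
}

/// Look up the cgroup of the calling process (Linux only).
fn current_cgroup() -> std::io::Result<Option<String>> {
    let contents = std::fs::read_to_string("/proc/self/cgroup")?;
    Ok(cgroup_v2_path(&contents).map(str::to_owned))
}
```

After echo $$ > foo/cgroup.procs, calling current_cgroup() from a process started by that shell would be expected to return /foo, since children inherit the cgroup of their parent.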

Implementation

There are many resources that we can choose to limit. For my toy container implementation, I will only limit memory and the maximum number of pids. We will use the cgroups-rs crate, a Rust library for managing cgroups.

Note that the parent process applies the limits after it creates the child process, because it needs the child process’s pid in order to add it to the cgroup.

fn run() -> ContainerResult {
    ...
    let child_pid = create_child_process(&config)?;
    resources(&config, child_pid)?;
    ...
}

The code for limiting resources is simple. We build a new cgroup named after config.hostname, configure the memory and pids limits (which end up in the corresponding files of the cgroup), and then add the child's pid to the created cgroup.

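// Imports assumed for this snippet; module paths may vary between crate versions.
use cgroups_rs::cgroup_builder::CgroupBuilder;
use cgroups_rs::hierarchies::V2;
use cgroups_rs::CgroupPid;
use nix::unistd::Pid;
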
fn resources(config: &ChildConfig, pid: Pid) -> ContainerResult {
    println!("Restricting resource!");
    let mut cg_builder = CgroupBuilder::new(&config.hostname);
    if let Some(memory_limit) = config.memory {
        println!("Setting memory limit to: {:?}", memory_limit);

        cg_builder = cg_builder.memory().memory_hard_limit(memory_limit).done();
    }
    if let Some(max_pids) = config.max_pids {
        cg_builder = cg_builder
            .pid()
            .maximum_number_of_processes(cgroups_rs::MaxValue::Value(max_pids))
            .done();
    }

    let cg = cg_builder.build(Box::new(V2::new()));

    let pid: u64 = pid.as_raw() as u64;

    if let Err(e) = cg.add_task(CgroupPid::from(pid)) {
        println!("Failed to add task to cgroup. Error: {:?}", e);
        return Err(ContainerError::CgroupPidErr);
    };

    Ok(())
}
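
To make the steps concrete without the crate, here is a rough standard-library sketch of the same sequence: create the directory, write the limit files, then write the pid to cgroup.procs. The apply_limits function and its parameters are hypothetical, and this is not a description of how cgroups-rs works internally. It also assumes that the memory and pids controllers are enabled for the parent cgroup and that the caller is allowed to write there.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create `<root>/<name>`, write the requested limits, then add `pid`.
/// Returns the path of the cgroup directory.
fn apply_limits(
    root: &Path,
    name: &str,
    memory_max: Option<u64>,
    pids_max: Option<u64>,
    pid: u32,
) -> io::Result<PathBuf> {
    // Creating the directory is what creates the cgroup on cgroupfs.
    let dir = root.join(name);
    fs::create_dir_all(&dir)?;

    // Write the limits first so they are in force before the pid joins.
    if let Some(bytes) = memory_max {
        fs::write(dir.join("memory.max"), bytes.to_string())?;
    }
    if let Some(tasks) = pids_max {
        fs::write(dir.join("pids.max"), tasks.to_string())?;
    }

    // Writing a pid to cgroup.procs moves that process into the cgroup.
    fs::write(dir.join("cgroup.procs"), pid.to_string())?;
    Ok(dir)
}
```

Taking the root directory as a parameter lets the function be pointed at /sys/fs/cgroup in a container runtime or at a scratch directory when experimenting.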

Testing the Implementation

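This is the code snippet we will use to test whether limiting the number of pids in a cgroup works. It is roughly a Rust version of the earlier demo, for i in {1..5}; do sleep 1 & done, except that it spawns threads instead of processes. The pids controller counts threads as well as processes, so the limit still applies.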

use std::thread;
use std::time::Duration;

fn main() {
    for i in 1..=5 {
        thread::spawn(move || {
            println!("Thread {} started", i);
            thread::sleep(Duration::from_secs(1));
            println!("Thread {} completed", i);
        });
    }

    // Sleep for a while to allow threads to finish.
    thread::sleep(Duration::from_secs(2));
}
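
thread::spawn panics when the OS refuses to create a thread, so the program above can only show the failure as a panic. As a variation, the sketch below uses std::thread::Builder, whose spawn returns an io::Result, to count how many threads were created before an error occurred. spawn_sleepers is a hypothetical helper and not part of the original test program.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Try to spawn `n` sleeping threads, stopping at the first failure.
/// Returns how many threads were spawned and the error, if any.
fn spawn_sleepers(n: usize, nap: Duration) -> (usize, Option<io::Error>) {
    let mut handles = Vec::new();
    let mut failure = None;

    for i in 1..=n {
        // Builder::spawn reports OS errors instead of panicking.
        match thread::Builder::new()
            .name(format!("sleeper-{}", i))
            .spawn(move || thread::sleep(nap))
        {
            Ok(handle) => handles.push(handle),
            Err(e) => {
                failure = Some(e);
                break;
            }
        }
    }

    // Wait for the threads that did start before reporting.
    let spawned = handles.len();
    for handle in handles {
        let _ = handle.join();
    }
    (spawned, failure)
}
```

Under a pids.max of 5, one would expect fewer than five threads to be spawned, since the main thread already counts as a task; the exact number depends on what else is running in the cgroup.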

When we run the executable with --nproc 5, the program panics with a Resource temporarily unavailable error. If we look up the container's hostname and check /sys/fs/cgroup/mini-JoYUGNc/pids.max, we can see that it’s 5. We can also check which cgroup the child process is in to verify that it was added to the cgroup correctly.

sudo target/debug/mini-container /sleep_test /home/brianshih/alpine --nproc 5
# thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

hostname
# hostname of child process: mini-JoYUGNc

# host system
cat /sys/fs/cgroup/mini-JoYUGNc/pids.max
# 5

# pid of child process is 8428
cat /proc/8428/cgroup
# 0::/mini-JoYUGNc

Next, we run the same command without the --nproc 5 option:

sudo target/debug/mini-container /sleep_test /home/brianshih/alpine

This time, the program runs to completion, which confirms that the pids limit set by our cgroup implementation was what blocked the threads earlier.

Additional Resources

Blog: What are Namespaces and cgroups?

Blog: Deep into Containers (Namespace & CGroups)