Using Cgroups v2 to limit system I/O resources

Introduction to Cgroups

Cgroups (control groups) is a Linux kernel feature that limits system resources for a specified group of processes. It is commonly used by container technologies such as Docker, Kubernetes, and iSulad.

The cgroup architecture consists of two main components:

  • the cgroup core, which exposes a pseudo-filesystem, cgroupfs (see the mount example below),
  • the subsystem controllers, which enforce limits on system resources such as memory, CPU, I/O, PIDs, RDMA, etc.
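
For example, the cgroup core shows up as a mounted pseudo-filesystem; the mount point and filesystem type depend on the distribution and on whether v1 or v2 is in use (the output below is only an illustration):

$ > mount -t cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
$ > mount -t cgroup    # on a v1 system, this lists one mount per controller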

The hierarchy of Cgroups

In Cgroups v1, each resource (memory, cpu, blkio, ...) has its own hierarchy, and each resource hierarchy contains cgroups for that resource. The hierarchies are mounted under /sys/fs/cgroup/.

$ > tree -L 1 /sys/fs/cgroup
/sys/fs/cgroup
|-- blkio
|-- cpu -> cpu,cpuacct
|-- cpuacct -> cpu,cpuacct
|-- cpu,cpuacct
|-- cpuset
|-- devices
|-- freezer
|-- hugetlb
|-- memory
|-- net_cls -> net_cls,net_prio
|-- net_cls,net_prio
|-- net_prio -> net_cls,net_prio
|-- perf_event
|-- pids
|-- rdma
`-- systemd

Users can create sub-cgroups in each resource directory to limit a group of processes' usage of that resource.

The directory tree above shows the system resources that can be controlled.
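
As an illustration, on a v1 system a sub-cgroup under the memory hierarchy could be created like this (the name demo and the 100 MB limit are only examples):

$ > sudo mkdir /sys/fs/cgroup/memory/demo
$ > echo $((100*1024*1024)) | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
$ > echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs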

Unlike cgroup v1, cgroup v2 has only a single, unified hierarchy.

$ > ls -p /sys/fs/cgroup
cgroup.controllers      cgroup.subtree_control  init.scope/      nvme/
cgroup.max.depth        cgroup.threads          io.cost.model    system.slice/
cgroup.max.descendants  cpu.pressure            io.cost.qos      user.slice/
cgroup.procs            cpuset.cpus.effective   io.pressure
cgroup.stat             cpuset.mems.effective   memory.pressure

/sys/fs/cgroup is the root cgroup. The following files describe the relationship between a parent cgroup and its child cgroups.

The file cgroup.controllers lists the controllers available in this cgroup; the root cgroup has all the controllers available on the system.

The file cgroup.subtree_control lists the controllers that are enabled for this cgroup's children; these controllers are inherited by the sub-cgroups.

A child cgroup's cgroup.controllers is generated from its parent cgroup's cgroup.subtree_control file.

$ > cat cgroup.controllers
cpuset cpu io memory pids rdma
$ > cat cgroup.subtree_control
cpuset cpu io memory pids

This cgroup can limit cpuset, cpu, I/O, memory, pids, and rdma, while its child cgroups can limit only cpuset, cpu, I/O, memory, and pids.
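
You can verify this by reading a child cgroup's cgroup.controllers directly. On a systemd-based system, user.slice is one such child of the root cgroup (the exact list depends on your configuration):

$ > cat /sys/fs/cgroup/user.slice/cgroup.controllers
cpuset cpu io memory pids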

To modify cgroup.subtree_control, write a plus sign (+) to enable a controller or a minus sign (-) to disable one.

As in this example:

$ > echo '+cpu -memory' | sudo tee /sys/fs/cgroup/cgroup.subtree_control

Enable cgroups v2 in Linux

First, check which cgroup filesystem version is currently mounted:

$ > stat -fc %T /sys/fs/cgroup/

For cgroup v2, the output is cgroup2fs

For cgroup v1, the output is tmpfs

If the output is tmpfs, edit the GRUB config /etc/default/grub and append systemd.unified_cgroup_hierarchy=1 to the GRUB_CMDLINE_LINUX entry.

GRUB_CMDLINE_LINUX="ip=dhcp console=ttyS0,115200 console=tty console=ttyS0 systemd.unified_cgroup_hierarchy=1"

Then run sudo update-grub; after the system reboots, cgroups v2 will be enabled.
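
After the reboot, you can confirm that the kernel parameter was applied and that the unified hierarchy is mounted:

$ > grep -o systemd.unified_cgroup_hierarchy=1 /proc/cmdline
systemd.unified_cgroup_hierarchy=1
$ > stat -fc %T /sys/fs/cgroup/
cgroup2fs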

Configure a cgroup to limit I/O resources

To create a new cgroup, use mkdir under /sys/fs/cgroup/:

$ > cd /sys/fs/cgroup/
$ > sudo mkdir nvme

Inside the nvme directory, we have:

$ > ls nvme/
cgroup.controllers      cpu.max                cpuset.mems            memory.low
cgroup.events           cpu.pressure           cpuset.mems.effective  memory.max
cgroup.freeze           cpu.stat               io.max                 memory.min
cgroup.max.depth        cpu.uclamp.max         io.pressure            memory.oom.group
cgroup.max.descendants  cpu.uclamp.min         io.stat                memory.pressure
cgroup.procs            cpu.weight             io.weight              memory.stat
cgroup.stat             cpu.weight.nice        memory.current         pids.current
cgroup.subtree_control  cpuset.cpus            memory.events          pids.events
cgroup.threads          cpuset.cpus.effective  memory.events.local    pids.max
cgroup.type             cpuset.cpus.partition  memory.high

To control I/O resources, we need to enable the I/O controller in the root cgroup's /sys/fs/cgroup/cgroup.subtree_control file.

$ /sys/fs/cgroup> echo "+io" | sudo tee cgroup.subtree_control
$ /sys/fs/cgroup> sudo echo "+io" > cgroup.subtree_control # permission denied

Don't use > to redirect output to this file unless you are already in a root shell: the shell sets up the redirection before sudo runs, so the redirection itself is performed without root privileges and is denied.
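
If you prefer redirection over tee, run the whole command line in a root shell so that the redirection itself is performed with root privileges:

$ /sys/fs/cgroup> sudo sh -c 'echo "+io" > cgroup.subtree_control'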

We also need to find out the major and minor numbers of the block device.

$ > cat /proc/partitions
major minor  #blocks  name
	...			...		....		...
   8        0   83886080 sda
   8        1       1024 sda1
   8        2   83883008 sda2
   2        0          4 fd0
 259        0    4194304 nvme0n1

We will use the major and minor numbers of the nvme0n1 device.
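
The same numbers can also be read from the device node or from lsblk (the output shown here is illustrative):

$ > ls -l /dev/nvme0n1
brw-rw---- 1 root disk 259, 0 Apr  8 10:00 /dev/nvme0n1
$ > lsblk -o NAME,MAJ:MIN /dev/nvme0n1
NAME    MAJ:MIN
nvme0n1 259:0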

io.max supports the following control keys:

  • riops: Max read I/O operations per second
  • wiops: Max write I/O operations per second
  • rbps: Max read bytes per second
  • wbps: Max write bytes per second

Each line written to io.max must be keyed by the device numbers, in the form $MAJOR:$MINOR riops=? wiops=? rbps=? wbps=?.

A limit can be removed by setting its key back to max, e.g. riops=max.

Here we set the write bandwidth limit (wbps, write bytes per second) in io.max to 1 MB/s:

$ > /sys/fs/cgroup/nvme> echo "259:0 wbps=1048576" | sudo tee io.max

After configuring the I/O limit, we need to add the PID of the process to be controlled to cgroup.procs.

In the following command, $$ expands to the PID of the current shell:

$ /sys/fs/cgroup/nvme> echo $$ | sudo tee cgroup.procs
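
You can check that the shell is now a member of the new cgroup; on a pure v2 system, /proc/self/cgroup contains a single 0:: entry with the cgroup path relative to the root:

$ /sys/fs/cgroup/nvme> cat /proc/self/cgroup
0::/nvme

Any command started from this shell (such as the dd runs below) inherits this cgroup membership.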

Now let's use dd to generate an I/O workload. Before we start testing, clear the page cache first.

Clear the page cache:

$> free -h
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       169Mi       3.6Gi       0.0Ki       107Mi       3.5Gi
Swap:         3.8Gi          0B       3.8Gi
$> sync && echo 1 | sudo tee /proc/sys/vm/drop_caches
$> free -h
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       168Mi       3.6Gi       0.0Ki       105Mi       3.5Gi
Swap:         3.8Gi          0B       3.8Gi

Use dd to generate an I/O workload:

$ > sudo dd if=/dev/zero of=/tmp/test/file1 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0560001 s, 1.9 GB/s

Note that the data is written to the page cache immediately, which is why dd reports 1.9 GB/s.
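
Since roughly 100 MiB of dirty data now has to be written back at about 1 MB/s, flushing it should take on the order of 100 seconds. You can observe this by forcing writeback; the exact time depends on how much dirty data is queued:

$ > time sync    # in this scenario, on the order of 100 seconds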

Using iostat to monitor the writeback, we can see that MB_wrtn/s for nvme0n1 is limited to 1.0 MB/s. This is a notable difference between cgroups v1 and v2: cgroups v2 can also throttle writeback (buffered) I/O.

$ > iostat -h -m -d 1 100
Linux 5.4.0-64-generic (fvm) 	04/08/2023 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.8%    0.4%    1.5%   23.1%    0.0%   74.2%
										......................................
      tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd Device
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k fd0
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop0
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop1
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop2
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop3
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop4
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop5
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop6
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop7
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k loop8
     3.00         0.0k         1.0M         0.0k       0.0k       1.0M       0.0k nvme0n1
     0.00         0.0k         0.0k         0.0k       0.0k       0.0k       0.0k sda

Furthermore, when we add oflag=direct, which writes data directly to the device and bypasses the page cache, the write bandwidth is also limited to 1.0 MB/s.

$ > sudo dd if=/dev/zero of=/tmp/test/file1 bs=1M count=100 oflag=direct
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 100.003 s, 1.0 MB/s
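
When you are done testing, you can move the shell back to the root cgroup and remove the nvme cgroup; a cgroup directory can only be removed once it no longer contains any processes:

$ > echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
$ > sudo rmdir /sys/fs/cgroup/nvme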

Summary

In this article, we learned about Cgroups in Linux and how to use cgroups v2 to limit system I/O resources.

In a subsequent article, I will introduce some system APIs related to Cgroups.
