cgroup2

PSI Pressure Metrics

Pressure Stall Information (PSI) metrics provide, for the first time, a canonical way to see resource pressure increase as it develops, with pressure metrics for three major resources: memory, CPU, and IO.

These pressure metrics, in conjunction with cgroup2 and other kernel and userspace tools described in this guide, provide the information you need to detect resource shortages while they're developing and to respond intelligently by pausing or killing non-essentials, reallocating memory in the system, or shedding load.

In short, these stats are like barometers that provide fair warning of impending resource shortages, enabling you to take more proactive, granular, and nuanced steps when resources start becoming scarce.

cgroup2 interface

In a system with the cgroup2 filesystem mounted, pressure stall information is tracked for tasks grouped into cgroups. Each cgroup's subdirectory in the cgroupfs mountpoint contains cpu.pressure, memory.pressure, and io.pressure files.

You query PSI for a cgroup by reading the corresponding pressure file. This example queries cpu.pressure for a cgroup called cg1:

$ cat /sys/fs/cgroup/cg1/cpu.pressure

The output format for CPU is:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

Memory and IO show two metrics: some and full. The CPU file shows only the some metric (newer kernels also report a full line for CPU). The values for both some and full are running averages over the last 10 seconds, 1 minute, and 5 minutes, respectively. The total statistic is the accumulated stall time in microseconds.
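
Because total is a running counter of stall time in microseconds, two samples of it can be turned into a pressure percentage over any window you choose, not just the built-in 10-second, 1-minute, and 5-minute averages. The following is a minimal sketch of that arithmetic; the function name stall_percent and its parameters are hypothetical, not part of any PSI interface.

```python
def stall_percent(prev_total_us: int, curr_total_us: int, elapsed_s: float) -> float:
    """Percentage of wall time spent stalled between two samples of `total`.

    prev_total_us / curr_total_us: the `total=` value (microseconds) read
    from the same line (some or full) of a pressure file at two points in time.
    elapsed_s: wall-clock seconds between the two reads.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    stalled_us = curr_total_us - prev_total_us
    # Convert the window length to microseconds so the units match.
    return 100.0 * stalled_us / (elapsed_s * 1_000_000)
```

For example, if total grew by 500,000 microseconds across a 2-second window, stall_percent(1_000_000, 1_500_000, 2.0) reports 25% of that window as stalled.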

PSI shows pressure as a percentage of walltime in which some or all processes are blocked on a resource:

  • some is the percentage of walltime that some (one or more) tasks were delayed due to lack of resources—for example, a lack of memory.
  • full is the percentage of walltime in which all non-idle tasks were delayed simultaneously by lack of resources, i.e., the amount of completely unproductive time.

See the PSI docs for more details on these metrics.
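
To use these values in a script, the pressure file format shown above can be parsed into a nested dictionary. This is only a sketch: parse_pressure is a hypothetical helper name, and it assumes each line has the form "kind key=value ..." as in the sample output. It accepts files with or without a full line.

```python
def parse_pressure(text: str) -> dict:
    """Parse pressure-file contents into {"some": {...}, "full": {...}}.

    Averages are returned as floats (percentages); total is returned as an
    int (accumulated microseconds).
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, pairs = fields[0], fields[1:]
        metrics = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


if __name__ == "__main__":
    # Path taken from the cg1 example above; adjust it for your own cgroup.
    with open("/sys/fs/cgroup/cg1/cpu.pressure") as f:
        print(parse_pressure(f.read()))
```

Taking the file contents as a string rather than a path keeps the parsing logic easy to test without a mounted cgroup2 filesystem.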

Memory pressure in the fbtax2 project

With the PSI memory pressure metrics, the fbtax2 team finally had a tool for determining whether a given workload has enough memory and for gaining general insight into system memory use, replacing the trial-and-error processes they had relied on before.

As discussed previously, a key goal in designing the fbtax2 cgroup hierarchy was protecting workload.slice from resource collisions with widely distributed system binaries in system.slice.

Processes in system.slice tend to be less latency-sensitive than workload.slice processes, and can usually ride out slowdowns caused by lack of memory without serious consequences. As a result, workload.slice is given resource priority at the expense of slowdowns in system.slice, helping ensure the stability of the main workload.

This chart shows how the PSI memory pressure metrics allowed the team to clearly see the results of their efforts: memory.pressure.full for system.slice (green line) spiked as high as 50% (right Y-axis), while pressure for workload.slice (orange line) stayed under 10%, reflecting its increased memory priority and protection. Meanwhile, system performance as a whole (blue line) remained relatively stable between 500 and 600 rps (left Y-axis).
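
A comparison like this can be sampled by reading the full line of memory.pressure for each slice. The sketch below is illustrative only and is not the tooling the fbtax2 team used: memory_full_avg10 is a hypothetical helper, and it assumes that the cgroup2 filesystem is mounted at the root you pass in and that system.slice and workload.slice sit directly under it.

```python
from pathlib import Path


def memory_full_avg10(cgroup_root: str, cgroups: list) -> dict:
    """Return the 'full avg10' memory pressure (percent) for each named cgroup."""
    readings = {}
    for name in cgroups:
        text = (Path(cgroup_root) / name / "memory.pressure").read_text()
        for line in text.splitlines():
            fields = line.split()
            if fields and fields[0] == "full":
                # Turn "avg10=0.00 avg60=..." into a {key: value} mapping.
                values = dict(f.split("=", 1) for f in fields[1:])
                readings[name] = float(values["avg10"])
    return readings


if __name__ == "__main__":
    # Assumed mount point and slice names; adjust for your system.
    print(memory_full_avg10("/sys/fs/cgroup", ["system.slice", "workload.slice"]))
```

Sampling this periodically and plotting the two values would give a rough equivalent of the green and orange lines described above.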

We'll look at the specific memory controller configurations fbtax2 used to get these results in the next section of this case study.

Facebook Open Source
Copyright © 2019 Facebook Inc.