cgroup2

Creating and organizing cgroups

Initially, only the root cgroup exists, and all processes belong to it. You create an empty child cgroup by adding a subdirectory:

mkdir /sys/fs/cgroup/cg1

Each cgroup has an interface file called cgroup.procs that lists the PIDs of all processes belonging to the cgroup, one per line. A process can be moved to a cgroup by writing its PID into the cgroup's cgroup.procs file:

echo 24982 > /sys/fs/cgroup/cg1/cgroup.procs

Only one process can be migrated on a single write call. If a process is composed of multiple threads, writing the PID of any thread migrates all threads of the process.
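Because each write migrates a single process, moving several processes takes one write per PID. The following is a minimal sketch of that loop; the helper name cg_move_pids and the PIDs in the usage line are hypothetical, and the caller is assumed to have write permission on the target cgroup.

```shell
# Move each PID given after the cgroup directory into that cgroup.
# cg_move_pids is a hypothetical helper name, not part of any cgroup tooling.
cg_move_pids() {
    local cg_dir="$1" pid
    shift
    for pid in "$@"; do
        # cgroup.procs accepts a single PID per write, so issue one
        # write per process instead of a space-separated list.
        if echo "$pid" > "$cg_dir/cgroup.procs"; then
            echo "moved $pid"
        else
            echo "failed to move $pid" >&2
            return 1
        fi
    done
}
```

For example, `cg_move_pids /sys/fs/cgroup/cg1 24982 24990` would issue two separate writes.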

Note: A process can be in only one cgroup at a time.
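To check which cgroup a process currently belongs to, you can read /proc/<pid>/cgroup, where the cgroup2 entry has the form `0::<path>`. This file is not covered in the text above; the sketch below assumes a standard Linux /proc layout, and both function names are hypothetical.

```shell
# Print the cgroup2 path from a /proc/<pid>/cgroup style file.
# cgroup2 entries start with "0::"; cgroup v1 entries, if present,
# use a non-zero hierarchy ID and are skipped.
cg_path_from_file() {
    sed -n 's/^0:://p' "$1"
}

# Convenience wrapper: look up the cgroup of a PID (hypothetical helper).
cg_of_pid() {
    cg_path_from_file "/proc/$1/cgroup"
}
```

On a cgroup2 system, `cg_of_pid $$` prints the path of the current shell's cgroup relative to the cgroup2 mount point.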

Designing a cgroup hierarchy

Controllers can be enabled in any cgroup from the root to the leaves. Each controller distributes its system resource along the hierarchy, according to its configuration and the configuration of the hierarchy’s subtrees.

You specify controllers for each cgroup using two interface files that appear in every cgroup, including the root and all its children:

File | Description
cgroup.controllers | Lists the controllers available in a cgroup. In the root cgroup, it lists all the controllers available on the system. In a child cgroup, it lists the controllers specified in the parent's cgroup.subtree_control file (see below).
cgroup.subtree_control | Lists the controllers that are active (enabled) in the cgroup's subtrees. The controllers listed here are the ones available to descendant cgroups; they're listed in the cgroup.controllers file in descendant cgroups.

You activate or deactivate controllers by writing their names to cgroup.subtree_control, each preceded by either a plus sign (+) to enable it, or a minus sign (-) to disable it, as in this example:

echo '+cpu -memory' > /sys/fs/cgroup/cg1/cgroup.subtree_control
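Since a controller can only be enabled for children if it appears in the cgroup's own cgroup.controllers file, it can help to check availability before writing. This is a sketch under that assumption; cg_enable_controllers is a hypothetical helper, not an existing tool.

```shell
# Enable the named controllers for the children of a cgroup, after
# checking that each one is listed in the cgroup's cgroup.controllers.
cg_enable_controllers() {
    local cg_dir="$1" ctrl request=""
    shift
    [ "$#" -gt 0 ] || return 0
    for ctrl in "$@"; do
        # Split the space-separated controller list into lines and
        # look for an exact match.
        if ! tr ' ' '\n' < "$cg_dir/cgroup.controllers" | grep -qx "$ctrl"; then
            echo "controller not available in $cg_dir: $ctrl" >&2
            return 1
        fi
        request="$request +$ctrl"
    done
    # Strip the leading space and submit all controllers in one write.
    echo "${request# }" > "$cg_dir/cgroup.subtree_control"
}
```

For example, `cg_enable_controllers /sys/fs/cgroup/cg1 cpu memory` writes `+cpu +memory` to cg1's cgroup.subtree_control, or fails without writing if either controller is unavailable there.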

In the hierarchy below, each cgroup.subtree_control file determines the set of controllers available to its child cgroups, i.e., the controllers that appear in the cgroup.controllers file of its children.

[Figure: example cgroup hierarchy]

In this example, the root cgroup distributes resources to two partitions: system.slice, where system processes run, and workload.slice, where production workload apps typically run. The set of resource controllers available to child cgroups with services like crond.service or smc_proxy.service is further restricted by the cgroup.subtree_control files of their respective parents.
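One way to see how each parent's cgroup.subtree_control narrows what its children receive is to print both interface files for every cgroup in a tree. A minimal sketch, with the hypothetical helper name cg_show_controllers:

```shell
# For every cgroup under a root directory, print the controllers it has
# available (cgroup.controllers) and those it passes on to its children
# (cgroup.subtree_control).
cg_show_controllers() {
    local root="$1" dir rel
    find "$root" -type d | sort | while read -r dir; do
        rel="${dir#"$root"}"
        printf '%s: available=[%s] subtree=[%s]\n' \
            "${rel:-/}" \
            "$(cat "$dir/cgroup.controllers" 2>/dev/null)" \
            "$(cat "$dir/cgroup.subtree_control" 2>/dev/null)"
    done
}
```

Running `cg_show_controllers /sys/fs/cgroup` prints one line per cgroup; a child's available list should match its parent's subtree list.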

The fbtax2 cgroup hierarchy

The cgroup hierarchy used for the fbtax2 project is similar to the one above, but introduces some additional structures and best practices.

It divides the hierarchy into three top-level cgroups, each with its own purpose. In addition to the system.slice cgroup where system binaries run, it also includes:

  • hostcritical.slice
  • workload.slice

hostcritical.slice

This cgroup protects processes required to keep the host running. It contains critical host management functions such as sshd and oomd. oomd is an alternative to the system OOM (out-of-memory) killer that we'll look at in a later section of this case study.

workload.slice

Protecting the main workload from resource conflicts with system binaries was a primary goal of the project, and that's the main purpose of workload.slice. In this case, the main workload, hhvm, gets its own child cgroup, workload-container.slice.

Another child cgroup, workload-support.slice, provides protection to some of the system binaries needed to keep the main workload running. For example, if binaries like workload-support.service fail, the main workload can also fail. The exact set of system binaries a workload needs, and the amount of resources allocated to them, will differ depending on the context. But protecting required binaries in a cgroup like workload-support.slice helps ensure they're available to keep the main workload running.
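The layout described above can be sketched as plain directories. This is only an illustration of the shape of the tree: the helper name cg_make_fbtax2_tree is hypothetical, and on a systemd-managed host the slices would normally be created by systemd rather than by hand.

```shell
# Create the fbtax2-style skeleton under a cgroup2 root directory:
# three top-level slices, with two children under workload.slice.
cg_make_fbtax2_tree() {
    local root="$1"
    mkdir -p \
        "$root/hostcritical.slice" \
        "$root/system.slice" \
        "$root/workload.slice/workload-container.slice" \
        "$root/workload.slice/workload-support.slice"
}
```

Passing a scratch directory instead of /sys/fs/cgroup lets you inspect the resulting layout without touching the live hierarchy.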

In the next section of this case study, we'll look at how fbtax2 uses PSI memory pressure metrics.

Additional notes about cgroup hierarchies

All controller behaviors are hierarchical: if you enable a controller on a cgroup, it affects all processes belonging to the cgroups in its sub-hierarchy.

Similarly, most of the statistics you can query in cgroups, such as current memory or CPU usage, show the sum for the entire subtree.
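As a rough way to observe this, you can compare a cgroup's memory.current with the total reported by its direct children. The sketch below assumes the memory controller is enabled for the children so that each has a memory.current file; cg_children_mem_sum is a hypothetical helper. The parent's own value may be higher than this sum, since it covers the whole subtree.

```shell
# Sum memory.current (bytes) over the direct children of a cgroup.
cg_children_mem_sum() {
    local cg_dir="$1" f sum=0
    for f in "$cg_dir"/*/memory.current; do
        # Skip the unmatched glob pattern when there are no children.
        [ -r "$f" ] || continue
        sum=$((sum + $(cat "$f")))
    done
    echo "$sum"
}
```

Compare the result of `cg_children_mem_sum /sys/fs/cgroup/cg1` with `cat /sys/fs/cgroup/cg1/memory.current`.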

Restrictions set closer to the root in the hierarchy can't be overridden from further away. When you enable a controller on a nested cgroup, it always restricts the resource distribution further.

Note: The sum of the resources consumed by child cgroups can't exceed the total amount of resources available to the parent.

Avoiding resource conflicts between parent and child cgroups

To eliminate situations where child cgroups compete for resources against the internal processes of their parent, cgroups can control distribution of memory and IO to their children only when they contain no processes of their own. In other words, a cgroup can either contain processes or control resources for child cgroups, but not both.

Note: An exception to this rule is the root cgroup, which can both contain processes and control resource distribution.

Thus, if a parent cgroup controls resource distribution to its children (i.e., it has a non-empty cgroup.subtree_control file), any processes in the parent cgroup must be moved to their own child cgroups.

For instance, a cgroup like /cg1/cg2 can contain processes, but if /cg1 also contains any processes of its own, those processes should be moved to their own leaf node, e.g., /cg1/cg3.
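The move described above can be sketched as two small helpers: one that checks whether a cgroup still holds processes of its own, and one that moves them into a leaf child. Both names are hypothetical. The check reads the file contents rather than its size, because cgroup interface files report a size of zero.

```shell
# Succeed (exit 0) if the cgroup lists at least one PID of its own.
cg_has_own_processes() {
    grep -q '[0-9]' "$1/cgroup.procs" 2>/dev/null
}

# Move every process in a parent cgroup into a leaf child, e.g.
# cg_evacuate_to_leaf /sys/fs/cgroup/cg1 cg3
cg_evacuate_to_leaf() {
    local parent="$1" leaf="$1/$2" pid
    mkdir -p "$leaf" || return 1
    # Snapshot the PID list first, then migrate one PID per write.
    # A process that exits in between makes its write fail.
    for pid in $(cat "$parent/cgroup.procs"); do
        echo "$pid" > "$leaf/cgroup.procs" || return 1
    done
}
```

After `cg_evacuate_to_leaf /sys/fs/cgroup/cg1 cg3`, `cg_has_own_processes /sys/fs/cgroup/cg1` should fail, which indicates cg1 no longer holds processes of its own.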

For certain legacy configurations, a cgroup that uses the CPU controller can also both contain processes and control resource distribution. See the sections on thread mode in the cgroups man page for details.

Other useful cgroup interface files

These cgroup interface files come in handy when designing and testing a cgroup hierarchy:

File | Description
cgroup.events | Contains key/value pairs that identify states or events for the cgroup. Events or state changes generate file-modified events that allow applications to track and monitor changes. The populated field indicates whether the cgroup or any of its descendants contain any active processes: 1 indicates the cgroup and/or its descendants contain active processes; 0 means there are no active processes.
cgroup.max.depth | Specifies the limit on the nesting depth of descendant cgroups. 0 indicates no descendant cgroups can be created. Default is max.
cgroup.max.descendants | Specifies the limit on the number of descendant cgroups that a cgroup may have. Writing the string max to this file means that no limit is imposed. Default is max.
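The files above can be read and written like any other interface file. A minimal sketch, assuming cgroup.events holds one "key value" pair per line; the helper names are hypothetical:

```shell
# Succeed (exit 0) if the cgroup or any descendant has live processes,
# based on the "populated" key in cgroup.events.
cg_is_populated() {
    [ "$(awk '$1 == "populated" { print $2 }' "$1/cgroup.events")" = "1" ]
}

# Set the nesting-depth and descendant-count limits for a cgroup.
# Pass a number or the string "max" for either limit.
cg_limit_children() {
    echo "$2" > "$1/cgroup.max.depth" &&
    echo "$3" > "$1/cgroup.max.descendants"
}
```

For example, `cg_is_populated /sys/fs/cgroup/cg1 || rmdir /sys/fs/cgroup/cg1` removes cg1 only when it is reported as unpopulated, and `cg_limit_children /sys/fs/cgroup/cg1 2 max` restricts cg1 to two levels of nesting with no limit on the descendant count.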
Facebook Open Source
Copyright © 2019 Facebook Inc.