Creating and organizing cgroups
Initially, only the root cgroup exists, to which all processes belong. You create an empty child cgroup by adding a subdirectory:
mkdir /sys/fs/cgroup/cg1
Each cgroup has an interface file called cgroup.procs that lists the PIDs of all processes belonging to the cgroup, one per line. A process can be moved to a cgroup by writing its PID into the cgroup's cgroup.procs file:
echo 24982 > /sys/fs/cgroup/cg1/cgroup.procs
Only one process can be migrated on a single write call. If a process is composed of multiple threads, writing the PID of any thread migrates all threads of the process.
Note: A process can be in only one cgroup at a time.
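The two steps above can be combined into a small helper. This is a minimal sketch, not part of the original guide: `move_to_cgroup` and the `CGROUP_ROOT` variable are hypothetical names, and the default mount point `/sys/fs/cgroup` is assumed from the examples above. Each PID is written separately because only one process can be migrated per write.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# move_to_cgroup NAME PID...   (hypothetical helper)
# Creates the cgroup if it does not exist yet, then moves each PID into it
# with its own write to cgroup.procs.
move_to_cgroup() {
    cg="$CGROUP_ROOT/$1"
    shift
    mkdir -p "$cg" || return 1
    for pid in "$@"; do
        echo "$pid" > "$cg/cgroup.procs" || return 1
    done
}
```

For example, `move_to_cgroup cg1 24982` performs the same work as the `mkdir` and `echo` commands shown above; on a real system it typically needs root privileges.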
Designing a cgroup hierarchy
Controllers can be enabled in any cgroup from the root to the leaves. Each controller distributes its system resource along the hierarchy, according to its configuration and the configuration of the hierarchy’s subtrees.
You specify controllers for each cgroup using two interface files that appear in every cgroup—including the root and all its children:
File | Description |
---|---|
cgroup.controllers | Lists the controllers available in a cgroup. In the root cgroup, it lists all the controllers available on the system. In child cgroups it lists the controllers specified in its parent's cgroup.subtree_control file (see below). |
cgroup.subtree_control | Lists the controllers that are active (enabled) in the cgroup’s subtrees. The controllers listed here are the ones available to descendant cgroups; they're listed in the cgroup.controllers file in descendant cgroups. |
You activate or deactivate controllers by writing their names to cgroup.subtree_control, each preceded by either a plus sign (+) to enable it or a minus sign (-) to disable it, as in this example:
echo '+cpu -memory' > /sys/fs/cgroup/cg1/cgroup.subtree_control
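A controller can only be enabled for a subtree if it is available in the cgroup itself, so it can help to check cgroup.controllers before writing. The following is a sketch under that assumption; `controller_available`, `enable_controller`, and `CGROUP_ROOT` are hypothetical names introduced for illustration.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# controller_available CGROUP CONTROLLER   (hypothetical helper)
# Succeeds if CONTROLLER appears as a whole word in the cgroup's
# cgroup.controllers file (so "cpu" does not match "cpuset").
controller_available() {
    grep -qw -- "$2" "$CGROUP_ROOT/$1/cgroup.controllers"
}

# enable_controller CGROUP CONTROLLER   (hypothetical helper)
# Enables CONTROLLER for the children of CGROUP, but only if it is available.
enable_controller() {
    if controller_available "$1" "$2"; then
        echo "+$2" > "$CGROUP_ROOT/$1/cgroup.subtree_control"
    else
        echo "controller '$2' is not available in $1" >&2
        return 1
    fi
}
```

Calling `enable_controller cg1 cpu` would then write `+cpu` to cg1's cgroup.subtree_control only when `cpu` is listed in cg1's cgroup.controllers.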
In the hierarchy below, each cgroup.subtree_control file determines the set of controllers available to its child cgroups, i.e., the controllers that appear in the cgroup.controllers file of its children.
In this example, the root cgroup distributes resources to two partitions: system.slice, where system processes run, and workload.slice, where production workload apps typically run. The set of resource controllers available to child cgroups with services like crond.service or smc_proxy.service is further restricted by the cgroup.subtree_control files of their respective parents.
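One way to see this relationship on a live system is to print a parent's cgroup.subtree_control next to each child's cgroup.controllers. This is a rough sketch; `show_child_controllers` and `CGROUP_ROOT` are hypothetical names, and the helper only reads files, so it does not change any configuration.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# show_child_controllers CGROUP   (hypothetical helper)
# Prints the controllers the parent enables for its subtree, followed by
# the controllers each direct child reports as available.
show_child_controllers() {
    parent="$CGROUP_ROOT/$1"
    echo "parent enables: $(cat "$parent/cgroup.subtree_control")"
    for dir in "$parent"/*/; do
        # Skip anything that is not a cgroup directory (or an unmatched glob).
        [ -f "$dir/cgroup.controllers" ] || continue
        echo "$(basename "$dir"): $(cat "$dir/cgroup.controllers")"
    done
}
```

Running `show_child_controllers .` inspects the root cgroup; the list printed for each child should match the list the parent enables.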
The fbtax2 cgroup hierarchy
The cgroup hierarchy used for the fbtax2 project is similar to the one above, but introduces some additional structures and best practices.
It divides the hierarchy into three top-level cgroups, each with its own purpose. In addition to the system.slice cgroup where system binaries run, it also includes:
hostcritical.slice
workload.slice
hostcritical.slice
This cgroup protects processes required to keep the host running. It contains critical host management functions like sshd and oomd, an alternative to the system OOM (out-of-memory) killer that we'll look at in a later section of this case study.
workload.slice
Protecting the main workload from resource conflicts with system binaries was a primary goal of the project, and that's the main purpose of workload.slice. In this case, the main workload gets its own child cgroup, workload-container.slice, to protect hhvm.
workload-support.slice provides protection to some of the system binaries needed to keep the main workload running. For example, if binaries like workload-support.service fail, the main workload can also fail. The exact set of system binaries a workload needs, and the amount of resources allocated to them, will differ depending on the context. But protecting required binaries in a cgroup like workload-support.slice helps ensure they're available to keep the main workload running.
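The layout described in this section could be created along these lines. This is an illustrative sketch only: `create_fbtax2_layout` and `CGROUP_ROOT` are hypothetical names, and placing workload-support.slice under workload.slice is an assumption drawn from where the text discusses it, not something the text states outright. It creates empty cgroups and does not enable any controllers.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# create_fbtax2_layout   (hypothetical helper)
# Creates the three top-level slices and the two workload children.
# Assumption: workload-support.slice is nested under workload.slice.
create_fbtax2_layout() {
    for cg in \
        system.slice \
        hostcritical.slice \
        workload.slice \
        workload.slice/workload-container.slice \
        workload.slice/workload-support.slice
    do
        mkdir -p "$CGROUP_ROOT/$cg" || return 1
    done
}
```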
In the next section of this case study, we'll look at how fbtax2 uses PSI memory pressure metrics.
Additional notes about cgroup hierarchies
All controller behaviors are hierarchical: if you enable a controller on a cgroup, it affects all processes belonging to the cgroups in its sub-hierarchy.
Similarly, most of the statistics you can query in cgroups, such as current memory or CPU usage, show the sum for the entire subtree.
Restrictions set closer to the root in the hierarchy can't be overridden from further away. When you enable a controller on a nested cgroup, it always restricts the resource distribution further.
Note: The sum of the resources consumed by child cgroups can't exceed the total amount of resources available to the parent.
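To illustrate why a nested setting can only tighten a restriction, the sketch below walks from a cgroup up toward the root and reports the smallest memory limit it finds. It relies on memory.max, the memory controller's hard-limit interface file, which this section does not otherwise cover; `effective_memory_max` and `CGROUP_ROOT` are hypothetical names.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# effective_memory_max CGROUP   (hypothetical helper)
# Prints the tightest memory.max value on the path from CGROUP up to the
# root, or "max" if no cgroup on that path sets a limit.
effective_memory_max() {
    path="$1"
    best="max"
    while [ -n "$path" ] && [ "$path" != "." ] && [ "$path" != "/" ]; do
        f="$CGROUP_ROOT/$path/memory.max"
        if [ -r "$f" ]; then
            v="$(cat "$f")"
            # "max" means unlimited, so only numeric values can tighten the limit.
            if [ "$v" != "max" ]; then
                if [ "$best" = "max" ] || [ "$v" -lt "$best" ]; then
                    best="$v"
                fi
            fi
        fi
        path="$(dirname "$path")"
    done
    echo "$best"
}
```

With a limit set on cg1, `effective_memory_max cg1/cg2` never reports more than cg1's value, whatever cg2's own memory.max says.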
Avoiding resource conflicts between parent and child cgroups
To eliminate situations where child cgroups compete for resources against the internal processes of their parent, cgroups can control distribution of memory and IO to their children only when they contain no processes of their own. In other words, cgroups can either contain tasks or control resources for child cgroups, but not both.
Note: An exception to this rule is the root cgroup, which can both contain processes and control resource distribution.
Thus, if a parent cgroup controls resource distribution to its child(ren) (i.e., by having a non-empty cgroup.subtree_control file), any processes in the parent cgroup must be moved to their own child cgroups.
For instance, a cgroup like /cg1/cg2 can contain processes, but if /cg1 also contains any processes of its own, those processes should be moved to their own leaf node, e.g., /cg1/cg3.
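Moving a parent's own processes into a leaf could look roughly like this. It is a sketch, not a prescribed procedure: `evacuate_to_leaf` and `CGROUP_ROOT` are hypothetical names, and processes that exit or fork while the loop runs are only reported, not retried.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# evacuate_to_leaf PARENT LEAF   (hypothetical helper)
# Moves every process listed in PARENT's cgroup.procs into PARENT/LEAF,
# one PID per write.
evacuate_to_leaf() {
    src="$CGROUP_ROOT/$1"
    dst="$src/$2"
    mkdir -p "$dst" || return 1
    # Take a snapshot of the PID list before migrating anything.
    pids="$(cat "$src/cgroup.procs")"
    for pid in $pids; do
        echo "$pid" > "$dst/cgroup.procs" || echo "could not move $pid" >&2
    done
}
```

For the example above, `evacuate_to_leaf cg1 cg3` would move the processes in /cg1 into /cg1/cg3, after which controllers can be enabled in /cg1's cgroup.subtree_control.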
For certain legacy configurations, cgroups that use the CPU controller can also both contain processes and control resource distribution. See the sections on thread mode in the cgroups man page for details.
Other useful cgroup interface files
These cgroup interface files come in handy when designing and testing a cgroup hierarchy:
File | Description |
---|---|
cgroup.events | Contains key/value pairs that identify states or events for the cgroup. Events or state changes generate file-modified events that allow applications to track and monitor changes. The populated field indicates whether or not the cgroup or any of its descendants contain any active processes: 1 indicates the cgroup and/or its children contain active processes; 0 means there are no active processes. |
cgroup.max.depth | Specifies the limit on the nesting depth of descendant cgroups. 0 indicates no descendant cgroups can be created. Default is max. |
cgroup.max.descendants | Specifies the limit on the number of descendant cgroups that a cgroup may have. Writing the string max to this file means that no limit is imposed. Default value is max. |
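As a small example of using these files while testing a hierarchy, the sketch below reads the populated field from cgroup.events. `is_populated` and `CGROUP_ROOT` are hypothetical names; the helper polls the file once and does not watch for the file-modified events mentioned in the table.

```shell
# Assumption: cgroup2 is mounted at /sys/fs/cgroup unless CGROUP_ROOT says otherwise.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# is_populated CGROUP   (hypothetical helper)
# Succeeds if the "populated" key in cgroup.events is 1, meaning the cgroup
# or one of its descendants contains active processes.
is_populated() {
    state="$(awk '$1 == "populated" { print $2 }' "$CGROUP_ROOT/$1/cgroup.events")"
    [ "$state" = "1" ]
}
```

A test script might use it as `is_populated cg1 || rmdir /sys/fs/cgroup/cg1` to remove a cgroup only once it is empty.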