
Memory Control

Most people have experienced memory shortages - on personal computers, servers, or even phones. When a workload's working set size significantly exceeds available memory, the system falls into a deep thrashing state and moves at a glacial pace, while the IO device remains constantly busy. This is because the memory has to be paged in from, and out to, the storage device on demand when the working set exceeds available memory. How aggressively the thrashing begins is determined by a combination of two factors: The workload's memory access pattern, and the performance gap between memory and the storage device. For many workloads, the access pattern has a mixture of hot and cold areas. As the memory gets tighter, cold areas get kicked out of memory and faulted back in as needed.

As long as the storage device keeps up with the pace, the workload can run fine. If memory squeezes further, the demand on the storage device keeps rising. If the memory isn't enough to hold the hot areas, demand can spike abruptly. When demand exceeds the storage device's capabilities, the workload slows to a crawl, its progress bound to page fault IOs.

The probabilistic nature and the tendency toward drastic behavior make memory sizing a challenge. Determining the optimal amount is difficult, and getting it wrong - even just a little too low - can lead to catastrophic failures.

Cgroup2 recognizes this inherent difficulty and provides a robust and forgiving way to allocate memory between cgroups: the work-conserving memory.low control knob, which is the primary knob used for memory control in this demo.

The Linux kernel has four memory control knobs - memory.max, memory.high, memory.min and memory.low. The first two are limit mechanisms, while the latter two are for protection.

max and high put hard limits on how much memory a cgroup and its descendants can use. max triggers OOM kills when the workload can't be made to fit. high slows the workload down so that a userspace agent can handle the situation.

min and low work the other way around. Instead of limiting, they protect memory for a cgroup and its descendants. If a cgroup has 4G of low and is currently using 6G, only the overage - 2G - is considered for memory reclaim. It will experience the same level of reclaim pressure as an unprotected peer cgroup using 2G. The difference between min and low is that low can be ignored if the alternative is triggering OOM kills. For brevity, we'll only discuss low from now on.
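To make the protection arithmetic concrete, here's a minimal Python sketch - the cgroup path is hypothetical and the demo itself doesn't do this - that reads a cgroup's memory.current and memory.low and computes how much of its usage is actually exposed to reclaim:

```python
import os

# Hypothetical cgroup path, purely for illustration; any cgroup2 memory
# controller directory would do.
CGRP = "/sys/fs/cgroup/workload.slice"

def read_mem(name):
    """Read a memory controller file; 'max' means effectively unlimited."""
    with open(os.path.join(CGRP, name)) as f:
        val = f.read().strip()
    return float("inf") if val == "max" else int(val)

usage = read_mem("memory.current")
low = read_mem("memory.low")

# Only usage above the protected amount competes for reclaim:
# e.g. 6G of usage with 4G of low leaves 2G exposed.
exposed = max(usage - low, 0)
print(f"usage={usage} low={low} exposed-to-reclaim={exposed}")
```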

For more details, see the Memory Controller documentation.

memory.low

There are a couple of great things about low.

First, it's work-conserving - if the protected cgroup doesn't use the memory, it's available for others to use. Contrast this with memory limits, where even if the system is otherwise completely idle, that free memory can't be used if a cgroup has already reached its limit.

Second, it provides a gradient of protection. As a cgroup's usage grows past the protected amount, the protected amount remains protected, but reclaim pressure for the excess gradually increases. For example, a cgroup with 12G of protection that's using 15G still enjoys strong protection - it only experiences reclaim pressure equivalent to a 3G cgroup. But if its requirement drops, or the system comes under extreme memory pressure, the cgroup can still safely yield some memory to where the system needs it most.

These two factors combined make low easy and safe to use. There's no need to figure out the exact amount: You can just say "prioritize 75% of memory for the main workload and leave the rest for the usual usage-based distribution". The main workload is then prioritized and can comfortably occupy most of the system - it only has to compete in the top 25%. If the situation gets too tight for the rest of the system, i.e. the management portion, it can still wrangle out what's needed to ride out the tough times.

Compare the above to limits, where the configuration is the other way around: The management portion is limited to protect the main workload. Because reclaim pressure is applied based on comparative size, a 25% limit for the management portion doesn't make sense: If the main workload is at 80% and the others are at 20%, the main workload experiences four times more reclaim pressure, which is clearly not what we want.
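As a rough illustration of the difference - the numbers are hypothetical and the function below is plain arithmetic, not a kernel interface:

```python
def reclaim_exposure(usage_gb, low_gb=0.0):
    """Usage above the memory.low protection is what competes for reclaim."""
    return max(usage_gb - low_gb, 0.0)

total_gb = 100.0  # hypothetical machine size

# Limits only: reclaim pressure tracks comparative size, so an 80G main
# workload next to a 20G management portion sees roughly 4x the pressure.
print(80.0 / 20.0)                                     # 4.0

# 75% low on the main workload: only the excess competes, so the 80G
# workload is treated like a 5G cgroup next to the 20G management portion.
print(reclaim_exposure(80.0, low_gb=0.75 * total_gb))  # 5.0

# Gradient: 12G of protection at 15G of usage behaves like a 3G cgroup.
print(reclaim_exposure(15.0, low_gb=12.0))             # 3.0
```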

So to compensate, we decide to set the management portion's limit lower, but how much lower? Let's say 5% is usually enough and we set it to 5%. We're generally happy, but then something new rolls out that temporarily needs a bit more than 5%, and the management portion goes belly up fleet-wide. We adjust it back up a bit, but again, by how much? This cycle can eventually reach the point where the limit is both too high for adequate workload protection and too low to avoid a noticeable increase in management operation failures.

Memory Control Configuration for This Demo

This demo uses the following static memory control configuration:

  • init.scope : 16M min - systemd
  • hostcritical.slice : 768M min - dbus, journald, sshd, resctl-demo
  • workload.slice : 75% low - hashd
  • sideload.slice : No configuration
  • system.slice : No configuration

All we're doing is setting up overall protections for the important parts of the system in top-level slices. The numbers just need to be reasonable ballparks. There isn't anything workload-dependent. All it's saying is there are some critical parts of the system, and most of the memory should be used to run the main workloads. As such, the same configuration can serve a wide variety of use cases, as long as the system's usage requirements are similar.

In the above configuration, hostcritical is rather large at 768M. This is because the management agent and UI for this demo live under hostcritical and need to stay responsive at all times. In more typical setups, hostcritical would be set several times lower.
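For reference, here's a rough sketch of what this configuration amounts to at the cgroupfs level. The demo's management agent applies these values itself (and systemd offers equivalent MemoryMin=/MemoryLow= settings); the direct writes below are only illustrative and assume root access and the standard cgroup2 mount point:

```python
import os

MB = 1 << 20

# MemTotal from /proc/meminfo is reported in kB.
with open("/proc/meminfo") as f:
    total_bytes = int(f.readline().split()[1]) * 1024

protections = {
    "init.scope/memory.min": 16 * MB,
    "hostcritical.slice/memory.min": 768 * MB,
    "workload.slice/memory.low": int(total_bytes * 0.75),
}

for rel_path, nr_bytes in protections.items():
    with open(os.path.join("/sys/fs/cgroup", rel_path), "w") as f:
        f.write(str(nr_bytes))
```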

Memory Protection in Action

Let's repeat an experiment similar to the one in the previous "Cgroup and Resource Protection" section to demonstrate memory protection. rd-hashd should already be running at full load. Once it warms up, click the Disable memory control and start a compile job button to disable memory control and start a Linux compile job with a ludicrous level of concurrency that will viciously compete for memory.

Once the source tree is untarred and the compile commands start getting spawned, the system's memory pressure will shoot up. Soon after, the workload's pressure will start climbing and rd-hashd's RPS will start slumping. The degree depends on the performance of the IO device. On a very performant SSD, such as on the AWS c5d.9xlarge machine type, IO control alone may be able to protect rd-hashd for the most part in this scenario. Now click the Stop the compile job and restore memory control button to stop the compile job and restore memory control.

Wait for the sysload count to drop to zero and rd-hashd to stabilize. Once rd-hashd's RPS is stable and memory footprint stops increasing, click the Start the compile job button to launch the same compile job again.

RPS drops a bit, which is expected - we want the management portion to be able to use a small fraction of the system - but rd-hashd stays close to its full load while the misbehaving compile job is throttled so that it can't overwhelm the system.

Read On

This page described memory shortage behaviors and briefly explained how they happen. Because memory management is a critical and challenging part of understanding and debugging resource issues, we'll delve into more details on the next page.