
Monitoring Resource Contention with PSI

When a workload is struggling and barely making any progress, we know something's wrong with it, but exactly what's wrong can be tricky to determine. For example, if the IO device is saturated, maybe the workload is IO bottlenecked, or maybe it's thrashing on memory. To analyze different cases, we can delve into memory management, IO, and other statistics. Sometimes a single statistic is clearly spiking and we can draw conclusions from it, but what if multiple indicators are elevated? How would we determine the degree to which each cause is contributing? This is even more challenging for workloads under moderate resource contention, where they run mostly okay but not at full capacity. Determining whether the workload is running slower, by how much, and for what reasons becomes difficult, especially when other workloads are sharing the system.

PSI (Pressure Stall Information) measures resource pressure: how much the workload has been slowed down due to the lack of a given resource. For example, memory pressure of 20% indicates that the workload could have run 20% faster if it had access to more memory. Resource pressure is defined for all three major local resources: CPU, memory, and IO. It's measured system-wide and per-cgroup, and is available in /proc/pressure/{cpu,memory,io} and /sys/fs/cgroup/CGROUP/{cpu,memory,io}.pressure respectively.
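Those pressure files can be read directly. Below is a minimal Python sketch, not part of resctl-demo, using a hypothetical read_pressure() helper to parse one of them; each some/full line reports avg10/avg60/avg300 stall percentages averaged over 10, 60, and 300 second windows plus a cumulative total in microseconds. Point it at /sys/fs/cgroup/CGROUP/memory.pressure to read a cgroup instead of the whole system.

    # Minimal sketch: read and parse a PSI pressure file.
    # A line looks like:
    #   some avg10=0.12 avg60=0.08 avg300=0.01 total=123456
    # Note: /proc/pressure/cpu may carry only a "some" line, depending on
    # the kernel version.
    def read_pressure(path="/proc/pressure/memory"):
        pressure = {}
        with open(path) as f:
            for line in f:
                kind, *fields = line.split()  # "some" or "full", then key=value pairs
                pressure[kind] = dict(
                    (key, float(val))
                    for key, val in (field.split("=") for field in fields)
                )
        return pressure

    mem = read_pressure()
    print("memory some avg10: %.2f%%" % mem["some"]["avg10"])
    print("memory full avg10: %.2f%%" % mem["full"]["avg10"])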

There are two types of resource pressure: full and some. full measures the duration of time during which available CPU bandwidth couldn't be consumed because all runnable threads in the scope were blocked on the resource; thus, full pressure indicates computation bandwidth loss. some measures the duration of time during which at least one thread was blocked on the resource; it indicates latency impact. A fully blocked period is, by definition, also some blocked. For CPU, full pressure is not defined: it would have to mean a loss of CPU bandwidth caused by CPU contention, which can't happen, because if threads are stalled waiting for the CPU, the CPU is busy running something else.
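To make the distinction concrete, here's an illustrative sketch, assuming a two-thread workload observed over ten equal time slices; this is not how the kernel actually accounts stalls, but it shows why any fully blocked time is also some blocked time.

    # Illustrative only: some vs. full pressure for a two-thread workload.
    # Each entry is how many threads were blocked on the resource during
    # that time slice.
    blocked = [0, 0, 1, 1, 2, 2, 1, 0, 0, 0]
    nr_threads = 2

    some_slices = sum(1 for b in blocked if b >= 1)           # at least one thread stalled
    full_slices = sum(1 for b in blocked if b == nr_threads)  # every thread stalled

    print("some pressure: %d%%" % (100 * some_slices // len(blocked)))  # 50%
    print("full pressure: %d%%" % (100 * full_slices // len(blocked)))  # 20%

The full percentage can never exceed the some percentage, which is why, in the pressure graphs, the some lines always sit at or above the full lines.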

In the top-right pane, the columns "cpuP%", "memP%" and "ioP%" show, respectively, CPU some pressure, memory full pressure, and IO full pressure. There are graphs for both full and some pressure in the graph view ('g').

PSI in Action

Let's see how PSI metrics behave. rd-hashd is running at full load. Wait for its memory usage to stop climbing; memory and IO pressure should be at zero or a very low level. Now, slowly increase rd-hashd's memory footprint using the Memory footprint slider.

Soon, RPS starts falling, and both memory and IO pressure go up. Set the memory footprint so that the workload is suffering but still serving a meaningful level of load.

Switch to graph view ('g') and look at the RPS and pressure graphs. Notice how the RPS drops match the memory and IO pressure spikes. If you add the RPS percentage and the memory pressure percentage, the sum won't stray too far from 100%: for example, when RPS has fallen to around 70% of full load, memory pressure will be hovering around 30%. This is because we increased the memory footprint to the point where the workload no longer fits in the available memory and starts thrashing. It can no longer fully utilize CPU cycles because it has to wait for memory too often and for too long. CPU cycles lost this way are counted as memory pressure, and since that's the only way we're losing capacity in this case, the load level and memory pressure roughly add up to 100%.

Note that IO pressure generally moves together with memory pressure, but sometimes registers lower. When a system is short on memory, the kernel has to keep scanning memory, evicting cold pages, and reading back, from the filesystems and swap, the pages needed to make progress. Waiting for those pages to be read back from the IO device takes up the lion's share of the lost time, so the two pressure numbers move in tandem when the only source of the slowdown is memory thrashing.

When would IO pressure go up, but not memory? This happens when a workload is slowed down waiting for IOs, but giving it more memory wouldn't reduce the amount of IO. That's the situation when rd-hashd starts up: most memory in the system is idle and available, but rd-hashd needs to read files to build up its hot working set. More memory wouldn't speed it up one bit; it still has to load the same data. Let's see how this behaves. First, click the Stop rd-hashd button to stop rd-hashd. Then, click the Start rd-hashd w/ higher page cache portion button to start rd-hashd again, targeting full load with the page cache proportion increased so that it needs to load more from files.

Watch how only IO pressure spikes as RPS ramps up, then gradually comes down as the workload reaches full load. This is because it starts with a cold cache and is mostly bottlenecked on the IO device. As the hot working set gets established, it's no longer held back by IO, and IO pressure dissipates.

Stall Amounts: Full and Some

Let's restore the default rd-hashd parameters by clicking the Reset rd-hashd parameters and load to 90% button, and let it stabilize at 90% load.

Watch the "workload" row in the top-left panel; rd-hashd should be staying close to the 90% load level. Slowly push up the Memory footprint knob to grow the memory requirement until the load level falls to around 80%. As you adjust the slider, notice how latency starts going up before RPS starts falling.

Once it stabilizes, open graph view ('g') and look at the some stat for CPU pressure, and the full and some stats for memory and IO in their respective pressure graphs. Full memory and IO pressure should be around 10%, reflecting the bandwidth loss, while the some pressures chart noticeably higher. The workload is running at reduced bandwidth with raised latency, from ~50ms to ~75ms: the former is captured by the full pressures, the latter by the some pressures.

Read On

Now that you have a basic understanding of resource pressure, let's check out one of its important use cases: OOMD.