Details on Sideloading
Let's delve into some details that we skimmed over previously.
CPU Utilization and Actual Amount of Work Done
hashd RPS serves as a proxy for the actual amount of computation done. Each request calculates the sha1 of data blocks, and the number of bytes hashed per request follows a normal distribution. While the growing memory footprint at higher RPS has some influence, the differences are minor, usually a low single-digit percentage. So, if a machine is specified to perform 100 RPS when all CPUs are fully saturated, at 50 RPS it would be doing around half the total computation it can.
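To make the relationship concrete, here's a minimal, purely illustrative Python sketch of a hashd-style request: it hashes a payload whose size is drawn from a normal distribution. The mean and standard deviation are made-up numbers, not rd-hashd's actual parameters; the point is only that each request does a similar, bounded amount of hashing, so total CPU work scales roughly linearly with RPS.

    import hashlib
    import os
    import random

    def simulate_request(mean_bytes=64 * 1024, stddev_bytes=8 * 1024):
        # Toy model of one request: sha1 over a payload whose size is
        # drawn from a normal distribution (sizes are illustrative only).
        size = max(1, int(random.gauss(mean_bytes, stddev_bytes)))
        hashlib.sha1(os.urandom(size)).hexdigest()
        return size

    # Each request does a comparable amount of hashing, so bytes hashed
    # (and thus CPU work) grows roughly linearly with the request count.
    total = sum(simulate_request() for _ in range(100))
    print(f"bytes hashed across 100 simulated requests: {total}")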
Previously, we noted that the usual CPU utilization percentage, measured in wallclock time, doesn't scale linearly with the total amount of work the CPUs can do. We can observe this relationship by varying the load level of hashd.
hashd is already running at 60% load, and the left graph panel shows the RPS / CPU util graph instead of the usual RPS / latency. The plots for the two values are directly comparable; the y-axes for both are on the same percentage scale.
Look at the CPU utilization: It's likely significantly lower than the 60% load level, though the exact gap will vary by CPU. Now, increase the load level gradually with the hashd load slider and watch how the utilization changes.
On most hardware, CPU utilization stays significantly lower than the load level until the load level crosses 80% or 90%, and then it quickly catches up. Exactly how the two map to each other depends on the specific hardware and workload.
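One intuition for why the mapping is so non-linear is SMT: when the machine is lightly loaded, the scheduler can give each busy thread a whole core, so relatively little wallclock thread-time is needed per unit of work; as load approaches saturation, sibling threads have to share cores. Here's a toy model of that effect. All numbers are assumptions - 8 physical cores, 2-way SMT, and two sibling threads delivering 1.3x the throughput of one - and it ignores frequency scaling, which also contributes in reality.

    # Toy SMT model: a core running one thread delivers 1.0 units/sec;
    # two sibling threads on the same core deliver 1.3 units/sec combined.
    NR_CORES = 8                # assumed
    SMT_SCALING = 1.3           # assumed combined throughput of two siblings
    MAX_THROUGHPUT = NR_CORES * SMT_SCALING

    def min_busy_thread_fraction(load):
        # Smallest fraction of hardware threads that must be busy to
        # deliver `load` (0..1) of the machine's maximum throughput.
        target = load * MAX_THROUGHPUT
        if target <= NR_CORES:              # one thread per core suffices
            busy = target                   # each busy thread delivers 1.0
        else:                               # start doubling up on cores
            busy = NR_CORES + (target - NR_CORES) / (SMT_SCALING - 1.0)
        return busy / (NR_CORES * 2)

    for load in (0.2, 0.4, 0.6, 0.8, 0.9, 1.0):
        print(f"load {load:.0%} -> CPU util ~{min_busy_thread_fraction(load):.0%}")

Under these assumptions, 60% load shows up as roughly 40% utilization, and utilization only catches up to load in the last stretch before saturation - the same general shape you should see in the graph.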
Let's reset the load level to 60% and continue on to the next section by selecting the Reset load level to 60% button.
CPU Sub-resource Contention
Let's see whether we can demonstrate the effect of CPU sub-resource contention.
The RPS determines how much computation rd-hashd is doing. While memory and IO activities have some effect on CPU usage, the effect isn't significant unless the system is under heavy memory pressure. So, we can use RPS as a measure of the total amount of work the CPUs are doing.
rd-hashd should already be running at 60% load. Once it warms up, note the workload's CPU utilization level: It should be fairly stable. Now, let's start the Linux build job as a sysload - no CPU headroom - with a relaxed rd-hashd latency target by selecting the Relax rd-hashd latency target and start linux build sysload button, and see how that changes the picture.
Wait until the compile phase starts and the system's CPU utilization rises and stabilizes. Compare the workload's current CPU utilization to what it was before: The RPS didn't change, but its CPU utilization rose. The CPUs are spending significantly more time doing the same amount of work. This is one of the major contributing factors to the increased latency.
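If you want to put a number on the effect, divide the workload's CPU consumption by its RPS to get CPU time per request. The sketch below does that with hypothetical utilization figures - substitute the values you actually observe before and after starting the sysload.

    # CPU-seconds per request = (utilization fraction * nr CPUs) / RPS.
    NR_CPUS = 16                # hypothetical machine size
    RPS = 60.0                  # stays the same in both cases

    observations = {
        "alone":        0.40,   # hypothetical utilization without the sysload
        "with sysload": 0.55,   # hypothetical utilization at the same RPS
    }
    for label, util in observations.items():
        ms_per_req = util * NR_CPUS / RPS * 1000
        print(f"{label:>13}: ~{ms_per_req:.0f} CPU-milliseconds per request")

With these made-up numbers, the same request costs ~107 CPU-milliseconds running alone and ~147 with the sysload running: same work, more CPU time.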
Now let's run it as a sideload, with CPU headroom, by selecting the Reset latency target, stop linux build sysload and start it as sideload button, and see whether that makes any difference.
Once the compile phase starts, the workload's CPU utilization rises again, but noticeably less than in the prior attempt without CPU headroom. You can tune the headroom amount with the CPU headroom slider. Nudge it up and down, and observe how the workload's CPU utilization responds.
The specifics vary by CPU, but the relationship between headroom and main-workload latency usually resembles a hockey stick curve. As headroom is reduced, there's a point where the latency impact starts increasing noticeably. This is also the point where the CPUs are actually starting to get saturated, and where adding more work contributes more to slowing down overall execution than to increasing total bandwidth.
How Much Actual Work is the Sideload Doing?
Pushing up utilization with sideloading is nice, but how much actual work is it getting out of the system? Let's compare the completion times of a shorter build job when it can take up the whole system vs. running as a sideload. Select the Stop hashd and start allnoconfig linux build sysload button.
Monitor the progress in the "other logs" pane on the left. Depending on the machine, the build will take some tens of seconds. When the job finishes, it prints out how long the compilation part took, in a line similar to "Compilation took 10 seconds". If it's difficult to find in the left pane, open the log view with 'l' and select rd-sysload-compile-job. Record the duration: Our baseline is the time it takes to build the allnoconfig kernel when it can take up the whole machine.
Now, let's try running it as a sideload. But first, start hashd at 60% load by selecting the Start hashd at 60% load button.
Let it ramp up to the target load level. As our only interest is CPU, we don't need to wait for the memory footprint to grow. Now, let's start the build job again. Click the Start allnoconfig linux build sideload button.
Wait for it to finish and note the time as before. The log for this run is in rd-sideload-compile-job.
On a test machine with an AMD Ryzen 7 3800X (8 cores, 16 threads), the full machine run took 10s, while the sideloaded one took 30s. The comparison is skewed against the full machine run because the build job is so short and has phases that aren't parallel, but we got around 1/3 of full machine capacity out of the sideload. That's roughly in the ballpark given that the main workload was running at 60% of full machine capacity, though somewhat high given that we were also running with 20% headroom.
Let's try the same thing with a longer build job. If you're short on time, feel free to skip the following experiment and just read the results below from my test machine.
Click the Stop hashd and start defconfig linux build sysload button.
Wait for completion and note how long the compilation took, and then start hashd at 60% load by selecting the Start hashd at 60% load button.
Once it warms up, start the same build job as a sideload by selecting the Start defconfig linux build sideload button.
On the test machine, the full machine run took 81 seconds and the sideload run 305 seconds. That's ~27% of full machine capacity. 60% for hashd + 27% for the sideload adds up to 87% - still higher than one might expect given the 20% headroom. While experimental error could contribute some, the total amount of work done exceeding the raw utilization numbers is expected, given that the machine reaches saturation before wallclock-measured utilization hits 100%.
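The arithmetic behind those percentages is just the ratio of build times; here it is spelled out for both build jobs:

    # Sideload capacity share ~= full-machine build time / sideloaded build time.
    # Times are from the test machine described above.
    runs = {
        "allnoconfig": (10.0, 30.0),    # (full machine secs, sideload secs)
        "defconfig":   (81.0, 305.0),
    }
    for name, (full, side) in runs.items():
        print(f"{name}: sideload got ~{full / side:.0%} of full-machine capacity")

    hashd_load = 0.60
    defconfig_share = 81.0 / 305.0
    print(f"hashd + defconfig sideload ~= {hashd_load + defconfig_share:.0%}")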
This result indicates that we can fully utilize the machine without sacrificing much. The only cost we had to pay was a less than 5% increase in latency, and we got more than 25% extra work out of a machine that was already 60% loaded - a significant bang for the buck. If the average utilization in your fleet is lower, which is often the case, the bang is even bigger.
Read On
We examined the relationship between the CPU utilization number and the actual amount of work done, looked at CPU sub-resource contention, and measured how much extra work we can extract with sideloads. Use the Experiment with Sideloading section in the demo to test your own sideloading scenarios.