Btrfs at Facebook
Facebook is growing rapidly and, like all cloud computing services, faces resource utilization challenges.
This page describes some of the solutions Btrfs is providing in production now in Facebook's data centers.
Container isolation with Btrfs snapshots and images
Tupperware is Facebook's containerization solution. Tupperware previously set up a container filesystem from scratch every time it started a task, which was expensive in terms of IO and CPU.
Using Btrfs' snapshot features, Tupperware can now snapshot a preexisting Btrfs image.
There were three main problems the Tupperware Btrfs implementation tried to solve:
Lack of true process isolation: The creation of a container filesystem from scratch gives each task a separate view of the filesystem, but this view is not truly isolated from other tasks running on the machine. One task can easily affect the root filesystem of another task on the machine, without the user explicitly configuring the tasks to share pieces of the filesystem. Even worse, a task can overwrite data stored in packages used by other tasks.
High IO expense for administrative tasks: A large number of IO operations were required to restart, move, or update a Tupperware task. This IO was necessary even if all the data needed by the task was already on the destination machine. It also happened during task starts, which made matters worse because the task was down while the operations ran. In addition, Tupperware performed a number of expensive checks just to ensure that other tasks hadn't broken anything on the machine, a consequence of the poor isolation.
Determinism: In the previous setup, Tupperware performed many IO operations to prepare a task to run, and each of them could fail in many different ways. In addition, the configuration for these operations was not fully contained in the task specification; part of it came from outside configuration programs. As a result, many variables had to align in just the right way for the setup to work.
The image-based setup truly isolates tasks. One task cannot accidentally change the filesystem of another task on the host. In addition, a setup like this opens the door to true multi-tenancy, where different tasks run on the same server.
Only a small number of IO operations are needed in the image-based setup if all data needed to start the task is already on the host. Moreover, it's possible to effectively share the common pieces of the root filesystem between the tasks while having full isolation. Since task isolation is also now possible, checks that were needed only because of poor isolation are no longer needed, eliminating that IO overhead.
The image-based approach allows the encapsulation of all filesystem configuration of the task in the task spec itself. No matter where or when the task starts, the same operations are performed. An additional benefit is that more configurability is left in the hands of the task owner. The new setup is also designed around very few atomic operations: they either fully work and always produce the same result, or fail completely.
Because snapshotting in Btrfs is a constant-time operation, IO and CPU expense involved in container setup was drastically reduced. Today a large percentage of Tupperware tasks use Btrfs images.
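As a rough illustration of the snapshot step described above (a minimal sketch, not Tupperware's actual code), the following builds the `btrfs subvolume snapshot` command that gives a task its own writable copy of a prepared image. The pool layout, the paths, and the helper names are hypothetical; actually running the command requires root privileges and a Btrfs filesystem on which the base image is a subvolume.

```python
from pathlib import PurePosixPath
from typing import List


def task_root_for(pool: str, task_id: str) -> str:
    """Hypothetical layout: one snapshot per task under a pool directory."""
    return str(PurePosixPath(pool) / f"task-{task_id}")


def snapshot_command(base_image: str, task_root: str,
                     read_only: bool = False) -> List[str]:
    """Build argv for `btrfs subvolume snapshot [-r] <source> <dest>`.

    The snapshot shares all unmodified blocks with the base image, so
    creating it does not copy file data.
    """
    cmd = ["btrfs", "subvolume", "snapshot"]
    if read_only:
        cmd.append("-r")  # -r makes the new snapshot read-only
    cmd += [base_image, task_root]
    return cmd


if __name__ == "__main__":
    # Paths are illustrative assumptions. To execute for real:
    #   subprocess.run(cmd, check=True)
    cmd = snapshot_command("/images/base", task_root_for("/tasks", "42"))
    print(" ".join(cmd))
```

Because each task writes only to its own snapshot, changes made by one task never reach the base image or another task's root filesystem.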
IO control with cgroup2
cgroup2 is a Linux kernel mechanism to group workloads and control the amount of system resources allocated to each group. Increasing IO efficiency of the Tupperware shared pool with cgroup2 was another use case in which Btrfs played a critical role.
cgroup2 contains an IO controller that enables the specification of IO limits and thresholds for a given cgroup.
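To make the idea of limits and thresholds concrete, here is a minimal sketch that formats lines for two cgroup2 IO controller interface files: `io.max` (bandwidth and IOPS limits) and `io.latency` (a latency target in microseconds). The device numbers, the values, and the cgroup path mentioned in the comments are illustrative assumptions, and the source does not say which of these files the Tupperware shared pool actually uses.

```python
from pathlib import Path
from typing import Optional


def io_max_line(major: int, minor: int,
                rbps: Optional[int] = None, wbps: Optional[int] = None,
                riops: Optional[int] = None, wiops: Optional[int] = None) -> str:
    """Format an io.max entry: 'MAJ:MIN rbps=.. wbps=.. riops=.. wiops=..'.

    Keys left as None are omitted, so the kernel keeps their current value.
    """
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    parts = [f"{major}:{minor}"]
    parts += [f"{key}={value}" for key, value in limits.items() if value is not None]
    return " ".join(parts)


def io_latency_line(major: int, minor: int, target_usec: int) -> str:
    """Format an io.latency entry: 'MAJ:MIN target=<microseconds>'."""
    return f"{major}:{minor} target={target_usec}"


def write_setting(cgroup_dir: str, filename: str, line: str) -> None:
    """Write one setting into a cgroup directory (needs suitable privileges)."""
    Path(cgroup_dir, filename).write_text(line + "\n")


if __name__ == "__main__":
    # Device 8:16 and the numbers below are example values only.
    print(io_max_line(8, 16, rbps=2097152, wiops=120))
    print(io_latency_line(8, 16, 75))
    # Hypothetical cgroup path, shown for shape only:
    # write_setting("/sys/fs/cgroup/workload.slice", "io.latency",
    #               io_latency_line(8, 16, 75))
```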
The Tupperware shared pool used ext4 as a filesystem.
ext4 is journaled, which means that whenever it tries to modify data on the disk, it first writes to the on-disk journal to indicate that it’s about to make the change. After the change is committed to the disk, it creates another journal entry for the completion. This allows fast and graceful recoveries if the system crashes or loses power while those modifications are in progress.
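Unfortunately, for mostly historical reasons, journal updates in ext4 may have to wait for some associated data writes to finish. In the case of the Tupperware shared pool, this resulted in priority inversions: a high-priority workload's journal update could stall behind data writes from a throttled, lower-priority workload.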
The only practical solution in this case was to use a filesystem that lacks the additional steps for journaling, and Btrfs provided the solution. With Btrfs in place, priority inversions caused by the extra journaling steps were completely eliminated, allowing for comprehensive IO control with cgroup2.
The Btrfs implementation also brought the additional benefits of data integrity, fast rollback, and easier administration.
Nearly 100% of the Tupperware shared pool now uses Btrfs, while benefiting from IO control through cgroup2.
Transparent file compression for more disk capacity
Btrfs supports transparent file compression. There are three algorithms available: zlib, LZO, and Zstd.
Compression in Btrfs is on a file-by-file basis, so you can have, for example, a single Btrfs mount point with a mixture of files compressed using any or all of the available compression algorithms, along with uncompressed files.
To increase capacity, developer virtual machines (VMs) now use Btrfs' transparent compression support with the Zstd algorithm (another Facebook open source technology).
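The per-file nature of compression can be sketched with the `btrfs property set` command, which assigns a compression algorithm to an individual file or directory, alongside the `compress=` mount option that sets a filesystem-wide default. This is a minimal sketch under assumptions: the helper names and the example path are hypothetical, and the source does not say how the developer VMs are actually configured.

```python
from typing import List

# Compression algorithms supported by Btrfs.
ALGORITHMS = ("zlib", "lzo", "zstd")


def _check(algorithm: str) -> None:
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")


def set_compression_command(path: str, algorithm: str) -> List[str]:
    """Build argv for `btrfs property set <path> compression <algorithm>`.

    The property applies to data written after it is set; existing
    extents are not rewritten by this command.
    """
    _check(algorithm)
    return ["btrfs", "property", "set", path, "compression", algorithm]


def mount_option(algorithm: str) -> str:
    """Mount option that makes <algorithm> the default for new writes."""
    _check(algorithm)
    return f"compress={algorithm}"


if __name__ == "__main__":
    # Hypothetical path on a developer VM.
    print(" ".join(set_compression_command("/home/dev/build", "zstd")))
    print(mount_option("zstd"))
```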
Snapshotting to address capacity issues
Sandcastle, Facebook's continuous integration system, builds, tests, and lands code, and is one of the largest services running at Facebook.
A key problem was that making a fresh copy of the repository before a build and test and then deleting it afterwards took a significant amount of time, resulting in a capacity crunch for Sandcastle. This problem made Sandcastle another great candidate for Btrfs snapshotting.
Btrfs allowed the replacement of the recursive copy and delete of the repository with a snapshot and snapshot delete. Snapshotting is fast, and snapshot deletion in Btrfs is much more efficient than rm -r because we can clean things up at the block level in the background instead of file-by-file.
Snapshotting provided a huge win for Sandcastle, increasing system availability by reducing the amount of time needed to copy repositories.
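The snapshot-then-delete pattern can be sketched as a context manager that creates a workspace snapshot before a job and deletes it afterwards, even if the job fails. This is an illustrative sketch rather than Sandcastle's implementation: the paths and names are hypothetical, and it assumes the repository lives in its own Btrfs subvolume. The command runner is injectable so the flow can be inspected without touching a real filesystem.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def run_checked(cmd: List[str]) -> None:
    """Default runner: execute the command and raise on a non-zero exit."""
    subprocess.run(cmd, check=True)


@contextmanager
def snapshot_workspace(repo_subvol: str, workspace: str,
                       run: Runner = run_checked) -> Iterator[str]:
    """Give a job a private copy of the repository via a snapshot."""
    # Replaces the recursive copy of the repository.
    run(["btrfs", "subvolume", "snapshot", repo_subvol, workspace])
    try:
        yield workspace
    finally:
        # Replaces `rm -r`; Btrfs reclaims the blocks in the background.
        run(["btrfs", "subvolume", "delete", workspace])


if __name__ == "__main__":
    # Dry run: record the commands instead of executing them.
    recorded: List[List[str]] = []
    with snapshot_workspace("/data/repo", "/data/jobs/job-1",
                            run=recorded.append) as ws:
        print("building in", ws)
    for cmd in recorded:
        print(" ".join(cmd))
```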
Btrfs at Facebook has come a long way, but new plans are already in the works.
Btrfs' incremental backup support (aka send/receive) is now being implemented for updates to Tupperware images, saving even more network bandwidth and IO.
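For readers unfamiliar with send/receive, the sketch below shows the general shape of an incremental transfer: `btrfs send -p <parent> <snapshot>` emits only the differences between two read-only snapshots, and `btrfs receive <dir>` applies that stream on the destination. The paths and function names are hypothetical, and the sketch pipes the stream locally; how Tupperware image updates are actually transported between hosts is not described in the source.

```python
import subprocess
from typing import List, Optional, Tuple


def send_receive_commands(snapshot: str, dest_dir: str,
                          parent: Optional[str] = None
                          ) -> Tuple[List[str], List[str]]:
    """Build argv for `btrfs send [-p <parent>] <snapshot>` and `btrfs receive <dest>`.

    Both snapshots must be read-only. With a parent, only the delta is sent.
    """
    send = ["btrfs", "send"]
    if parent is not None:
        send += ["-p", parent]
    send.append(snapshot)
    receive = ["btrfs", "receive", dest_dir]
    return send, receive


def transfer(snapshot: str, dest_dir: str, parent: Optional[str] = None) -> None:
    """Pipe the send stream into receive on the same host (needs root and Btrfs)."""
    send_cmd, recv_cmd = send_receive_commands(snapshot, dest_dir, parent)
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv_cmd, stdin=sender.stdout)
    sender.stdout.close()  # let the sender see a closed pipe if receive exits early
    receiver_rc = receiver.wait()
    sender_rc = sender.wait()
    if sender_rc != 0 or receiver_rc != 0:
        raise RuntimeError(f"send exited {sender_rc}, receive exited {receiver_rc}")


if __name__ == "__main__":
    # Illustrative image versions; v1 is assumed to exist on both sides.
    send_cmd, recv_cmd = send_receive_commands("/images/v2", "/mnt/images",
                                               parent="/images/v1")
    print(" ".join(send_cmd), "|", " ".join(recv_cmd))
```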
In addition, projects exploiting the following Btrfs features are also underway:
Incremental backup system for more efficient backups.
Data deduplication to save even more disk space.
Quota support for better support of multi-tenant machines.
Checksum scrubbing to more proactively monitor disk health.
Our long-term goal is for every machine at Facebook to have a Btrfs root filesystem.