The BPF verifier guarantees that a program itself is safe for the kernel to execute, but to use BPF as a whole safely and without surprises, users need to understand the lifetime of BPF programs and maps. This post covers these details in depth.
File descriptors and reference counters
BPF objects (progs, maps, and debug info) are accessed by user space via file descriptors (FDs), and each object has a reference counter (refcnt). For example, when a map is created with a call to bpf_create_map(), the kernel allocates a struct bpf_map object, initializes its refcnt to 1, and returns a file descriptor to the user space process. If the process exits or crashes right after, the FD is closed and the refcnt of the map object is decremented. The refcnt is now zero, which triggers freeing of the object's memory after an RCU grace period.
BPF programs that use BPF maps are loaded in two phases. First, the maps are created and their FDs are stored as part of the BPF program in the 'imm' field of BPF_LD_IMM64 instructions. Second, when the kernel verifies the program, it increments the refcnt of each map used by the program and initializes the program's refcnt to 1. At this point user space can close the FDs associated with the maps, but the maps will not disappear since the program is “using” them (even though the program is not yet attached anywhere). When the prog FD is closed and the program's refcnt reaches zero, the destruction logic iterates over all maps used by the program and decrements their refcnts. This scheme allows the same map to be used by multiple programs at once (even programs of different types). For example, a tracing program attached to a tracepoint can collect data into a BPF map, while a networking program uses that information to make forwarding decisions.
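The bookkeeping described above can be sketched as a toy user-space model. This is not kernel code: the toy_* names, the MAX_USED_MAPS limit, and the freed flag are all made up for illustration, and only the reference counts are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_USED_MAPS 4   /* arbitrary limit for this sketch */

/* Toy stand-ins for kernel objects; only the refcount is modeled. */
struct toy_map {
    int refcnt;
    bool freed;
};

struct toy_prog {
    int refcnt;
    bool freed;
    struct toy_map *used_maps[MAX_USED_MAPS];
    size_t nr_maps;
};

/* Creating a map hands one reference to the FD returned to user space. */
static void toy_map_create(struct toy_map *m)
{
    m->refcnt = 1;
    m->freed = false;
}

static void toy_map_get(struct toy_map *m)
{
    assert(!m->freed);
    m->refcnt++;
}

static void toy_map_put(struct toy_map *m)
{
    assert(!m->freed && m->refcnt > 0);
    if (--m->refcnt == 0)
        m->freed = true;   /* the real kernel waits for an RCU grace period */
}

/* Loading a program takes a reference on every map it uses. */
static void toy_prog_load(struct toy_prog *p, struct toy_map **maps, size_t n)
{
    assert(n <= MAX_USED_MAPS);
    p->refcnt = 1;
    p->freed = false;
    p->nr_maps = n;
    for (size_t i = 0; i < n; i++) {
        toy_map_get(maps[i]);
        p->used_maps[i] = maps[i];
    }
}

/* Dropping the last program reference releases the maps it was using. */
static void toy_prog_put(struct toy_prog *p)
{
    assert(!p->freed && p->refcnt > 0);
    if (--p->refcnt == 0) {
        for (size_t i = 0; i < p->nr_maps; i++)
            toy_map_put(p->used_maps[i]);
        p->freed = true;
    }
}

/* Returns true if the map outlives its own FD and is freed with the program. */
static bool toy_two_phase_scenario(void)
{
    struct toy_map map;
    struct toy_prog prog;
    struct toy_map *maps[] = { &map };

    toy_map_create(&map);            /* map refcnt 1, held by the map FD */
    toy_prog_load(&prog, maps, 1);   /* map refcnt 2, prog refcnt 1 */
    toy_map_put(&map);               /* close(map_fd): map refcnt 1 */
    bool survived = !map.freed;      /* the program still "uses" the map */
    toy_prog_put(&prog);             /* close(prog_fd): prog and map go away */
    return survived && prog.freed && map.freed;
}
```

In this model, closing the map FD before the prog FD leaves the map alive, and the map is released only when the last program reference is dropped.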
When a program is attached to some hook, the refcnt of the program is incremented. The user space process that created the BPF maps and program, loaded the program, and attached it to the hook can now exit. The maps and program it created will stay alive, since the program has a refcnt > 0 and the program in turn holds references to its maps. This is the BPF object lifecycle! As long as the refcnt of a BPF object (program or map) is > 0, the kernel will keep it alive.
Not all attachment points are created equal, though. XDP, tc's clsact, and cgroup-based hooks are global. Programs stay attached to global attachment points for as long as the objects hosting those attachment points are alive. In the case of clsact, the program is attached to an ingress or egress qdisc. If nothing runs, for example, tc qdisc del, the program will keep processing ingress or egress packets even when there is no corresponding user space process. On the other hand, programs attached to, for example, tracing hooks only run for the lifetime of the process that holds the FD of the tracing event. Say your program gets an FD from perf_event_open() and then does ioctl(perf_event_fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) (or, alternatively, gets an FD from bpf_raw_tracepoint_open("tracepoint_name", bpf_prog_fd)). These FDs are local to the process, and therefore if the process crashes the perf_event_fd or bpf_raw_tracepoint_fd will be closed. In this scenario, the kernel will detach the BPF program and decrement its refcnt.
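The difference between the two kinds of attachment can be sketched with another toy model. Again, this is an illustration under assumptions rather than kernel code: the toy_* names and the two-value enum are hypothetical, and "process exit" is reduced to closing the prog FD plus any FD that owns the attachment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of hooks for this sketch. */
enum toy_attach_kind {
    TOY_ATTACH_GLOBAL,     /* e.g. XDP, tc clsact, cgroup */
    TOY_ATTACH_FD_LOCAL    /* e.g. perf_event or raw_tracepoint FD */
};

struct toy_prog {
    int refcnt;
};

struct toy_attachment {
    struct toy_prog *prog;         /* NULL when nothing is attached */
    enum toy_attach_kind kind;
};

/* Attaching takes a reference on the program. */
static void toy_attach(struct toy_attachment *a, struct toy_prog *p,
                       enum toy_attach_kind kind)
{
    assert(a->prog == NULL);
    p->refcnt++;
    a->prog = p;
    a->kind = kind;
}

/* Detaching drops that reference again. */
static void toy_detach(struct toy_attachment *a)
{
    assert(a->prog != NULL);
    a->prog->refcnt--;
    a->prog = NULL;
}

/*
 * Model of the loader process exiting: its prog FD is closed, and an
 * FD-local attachment is torn down because the FD that owns it is closed too.
 */
static void toy_process_exit(struct toy_prog *p, struct toy_attachment *a)
{
    p->refcnt--;                                   /* close(prog_fd) */
    if (a->prog == p && a->kind == TOY_ATTACH_FD_LOCAL)
        toy_detach(a);                             /* close(event_fd) */
}

/* Returns the program's refcnt after its loader process has exited. */
static int toy_refcnt_after_exit(enum toy_attach_kind kind)
{
    struct toy_prog prog = { .refcnt = 1 };        /* freshly loaded */
    struct toy_attachment hook = { .prog = NULL, .kind = kind };

    toy_attach(&hook, &prog, kind);                /* refcnt 2 */
    toy_process_exit(&prog, &hook);
    return prog.refcnt;
}
```

With a global attachment the model leaves the program at refcnt 1 after the loader exits, so it keeps running. With an FD-local attachment the refcnt drops to 0 and the program would be freed.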
In summary: xdp, tc, lwt, and cgroup hooks are “global”, whereas kprobe, uprobe, tracepoint, perf_event, raw_tracepoint, socket filter, and so_reuseport hooks are “local” to the process, since they're accessed via FD. Note that people have requested a new type of cgroup object that is FD based, so in the future cgroup-bpf might be both “global” and “local”.
The main advantage of the FD based interface is auto-cleanup: if anything goes wrong with the user space process, the kernel cleans up all objects automatically. The original BPF API was FD based from the beginning. However, while deploying kprobe/uprobe + bpf in production it became clear that the global interface of [ku]probe was too cumbersome to keep working around. Hence, an FD based [ku]probe API was introduced in the kernel. A similar situation may happen soon with the cgroup API: there could be a cgroup object that a process holds with an FD, with BPF programs attached to that FD (object) instead of to a global cgroup entity.
The FD based API is useful for networking as well. Some time ago a Facebook Widely Deployed Binary (WDB) had a bug where it 'forgot' to clean up its tc clsact BPF program. As a result, after many daemon restarts there were thousands of BPF programs attached to the same clsact egress hook, all doing the same work. All of these programs except one were running without a corresponding user space process and were wasting CPU cycles. The problem was eventually noticed because overall system performance gradually degraded. An FD based networking API would have prevented such a situation. Note that there is ongoing work to introduce a new tc clsact-like API that adds hooks into the ingress and egress paths of network devices, but is not actually based on tc and is FD based with auto-cleanup properties.
There is another way to keep BPF programs and maps alive. It's called BPFFS, or “BPF File System”. A user space process can “pin” a BPF program or map in BPFFS under an arbitrary name. Pinning a BPF object in BPFFS increments the refcnt of the object, so the object stays alive even if the pinned BPF program is not attached anywhere or the pinned BPF map is not used by any program. An example of where this is useful comes from networking. You may have BPF programs that can do their packet processing duties without a user space daemon, but the admin may want to log in to the host and examine a map from time to time. The admin can call bpf_obj_get("path") on the object in BPFFS, which returns a new FD (and a corresponding handle to the object). To unpin the object, just delete the file in BPFFS with unlink() and the kernel will decrement the corresponding refcnt.
In summary:
obj create → refcnt=1
attach → refcnt++
detach → refcnt--
obj_pin → refcnt++
unlink → refcnt--
close → refcnt--
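The summary rules can be written down as a small transition function and replayed over a sequence of events. This is only a sketch: the toy_event names are invented labels for the six operations listed above, and the two example sequences are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One label per operation in the summary; names are made up for this sketch. */
enum toy_event {
    TOY_CREATE,
    TOY_ATTACH,
    TOY_DETACH,
    TOY_OBJ_PIN,
    TOY_UNLINK,
    TOY_CLOSE
};

/* Apply a single event to a refcnt and return the new value. */
static int toy_apply(int refcnt, enum toy_event ev)
{
    switch (ev) {
    case TOY_CREATE:
        return 1;
    case TOY_ATTACH:
    case TOY_OBJ_PIN:
        return refcnt + 1;
    case TOY_DETACH:
    case TOY_UNLINK:
    case TOY_CLOSE:
        assert(refcnt > 0);   /* cannot release a reference nobody holds */
        return refcnt - 1;
    }
    return refcnt;
}

/* Replay a sequence of events, starting from an object that does not exist. */
static int toy_replay(const enum toy_event *evs, size_t n)
{
    int refcnt = 0;

    for (size_t i = 0; i < n; i++)
        refcnt = toy_apply(refcnt, evs[i]);
    return refcnt;
}

/* A map that is pinned in BPFFS before its creator closes the FD. */
static int toy_pinned_map_after_close(void)
{
    const enum toy_event evs[] = { TOY_CREATE, TOY_OBJ_PIN, TOY_CLOSE };

    return toy_replay(evs, sizeof(evs) / sizeof(evs[0]));
}

/* The same map after an admin later removes the pin with unlink(). */
static int toy_pinned_map_after_unlink(void)
{
    const enum toy_event evs[] = {
        TOY_CREATE, TOY_OBJ_PIN, TOY_CLOSE, TOY_UNLINK
    };

    return toy_replay(evs, sizeof(evs) / sizeof(evs[0]));
}
```

In this model the pinned map stays at refcnt 1 once its creator closes the FD, and only the later unlink() brings it to 0.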
Detach and replace
An important part of understanding the lifetime of a BPF program is detach. Detaching a program from a hook prevents the previously attached program from executing on any future events. Invocations of the program that are already executing may still be running when the detach command returns, and they will run to completion.
Another important aspect of program lifetime is replace. cgroup-bpf hooks allow a program to be replaced. The kernel guarantees that either the old or the new program will be processing events, but there is a window in which the old program may be executing on one CPU while the new program for the same hook is executing on another CPU. There is no “atomic” replacement. To cope with this window, some sophisticated BPF developers use the following scheme. The new program is loaded with the same maps that the old program is using, so at replacement time the operational data is the same and only the program text is being replaced. When old and new are similar enough and the new program doesn't introduce new data structures, this approach works very well. For example, the old program could be compiled with debug off while the new program is essentially the same program with debug on. Such a replacement is safe even though both programs may run on different CPUs at the same time.
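The shared-map replacement scheme can be illustrated with a sequential toy model. Everything here is hypothetical (the toy_* names, the packets counter, the debug flag), and the two "CPUs" are just two local variables that sampled the hook at different times. Real programs running concurrently would also need atomic or per-CPU map updates, which this sketch does not model.

```c
#include <stddef.h>

/* Operational data shared by the old and the new program. */
struct toy_map {
    long packets;
};

/* The two programs differ only in "program text": debug off vs. debug on. */
struct toy_prog {
    struct toy_map *map;
    int debug;
    long debug_lines;
};

/* One invocation of a program on one event. */
static void toy_prog_run(struct toy_prog *p)
{
    p->map->packets++;
    if (p->debug)
        p->debug_lines++;
}

struct toy_hook {
    struct toy_prog *prog;
};

/*
 * Returns the packet count seen in the shared map after an event was
 * handled by the old program and another by the new one.
 */
static long toy_replace_scenario(void)
{
    struct toy_map shared = { 0 };
    struct toy_prog old_prog = { &shared, 0, 0 };
    struct toy_prog new_prog = { &shared, 1, 0 };
    struct toy_hook hook = { &old_prog };

    struct toy_prog *cpu0 = hook.prog;   /* "CPU 0" picked up the old program */
    hook.prog = &new_prog;               /* replacement happens here */
    struct toy_prog *cpu1 = hook.prog;   /* "CPU 1" picks up the new program */

    toy_prog_run(cpu0);                  /* old program finishes its event */
    toy_prog_run(cpu1);                  /* new program handles the next one */

    return shared.packets;
}
```

Because both programs point at the same map, neither event is lost from the operational data, regardless of which program handled it.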