Linux 7.2 Can Significantly Lower Container Exit/Unmount Latency

Alibaba engineer Baokun Li tracked down a possible race condition that occurs when a container exits and addressed it with a now-merged patch. That portion of the work should also be back-ported to the current Linux stable kernel series in the near future. What's most exciting, though, is the additional work that eliminates a global serialization penalty and can lead to much lower container exit/unmount latency.
Christian Brauner summed up the situation in this pull request that is now merged for Linux 7.2:
"Fix a race between cgroup_writeback_umount() and inode_switch_wbs()
When a container exits, a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes after unmount" followed by a use-after-free on percpu counters. There is a window between inode_prepare_wbs_switch() returning true (having passed the SB_ACTIVE check and grabbed the inode) and the subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes the global isw_nr_in_flight counter as non-zero but flush_workqueue() finds nothing queued yet, it returns early - leaving a held inode reference that blocks evict_inodes() and a later iput() that hits freed percpu counters.
The race is closed by covering the window from inode_prepare_wbs_switch() through wb_queue_isw() with an RCU read-side critical section and synchronizing in the umount path. On top of that the now-dead rcu_barrier() left over from the queue_rcu_work() era is removed, and the global synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb in-flight counter plus pin/unpin/drain helpers so umount no longer serializes against switch activity on unrelated superblocks.
Under cgroup writeback churn on a 16 vCPU guest this takes umount latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost of cgroup_writeback_umount() from ~62ms to ~4us per call. The initial race fix is kept separate and minimal so it backports cleanly to stable trees that still queue switches via queue_rcu_work()."
Quite a nice improvement for the unmount latency.
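To make the per-superblock counter idea from the pull request more concrete, here is a minimal user-space sketch in C11. It is not the kernel code: the `toy_sb` structure and the `toy_sb_pin()`, `toy_sb_unpin()` and `toy_sb_drain()` helpers are hypothetical names invented for illustration, and a simple spin-wait stands in for whatever waiting mechanism the real patch uses. It only shows why a drain on one superblock does not need to wait on activity against another.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a superblock; not the kernel's struct super_block. */
struct toy_sb {
    atomic_bool active;        /* cleared once unmount has started */
    atomic_int  isw_in_flight; /* switches pinned against this sb only */
};

static void toy_sb_init(struct toy_sb *sb)
{
    atomic_init(&sb->active, true);
    atomic_init(&sb->isw_in_flight, 0);
}

/*
 * Pin the sb before starting a switch. The counter is raised first and the
 * active flag checked second, so a concurrent drain either sees the raised
 * counter and waits, or this caller sees the cleared flag and backs off.
 */
static bool toy_sb_pin(struct toy_sb *sb)
{
    atomic_fetch_add(&sb->isw_in_flight, 1);
    if (!atomic_load(&sb->active)) {
        atomic_fetch_sub(&sb->isw_in_flight, 1);
        return false;
    }
    return true;
}

/* Drop the pin once the switch has finished (or has been handed off). */
static void toy_sb_unpin(struct toy_sb *sb)
{
    int prev = atomic_fetch_sub(&sb->isw_in_flight, 1);
    assert(prev > 0);
    (void)prev;
}

/*
 * Unmount side: refuse new pins, then wait only for this sb's in-flight
 * switches. Pins held against other superblocks are never consulted.
 */
static void toy_sb_drain(struct toy_sb *sb)
{
    atomic_store(&sb->active, false);
    while (atomic_load(&sb->isw_in_flight) > 0)
        sched_yield();
}
```

With two such objects, draining one whose pins have all been released returns immediately even while the other still has switches in flight, which is the property a single global counter cannot offer.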
There are also additional benchmark numbers from this patch.
Separately, that same VFS pull request for Linux 7.2 also improves write performance when using the RWF_DONTCACHE flag. Those benchmark numbers and more details can be found within this patch.
