Observability of filesystem syncing and device flushing

From: Gregory Szorc @ 2021-06-16  4:59 UTC
  To: linux-block, linux-fsdevel

One thing I've learned from optimizing I/O performance of userland
applications is that a surprising number of applications are
effectively "fsync() benchmarks." What I mean by that is many
applications call fsync() - and similar functionality incurring a
flush to non-volatile storage - surprisingly often and that overall
I/O performance is effectively bottlenecked by how quickly the
underlying storage device can process repeated flushes.

You can easily confirm the validity of this statement by running
sync/flush heavy software (like pretty much any database software)
with eatmydata [1], an LD_PRELOAD library / utility that effectively
short-circuits libc functions like fsync(), fdatasync(), and msync()
so they no-op instead of triggering a sync/flush [via a syscall].
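
To make the "fsync() benchmark" effect concrete, here is a minimal
sketch that times the same sequence of small writes with and without
an fsync() after each one. The function names, record size, and
record count are hypothetical choices for illustration only; they do
not come from any particular application.

```python
import os
import tempfile
import time


def timed_writes(path, count, payload, sync):
    """Write `count` records to `path`, optionally calling fsync() after each."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            if sync:
                # Ask the kernel to push this file's data to stable storage.
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(count=100, size=4096):
    """Return (buffered_seconds, synced_seconds) for the same write pattern."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        buffered = timed_writes(path, count, payload, sync=False)
        synced = timed_writes(path, count, payload, sync=True)
    return buffered, synced


if __name__ == "__main__":
    buffered, synced = compare()
    print(f"buffered: {buffered:.4f}s  fsync per write: {synced:.4f}s")
```

The gap between the two timings depends heavily on the storage device
and filesystem; if the temporary directory lives on tmpfs, fsync() is
nearly free and the gap mostly disappears. Running the script under
eatmydata should, in principle, shrink the gap in a similar way,
because os.fsync() ends up calling the libc fsync() that eatmydata
intercepts.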

The kernel exposes various statistics about block devices,
filesystems, the page cache, etc. via procfs. However, as far as I can
tell, there are either no direct statistics on sync/flush operations
or only very limited ones. I've often wanted to capture/view system/kernel-level
operations. I've often wanted to capture/view system/kernel-level
activity for these sync/flush operations because they are often strong
contributors to I/O bottlenecks and it can be extremely useful to
correlate these operations against other metrics.
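
As an example of the limited statistics that do exist, the sketch
below reads the per-device flush counters from /proc/diskstats. It
assumes a kernel that emits the two flush fields documented in the
kernel's iostats documentation (flush requests completed and
milliseconds spent flushing, added in 5.5 as far as I know); on such
kernels each line has at least 20 whitespace-separated columns. The
function names are hypothetical.

```python
import os


def parse_flush_stats(diskstats_text):
    """Map device name -> (flush requests completed, milliseconds spent flushing).

    Lines with fewer than 20 columns (older kernels without flush fields)
    are skipped rather than guessed at.
    """
    stats = {}
    for line in diskstats_text.splitlines():
        cols = line.split()
        if len(cols) < 20:
            continue
        # cols[0..2] are major, minor, device name; stat field N is cols[N + 2],
        # so fields 16 and 17 (the flush fields) are cols[18] and cols[19].
        stats[cols[2]] = (int(cols[18]), int(cols[19]))
    return stats


def read_flush_stats(path="/proc/diskstats"):
    """Read and parse flush counters from the given diskstats file."""
    with open(path) as f:
        return parse_flush_stats(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/diskstats"):
        for dev, (count, ms) in sorted(read_flush_stats().items()):
            print(f"{dev}: {count} flushes, {ms} ms")
    else:
        print("/proc/diskstats not available on this system")
```

These counters only describe flush requests as seen by the block
layer for whole devices. As far as I can tell, they do not say which
filesystem, process, or syscall caused a flush, which is the kind of
correlation described above.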

I've long considered sending kernel patches to add procfs
observability for sync/flushing activity so common system monitoring
tools can easily consume those metrics. I'm emailing linux-block@ and
linux-fsdevel@ to a) try to assess whether these patches would be
well-received and b) gain feedback on the appropriate activity to
track (I'm not a kernel expert).

I think a reasonable response is "you can already observe this
activity via syscall tracing and using tools like systemtap, so no
additional procfs counters are needed." I'd counter by saying, yes,
you can get some valuable information this way. But I believe there
are some large holes, such as limited visibility into block layer
activity outside of debugfs and into work that kernel subsystems
(such as background flusher threads) perform independently of any
observable syscall.

Do others feel there is an observability gap in the kernel around
filesystem syncing and storage flushing? If so, how would you
recommend improving matters?

Gregory

[1] https://www.flamingspork.com/projects/libeatmydata/
