* Regression in io_uring, leading to data corruption
@ 2023-11-07 16:34 Timothy Pearson
  2023-11-07 16:49 ` Jens Axboe
  2023-11-07 21:22 ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 16:34 UTC (permalink / raw)
  To: regressions, Jens Axboe, Pavel Begunkov

I have spent considerable effort tracking down a bug that appears to be present in the io_uring workqueue.  As I have not yet been able to isolate the exact cause, I would like to solicit ideas from the developers / maintainers of the io_uring system.  This regression persists into the latest kernel GIT head, and is only reliably reproducible under fairly exacting conditions.

In commit 685fe7fe the workqueue manager thread was removed and replaced with code that allows the workqueues to manage their own workers.  This has the unfortunate side effect of exposing what I believe to be an existing timing-dependent race condition somewhere else within the kernel.  On a ppc64el host, I can reliably trigger data corruption on what I believe to be writes by running the following mysql mtr sequence:

./mtr encryption.innodb-discard-import --repeat=100 --force

This results in corruption of the data being written to disk -- reverting 685fe7fe resolves the issue by (I believe) masking it through changes in workqueue inter-thread timing.

I can make the corruption disappear by adding a 1ms busy wait delay into io_wqe_dec_running().  This appears to alter the timing of something in the io_uring system just enough to make the (presumed) data race disappear.  KASAN and KCSAN do not show any issues, nor does the lock debugger, yet a corruption problem that disappears with a delay is indicative of a race somewhere.  The delay primarily impacts how long the IRQ lock is held; if the delay is moved outside of the IRQ-locked section, the corruption returns.
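
For reference, the instrumentation is essentially the following (just a sketch to show the shape of the hack -- the real change sits inside io_wqe_dec_running(), and the lock/field names there vary by kernel version):

#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/atomic.h>

static DEFINE_RAW_SPINLOCK(demo_lock);

/* Illustration only, not the actual io_wqe_dec_running() */
static void demo_dec_running(atomic_t *nr_running)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&demo_lock, flags);
	mdelay(1);		/* ~1ms busy wait: corruption disappears */
	atomic_dec(nr_running);	/* real accounting / worker creation here */
	raw_spin_unlock_irqrestore(&demo_lock, flags);

	/* moving the mdelay() after the unlock brings the corruption back */
}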

I have already tried adding memory barriers etc. to the code paths in question, with no effect.  The exact same issue persists on the latest kernel versions.

Thoughts welcome -- this is a serious issue causing data corruption on production systems.

Thank you!


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
@ 2023-11-07 16:49 ` Jens Axboe
  2023-11-07 16:57   ` Timothy Pearson
  2023-11-07 21:22 ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 16:49 UTC (permalink / raw)
  To: Timothy Pearson, regressions, Pavel Begunkov

On 11/7/23 9:34 AM, Timothy Pearson wrote:
> I have spent some considerable effort tracking down a bug that appears
> to be present in the io_uring workqueue.  As I have not yet been able
> to isolate the exact cause, I would like to solicit ideas from the
> developers / maintainers of the io_uring system.  This regression
> persists into the latest kernel GIT head, and is only reliably
> reproduceable under fairly exacting conditions.
> 
> In GIT hash 685fe7fe the workqueue manager thread was removed and
> replaced with code that allows the workqueues to manage their own
> workers.  This has the unfortunate side effect of exposing what I
> believe to be an existing timing-dependent race condition somewhere
> else within the kernel.  On a ppc64el host, I can reliably trigger
> data corruption on what I believe to be writes by running the
> following mysql mtr sequence:
> 
> ./mtr encryption.innodb-discard-import --repeat=100 --force
> 
> This results in corruption of the data being written to disk --
> reverting 685fe7fe resolves the issue by (I believe) masking it
> through changes in workqueue inter-thread timing.
> 
> I can make the corruption disappear by adding a 1ms busy wait delay
> into io_wqe_dec_running().  This appears to alter the timing of
> something in the io_uring system just enough to make the (presumed)
> data race disappear.  KASAN and KCSAN do not show any issues, nor does
> the lock debugger, yet a corruption problem that disappears with a
> delay is indicative of a race somewhere.  The delay primary impacts
> how long the IRQ lock is held, if the delay is moved outside of the
> IRQ locked section the corruption returns.
> 
> I have already tried adding memory barriers etc. to the code paths in
> question, with no effect.  The exact same issue persists on the latest
> kernel versions.
> 
> Thoughts welcome -- this is a serious issue causing data corruption on
> production systems.

I looked into this for quite a while back in March; see my initial
postings on it here:

https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/

It unfortunately never got anywhere, and as far as I can tell, this is
most likely a page cache or ordering issue on the ppc side. I no longer
have hardware to test with, and don't really have a huge inclination to
dive into this again, as it's hugely time consuming and doesn't seem to
be an io_uring issue to begin with, but I'd be happy to help out with
this.

Back then I looked into getting some ppc hardware to test with for
this very reason, and even reached out to various manufacturers to see
if they would be able to lend/give me some. Didn't pan out, and ended
up using a university vm for it.

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:49 ` Jens Axboe
@ 2023-11-07 16:57   ` Timothy Pearson
  2023-11-07 17:14     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 16:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 10:49:34 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>> I have spent some considerable effort tracking down a bug that appears
>> to be present in the io_uring workqueue.  As I have not yet been able
>> to isolate the exact cause, I would like to solicit ideas from the
>> developers / maintainers of the io_uring system.  This regression
>> persists into the latest kernel GIT head, and is only reliably
>> reproduceable under fairly exacting conditions.
>> 
>> In GIT hash 685fe7fe the workqueue manager thread was removed and
>> replaced with code that allows the workqueues to manage their own
>> workers.  This has the unfortunate side effect of exposing what I
>> believe to be an existing timing-dependent race condition somewhere
>> else within the kernel.  On a ppc64el host, I can reliably trigger
>> data corruption on what I believe to be writes by running the
>> following mysql mtr sequence:
>> 
>> ./mtr encryption.innodb-discard-import --repeat=100 --force
>> 
>> This results in corruption of the data being written to disk --
>> reverting 685fe7fe resolves the issue by (I believe) masking it
>> through changes in workqueue inter-thread timing.
>> 
>> I can make the corruption disappear by adding a 1ms busy wait delay
>> into io_wqe_dec_running().  This appears to alter the timing of
>> something in the io_uring system just enough to make the (presumed)
>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>> the lock debugger, yet a corruption problem that disappears with a
>> delay is indicative of a race somewhere.  The delay primary impacts
>> how long the IRQ lock is held, if the delay is moved outside of the
>> IRQ locked section the corruption returns.
>> 
>> I have already tried adding memory barriers etc. to the code paths in
>> question, with no effect.  The exact same issue persists on the latest
>> kernel versions.
>> 
>> Thoughts welcome -- this is a serious issue causing data corruption on
>> production systems.
> 
> I looked into this for quite a while back in March, see my initial
> postings on it here:
> 
> https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/
> 
> it unfortunately never got anywhere, and as far as I can tell, this is
> most likely a page cache or ordering issue on the ppc side. I no longer
> have hardware to test with, and not really a huge inclination to dive
> into this again as it's hugely time consuming and doesn't seem to be an
> io_uring issue to begin with, but I'd be happy to help out with this.
> 
> Back then I looked into getting some ppc hardware to test with for
> this very reason, and even reached out to various manufacturers to see
> if they would be able to lend/give me some. Didn't pan out, and ended
> up using a university vm for it.
> 
> --
> Jens Axboe

Understood.  I think between the pinning and the findings above, plus the fact that (IIRC) this seemed to disappear in SMT1 mode, I may have a better idea of where to look.  The pinning "fixing" things is something I wasn't aware of, and it will significantly reduce debug effort on this end -- thanks for the pointer!

In the future, Raptor is more than willing to offer bare metal access to ppc64el test machines at no cost.  I was unaware of the need, so I couldn't respond earlier.

Thanks again!


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:57   ` Timothy Pearson
@ 2023-11-07 17:14     ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 17:14 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 9:57 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 10:49:34 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>> I have spent some considerable effort tracking down a bug that appears
>>> to be present in the io_uring workqueue.  As I have not yet been able
>>> to isolate the exact cause, I would like to solicit ideas from the
>>> developers / maintainers of the io_uring system.  This regression
>>> persists into the latest kernel GIT head, and is only reliably
>>> reproduceable under fairly exacting conditions.
>>>
>>> In GIT hash 685fe7fe the workqueue manager thread was removed and
>>> replaced with code that allows the workqueues to manage their own
>>> workers.  This has the unfortunate side effect of exposing what I
>>> believe to be an existing timing-dependent race condition somewhere
>>> else within the kernel.  On a ppc64el host, I can reliably trigger
>>> data corruption on what I believe to be writes by running the
>>> following mysql mtr sequence:
>>>
>>> ./mtr encryption.innodb-discard-import --repeat=100 --force
>>>
>>> This results in corruption of the data being written to disk --
>>> reverting 685fe7fe resolves the issue by (I believe) masking it
>>> through changes in workqueue inter-thread timing.
>>>
>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>> into io_wqe_dec_running().  This appears to alter the timing of
>>> something in the io_uring system just enough to make the (presumed)
>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>> the lock debugger, yet a corruption problem that disappears with a
>>> delay is indicative of a race somewhere.  The delay primary impacts
>>> how long the IRQ lock is held, if the delay is moved outside of the
>>> IRQ locked section the corruption returns.
>>>
>>> I have already tried adding memory barriers etc. to the code paths in
>>> question, with no effect.  The exact same issue persists on the latest
>>> kernel versions.
>>>
>>> Thoughts welcome -- this is a serious issue causing data corruption on
>>> production systems.
>>
>> I looked into this for quite a while back in March, see my initial
>> postings on it here:
>>
>> https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/
>>
>> it unfortunately never got anywhere, and as far as I can tell, this is
>> most likely a page cache or ordering issue on the ppc side. I no longer
>> have hardware to test with, and not really a huge inclination to dive
>> into this again as it's hugely time consuming and doesn't seem to be an
>> io_uring issue to begin with, but I'd be happy to help out with this.
>>
>> Back then I looked into getting some ppc hardware to test with for
>> this very reason, and even reached out to various manufacturers to see
>> if they would be able to lend/give me some. Didn't pan out, and ended
>> up using a university vm for it.
>>
>> --
>> Jens Axboe
> 
> Understood.  I think between the pinning and the findings above, plus
> the fact that (IIRC) this seemed to disappear in SMT1 mode, I may have
> some better idea of where to look.  The pinning "fixing" things is
> something I wasn't aware of and will significantly reduce debug effort
> on this end, thanks for the pointer!

It's been some months since then so I don't recall all the details, but
at least there are some emails that cover some of it. I too tried a
bunch of things similar to what you looked at, but even a full hard
barrier before inserting the work item, and one before retrieving it on
the other end, didn't do anything. Hope you'll have better luck. And
like I said, I'm happy to help out, if I can.

> In the future, Raptor is more than willing to offer bare metal access
> to test machines for ppc64el at no cost.  I was unaware of the need so
> couldn't respond.

Good to know!

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
  2023-11-07 16:49 ` Jens Axboe
@ 2023-11-07 21:22 ` Jens Axboe
  2023-11-07 21:39   ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 21:22 UTC (permalink / raw)
  To: Timothy Pearson, regressions, Pavel Begunkov

On 11/7/23 9:34 AM, Timothy Pearson wrote:
> I can make the corruption disappear by adding a 1ms busy wait delay
> into io_wqe_dec_running().  This appears to alter the timing of
> something in the io_uring system just enough to make the (presumed)
> data race disappear.  KASAN and KCSAN do not show any issues, nor does
> the lock debugger, yet a corruption problem that disappears with a
> delay is indicative of a race somewhere.  The delay primary impacts
> how long the IRQ lock is held, if the delay is moved outside of the
> IRQ locked section the corruption returns.

This is interesting... This forces the current task to schedule out
first, rather than potentially create a new worker.

If you put that 1 ms sleep at the top of io_wq_worker(), before the
loop, does it fix it as well? With the previous one removed, of course.

Does it do anything if you just put isync() in there instead of the
sleep in io_wqe_dec_running()?
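
Roughly what I mean, as untested sketches (exact placement depends on
your tree):

	/* 1) at the top of io_wq_worker(), before the main loop */
	msleep(1);

	/* 2) in io_wqe_dec_running(), in place of the busy wait; on ppc
	 *    isync() should boil down to something like this
	 */
	__asm__ __volatile__("isync" : : : "memory");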

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:22 ` Jens Axboe
@ 2023-11-07 21:39   ` Timothy Pearson
  2023-11-07 21:46     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 21:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 3:22:16 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>> I can make the corruption disappear by adding a 1ms busy wait delay
>> into io_wqe_dec_running().  This appears to alter the timing of
>> something in the io_uring system just enough to make the (presumed)
>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>> the lock debugger, yet a corruption problem that disappears with a
>> delay is indicative of a race somewhere.  The delay primary impacts
>> how long the IRQ lock is held, if the delay is moved outside of the
>> IRQ locked section the corruption returns.
> 
> This is interesting... This forces the current task to schedule out
> first, rather than potentially create a new worker.

That makes sense -- when debugging, everything seems to center around worker creation and the exact timing / context of that operation.

> If you put that 1 ms sleep at the top of io_wq_worker(), before the
> loop, does it fix it as well? With the previous one removed, of course.

No, it does not.

> Does it do anything if you just put isync() in there instead of the
> sleep in io_wqe_dec_running()?

Unfortunately no, that doesn't fix things either.

My gut says it's something racing with either the I/O worker creation or wake (trying to schedule work onto a half-created or still-sleeping worker?), but so far I haven't been able to isolate the root cause.


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:39   ` Timothy Pearson
@ 2023-11-07 21:46     ` Jens Axboe
  2023-11-07 22:07       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 21:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 2:39 PM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>> into io_wqe_dec_running().  This appears to alter the timing of
>>> something in the io_uring system just enough to make the (presumed)
>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>> the lock debugger, yet a corruption problem that disappears with a
>>> delay is indicative of a race somewhere.  The delay primary impacts
>>> how long the IRQ lock is held, if the delay is moved outside of the
>>> IRQ locked section the corruption returns.
>>
>> This is interesting... This forces the current task to schedule out
>> first, rather than potentially create a new worker.
> 
> That makes sense -- when debugging, everything seems to center around
> worker creation and the exact timing / context of that operation.
> 
>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>> loop, does it fix it as well? With the previous one removed, of course.
> 
> No, it does not.

OK, so it's not related to delaying work handling for a new worker. What
if you add a 1ms sleep before calling wq->do_work() in
io_worker_handle_work()? For both this and your previous test, might be
worth shrinking this 1ms to something way smaller, as it's hard to know
if it's the fact that we're scheduling first that's important, or if
it's just a timing delay and hence now it's much harder to hit this
race.

>> Does it do anything if you just put isync() in there instead of the
>> sleep in io_wqe_dec_running()?
> 
> Unfortunately no, that doesn't fix things either.
> 
> My gut says it's something racing with either the I/O worker creation
> or wake (trying to schedule work onto a half-created or still sleeping
> worker?) but so far I haven't been able to isolate the root cause.

One of my initial suspicions was improper handling of signals, since
creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
But I'm not sure how likely that is, as that could/would happen normally
too if signals are being used, and would trigger on more than just
powerpc obviously.
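
For reference, the creation path is roughly the following (simplified
and from memory -- details in io_uring/io-wq.c differ a bit between
kernel versions):

	/* io_wqe_dec_running(): the last running worker is blocking and
	 * there is still queued work, so ask for a new worker
	 */
	atomic_inc(&wq->worker_refs);
	io_queue_worker_create(worker, acct, create_worker_cb);

	/* io_queue_worker_create() then queues task_work on the original
	 * task, delivered via the signal path
	 */
	init_task_work(&worker->create_work, create_worker_cb);
	task_work_add(wq->task, &worker->create_work, TWA_SIGNAL);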

The workers are good to go as soon as they are created; no setup really
needs to happen. They do a bit at the top of io_wq_worker(), but that's
before they can go and grab a work item. The worker itself has to grab
it; nothing is being handed to it.

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:46     ` Jens Axboe
@ 2023-11-07 22:07       ` Timothy Pearson
  2023-11-07 22:16         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 22:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 3:46:40 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 2:39 PM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions"
>>> <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>>> into io_wqe_dec_running().  This appears to alter the timing of
>>>> something in the io_uring system just enough to make the (presumed)
>>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>>> the lock debugger, yet a corruption problem that disappears with a
>>>> delay is indicative of a race somewhere.  The delay primary impacts
>>>> how long the IRQ lock is held, if the delay is moved outside of the
>>>> IRQ locked section the corruption returns.
>>>
>>> This is interesting... This forces the current task to schedule out
>>> first, rather than potentially create a new worker.
>> 
>> That makes sense -- when debugging, everything seems to center around
>> worker creation and the exact timing / context of that operation.
>> 
>>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>>> loop, does it fix it as well? With the previous one removed, of course.
>> 
>> No, it does not.
> 
> OK, so it's not related to delaying work handling for a new worker. What
> if you add a 1ms sleep before calling wq->do_work() in
> io_worker_handle_work()? For both this and your previous test, might be
> worth shrinking this 1ms to something way smaller, as it's hard to know
> if it's the fact that we're scheduling first that's important, or if
> it's just a timing delay and hence now it's much harder to hit this
> race.

Nope, no joy.  I had already tried shrinking the delay and the issues came back.  Ditto for switching to usleep; allowing the process to leave the core (or something else to schedule on the core) seems to bring the corruption back despite the time delay remaining in place.

>>> Does it do anything if you just put isync() in there instead of the
>>> sleep in io_wqe_dec_running()?
>> 
>> Unfortunately no, that doesn't fix things either.
>> 
>> My gut says it's something racing with either the I/O worker creation
>> or wake (trying to schedule work onto a half-created or still sleeping
>> worker?) but so far I haven't been able to isolate the root cause.
> 
> One of my initial suspicions was improper handling of signals, since
> creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
> But I'm not sure how likely that is, as that could/would happen normally
> too if signals are being used, and would trigger on more than just
> powerpc obviously.

Interestingly enough, that's where my current investigation is leading as well.  After instrumenting and re-instrumenting the codebase far more times than I'd like to admit, I've noticed that the workers don't time out on kernel builds that show the corruption; instead, they are directly terminated via signal (SIGKILL).  On kernel builds with the delay, the workers time out and self-terminate.  I'm still trying to parse out what the exact difference between these two mechanisms would be and how it plays into the corruption, but at least it's a start...


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:07       ` Timothy Pearson
@ 2023-11-07 22:16         ` Jens Axboe
  2023-11-07 22:29           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 22:16 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 3:07 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 3:46:40 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 2:39 PM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions"
>>>> <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>>>> into io_wqe_dec_running().  This appears to alter the timing of
>>>>> something in the io_uring system just enough to make the (presumed)
>>>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>>>> the lock debugger, yet a corruption problem that disappears with a
>>>>> delay is indicative of a race somewhere.  The delay primary impacts
>>>>> how long the IRQ lock is held, if the delay is moved outside of the
>>>>> IRQ locked section the corruption returns.
>>>>
>>>> This is interesting... This forces the current task to schedule out
>>>> first, rather than potentially create a new worker.
>>>
>>> That makes sense -- when debugging, everything seems to center around
>>> worker creation and the exact timing / context of that operation.
>>>
>>>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>>>> loop, does it fix it as well? With the previous one removed, of course.
>>>
>>> No, it does not.
>>
>> OK, so it's not related to delaying work handling for a new worker. What
>> if you add a 1ms sleep before calling wq->do_work() in
>> io_worker_handle_work()? For both this and your previous test, might be
>> worth shrinking this 1ms to something way smaller, as it's hard to know
>> if it's the fact that we're scheduling first that's important, or if
>> it's just a timing delay and hence now it's much harder to hit this
>> race.
> 
> Nope, no joy.  I had already tried shrinking the delay and the issues
> came back.  Ditto for switching to usleep; allowing the process to
> leave the core (or something else to schedule on the core) seems to
> bring the corruption back despite the time delay remaining in place.

Gotcha

>>>> Does it do anything if you just put isync() in there instead of the
>>>> sleep in io_wqe_dec_running()?
>>>
>>> Unfortunately no, that doesn't fix things either.
>>>
>>> My gut says it's something racing with either the I/O worker creation
>>> or wake (trying to schedule work onto a half-created or still sleeping
>>> worker?) but so far I haven't been able to isolate the root cause.
>>
>> One of my initial suspicions was improper handling of signals, since
>> creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
>> But I'm not sure how likely that is, as that could/would happen normally
>> too if signals are being used, and would trigger on more than just
>> powerpc obviously.
> 
> Interestingly enough that's where my current investigation is leading
> as well.  After instrumenting and re-instrumenting the codebase far
> more times than I'd like to admit, I've noticed that the workers don't
> time out on kernel builds that show the corruption, they are directly
> terminated via signal (SIGKILL).  On kernel builds with the delay, the
> workers time out and self-terminate.  I'm still trying to parse out
> what the exact difference between these two mechanisms would be and
> how it plays into the corruption, but at least it's a start...

Maybe poke at how they are exiting - you say timeout, so they've been
idle for a while and then go away? This would then cause worker creation
again later on if, say, we have 1 worker left and it goes to sleep. So
the timeout itself may not tell you much, outside of then causing that
other condition to happen. You could even try and shrink the timeout to
HZ / 10 or something like that to make it more likely to happen.
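
Something like this, assuming the idle timeout is still the
WORKER_IDLE_TIMEOUT constant in io_uring/io-wq.c:

	-#define WORKER_IDLE_TIMEOUT	(5 * HZ)
	+#define WORKER_IDLE_TIMEOUT	(HZ / 10)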

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:16         ` Jens Axboe
@ 2023-11-07 22:29           ` Timothy Pearson
  2023-11-07 22:44             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 22:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 4:16:51 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>> Interestingly enough that's where my current investigation is leading
>> as well.  After instrumenting and re-instrumenting the codebase far
>> more times than I'd like to admit, I've noticed that the workers don't
>> time out on kernel builds that show the corruption, they are directly
>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>> workers time out and self-terminate.  I'm still trying to parse out
>> what the exact difference between these two mechanisms would be and
>> how it plays into the corruption, but at least it's a start...
> 
> Maybe poke at how they are exiting - you say timeout, so they've been
> idle for a while and then go away? This would then cause worker creation
> again later on if, say, we have 1 worker left and it goes to sleep. So
> the timeout itself may not tell you much, outside of then causing that
> other condition to happen. You could even try and shrink the timeout to
> HZ / 10 or something like that to make it more likely to happen.

Agreed.  As of right now I can confirm that with the delay in place (no corruption) the workers are exiting on their own, with no signals and no IO_EXIT bit being set.  When I remove the delay (reintroducing the corruption) I see signal 9 being sent to the workers, and a mix of IO_EXIT being set and not being set.

Ignoring the signal 9 does not fix the corruption, which makes me wonder more about IO_EXIT and whether things are not fully committed / properly torn down when the worker thread terminates.  This also dovetails nicely with the fact that the observed write corruption always seems to be in the latter portion of the page, never at the beginning of the page, which again points to rapid / unclean termination of the writer process.

Will keep digging...


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:29           ` Timothy Pearson
@ 2023-11-07 22:44             ` Jens Axboe
  2023-11-07 23:12               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 22:44 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 3:29 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>> Interestingly enough that's where my current investigation is leading
>>> as well.  After instrumenting and re-instrumenting the codebase far
>>> more times than I'd like to admit, I've noticed that the workers don't
>>> time out on kernel builds that show the corruption, they are directly
>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>> workers time out and self-terminate.  I'm still trying to parse out
>>> what the exact difference between these two mechanisms would be and
>>> how it plays into the corruption, but at least it's a start...
>>
>> Maybe poke at how they are exiting - you say timeout, so they've been
>> idle for a while and then go away? This would then cause worker creation
>> again later on if, say, we have 1 worker left and it goes to sleep. So
>> the timeout itself may not tell you much, outside of then causing that
>> other condition to happen. You could even try and shrink the timeout to
>> HZ / 10 or something like that to make it more likely to happen.
> 
> Agreed.  As of right now I can confirm that with the delay in place
> (no corruption) the workers are exiting on their own, no signals and
> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
> corruption) I see signal 9 being sent to the workers, and a mix of
> IO_EXIT being set and not being set.
> 
> Ignoring the signal 9 does not fix the corruption, which makes me
> wonder more about IO_EXIT and whether things are not fully committed /
> properly torn down when the worker thread terminates.  This also
> dovetails nicely with the fact that the observed write corruption
> always seems to be in the latter portions of the page, never at the
> beginning of the page, also indicating rapid / unclean termination of
> the writer process.
> 
> Will keep digging...

This is useful. If the workers are exiting, they will try and process
work that is still pending. And they obviously do, or the process would
hang on exit or ring exit. But they'll also cancel said work, which
obviously did not happen with the old kthread scheme, as there was no
way to do that -- you'd just wait for it. Hence maybe what's happening
here is that mtr/mysql/mariadb isn't properly waiting for pending
writes to finish? It's just assuming that previously submitted writes
will finish if the task is killed?

What page size are you using?

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:44             ` Jens Axboe
@ 2023-11-07 23:12               ` Timothy Pearson
  2023-11-07 23:16                 ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 23:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 4:44:56 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>> Interestingly enough that's where my current investigation is leading
>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>> time out on kernel builds that show the corruption, they are directly
>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>> what the exact difference between these two mechanisms would be and
>>>> how it plays into the corruption, but at least it's a start...
>>>
>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>> idle for a while and then go away? This would then cause worker creation
>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>> the timeout itself may not tell you much, outside of then causing that
>>> other condition to happen. You could even try and shrink the timeout to
>>> HZ / 10 or something like that to make it more likely to happen.
>> 
>> Agreed.  As of right now I can confirm that with the delay in place
>> (no corruption) the workers are exiting on their own, no signals and
>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>> corruption) I see signal 9 being sent to the workers, and a mix of
>> IO_EXIT being set and not being set.
>> 
>> Ignoring the signal 9 does not fix the corruption, which makes me
>> wonder more about IO_EXIT and whether things are not fully committed /
>> properly torn down when the worker thread terminates.  This also
>> dovetails nicely with the fact that the observed write corruption
>> always seems to be in the latter portions of the page, never at the
>> beginning of the page, also indicating rapid / unclean termination of
>> the writer process.
>> 
>> Will keep digging...
> 
> This is useful. If the workers are exiting, they will try and process
> work that is still pending. And it obviously does, or the process would
> hang on exit or ring exit. But it'll also cancel said work, which
> obviously did not happen for the old kthread scheme, as there was no way
> to do that. So you'd just wait for it. Hence maybe what's happening here
> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
> finish? It's just assuming that previously submitted writes will finish
> if the task is killed?

It's entirely possible.  What is the correct way to wait for pending writes via the liburing API?  MariaDB uses liburing under the hood, and if I know the call(s) to look for I can make sure it's properly handling task exit.

> What page size are you using?

I've tested on both 4k and 64k page kernels with no difference.  MariaDB is using a 16k page size on disk, and when the corruption happens it's apparently only writing part of the 16k page.


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:12               ` Timothy Pearson
@ 2023-11-07 23:16                 ` Jens Axboe
  2023-11-07 23:34                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 23:16 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 4:12 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>> Interestingly enough that's where my current investigation is leading
>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>> time out on kernel builds that show the corruption, they are directly
>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>> what the exact difference between these two mechanisms would be and
>>>>> how it plays into the corruption, but at least it's a start...
>>>>
>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>> idle for a while and then go away? This would then cause worker creation
>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>> the timeout itself may not tell you much, outside of then causing that
>>>> other condition to happen. You could even try and shrink the timeout to
>>>> HZ / 10 or something like that to make it more likely to happen.
>>>
>>> Agreed.  As of right now I can confirm that with the delay in place
>>> (no corruption) the workers are exiting on their own, no signals and
>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>> IO_EXIT being set and not being set.
>>>
>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>> wonder more about IO_EXIT and whether things are not fully committed /
>>> properly torn down when the worker thread terminates.  This also
>>> dovetails nicely with the fact that the observed write corruption
>>> always seems to be in the latter portions of the page, never at the
>>> beginning of the page, also indicating rapid / unclean termination of
>>> the writer process.
>>>
>>> Will keep digging...
>>
>> This is useful. If the workers are exiting, they will try and process
>> work that is still pending. And it obviously does, or the process would
>> hang on exit or ring exit. But it'll also cancel said work, which
>> obviously did not happen for the old kthread scheme, as there was no way
>> to do that. So you'd just wait for it. Hence maybe what's happening here
>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>> finish? It's just assuming that previously submitted writes will finish
>> if the task is killed?
> 
> It's entirely possible.  What is the correct way to wait for pending
> writes via the liburing API?  MariaDB uses liburing under the hood,
> and if I know the call(s) to look for I can make sure it's properly
> handling task exit.

I'd expect the task to wait for and verify the results of pending
requests before doing io_uring_queue_exit(). But I'm not familiar with
the code base; maybe the task just exits? Closing the io_uring fd from
an exiting task would do the same.
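
In liburing terms I'd expect roughly something like this on the
application side before teardown (a sketch only, not mariadb's actual
code; 'inflight' is assumed to count SQEs submitted but not yet reaped):

#include <liburing.h>

static void drain_and_exit(struct io_uring *ring, unsigned inflight)
{
	while (inflight) {
		struct io_uring_cqe *cqe;

		if (io_uring_wait_cqe(ring, &cqe) < 0)
			break;
		if (cqe->res < 0) {
			/* short or failed write -- must be handled, not
			 * silently dropped
			 */
		}
		io_uring_cqe_seen(ring, cqe);
		inflight--;
	}
	io_uring_queue_exit(ring);
}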

I tried the below patch when running mtr here, but I don't see any of
them trigger. You can try it; that'll tell you if we ever run
cancelations on pending io-wq work. If it triggers, I can try and cook
up something that would figure out where that is coming from.

But since you said you're seeing exits on signal 9, that would seem to
indicate that someone ran SIGKILL with potentially pending IO.

>> What page size are you using?
> 
> I've tested on both 4k and 64k page kernels with no difference.
> MariaDB is using a 16k page size on disk, and when the corruption
> happens it's apparently only writing part of the 16k page.

OK.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..2ee18905d57e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 
+	WARN_ON_ONCE(do_kill);
+
 	do {
 		struct io_wq_work *work;
 
@@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
+		WARN_ON_ONCE(1);
 		work->flags |= IO_WQ_WORK_CANCEL;
 		wq->do_work(work);
 		work = wq->free_work(work);
@@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	 */
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 	    (work->flags & IO_WQ_WORK_CANCEL)) {
+		WARN_ON_ONCE(1);
 		io_run_cancel(work, wq);
 		return;
 	}
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ed254076c723..c0bd35e5429a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
 
 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
 	if (work->flags & IO_WQ_WORK_CANCEL) {
+		WARN_ON_ONCE(1);
 fail:
+		WARN_ON_ONCE(1);
 		io_req_task_queue_fail(req, err);
 		return;
 	}

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:16                 ` Jens Axboe
@ 2023-11-07 23:34                   ` Timothy Pearson
  2023-11-07 23:52                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 23:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 5:16:24 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 4:12 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>>> Interestingly enough that's where my current investigation is leading
>>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>>> time out on kernel builds that show the corruption, they are directly
>>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>>> what the exact difference between these two mechanisms would be and
>>>>>> how it plays into the corruption, but at least it's a start...
>>>>>
>>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>>> idle for a while and then go away? This would then cause worker creation
>>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>>> the timeout itself may not tell you much, outside of then causing that
>>>>> other condition to happen. You could even try and shrink the timeout to
>>>>> HZ / 10 or something like that to make it more likely to happen.
>>>>
>>>> Agreed.  As of right now I can confirm that with the delay in place
>>>> (no corruption) the workers are exiting on their own, no signals and
>>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>>> IO_EXIT being set and not being set.
>>>>
>>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>>> wonder more about IO_EXIT and whether things are not fully committed /
>>>> properly torn down when the worker thread terminates.  This also
>>>> dovetails nicely with the fact that the observed write corruption
>>>> always seems to be in the latter portions of the page, never at the
>>>> beginning of the page, also indicating rapid / unclean termination of
>>>> the writer process.
>>>>
>>>> Will keep digging...
>>>
>>> This is useful. If the workers are exiting, they will try and process
>>> work that is still pending. And it obviously does, or the process would
>>> hang on exit or ring exit. But it'll also cancel said work, which
>>> obviously did not happen for the old kthread scheme, as there was no way
>>> to do that. So you'd just wait for it. Hence maybe what's happening here
>>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>>> finish? It's just assuming that previously submitted writes will finish
>>> if the task is killed?
>> 
>> It's entirely possible.  What is the correct way to wait for pending
>> writes via the liburing API?  MariaDB uses liburing under the hood,
>> and if I know the call(s) to look for I can make sure it's properly
>> handling task exit.
> 
> I'd expect the task to wait and verify the results of pending requests
> before doing io_uring_queue_exit(). But I'm not familiar with the code
> base, maybe the task just exits? Closing the io_uring fd from an exiting
> task would do the same.
> 
> I tried the below patch when running mtr here, but don't see any of them
> trigger. You can try that, that'll tell you if we ever run cancelations
> on pending io-wq work. If that triggers, I can try and cook up something
> that would figure out where that is coming from.

Doesn't trigger on my end either.  At least we know that's not the problem now.

> But since you said you're seeing exits on signal 9, that would seem to
> indicate that someone ran SIGKILL with potentially pending IO.

Just to make sure we're talking about the same thing: the signal 9 I mentioned comes from instrumentation I added around the get_signal(&ksig) call -- I am seeing 9 in the ksig.sig field.  I assume the SIGKILL is not expected, i.e. it is not coming from another location in the io_uring kernel code?
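
The instrumentation is nothing fancy -- roughly this kind of thing in
the ppc do_signal() path (debug only, the message format here is just
illustrative):

	get_signal(&ksig);
	if (ksig.sig)
		pr_info("io-wq dbg: pid %d comm %s sig %d\n",
			current->pid, current->comm, ksig.sig);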


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:34                   ` Timothy Pearson
@ 2023-11-07 23:52                     ` Jens Axboe
  2023-11-08  0:02                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 23:52 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 4:34 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 4:12 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>>>> Interestingly enough that's where my current investigation is leading
>>>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>>>> time out on kernel builds that show the corruption, they are directly
>>>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>>>> what the exact difference between these two mechanisms would be and
>>>>>>> how it plays into the corruption, but at least it's a start...
>>>>>>
>>>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>>>> idle for a while and then go away? This would then cause worker creation
>>>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>>>> the timeout itself may not tell you much, outside of then causing that
>>>>>> other condition to happen. You could even try and shrink the timeout to
>>>>>> HZ / 10 or something like that to make it more likely to happen.
>>>>>
>>>>> Agreed.  As of right now I can confirm that with the delay in place
>>>>> (no corruption) the workers are exiting on their own, no signals and
>>>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>>>> IO_EXIT being set and not being set.
>>>>>
>>>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>>>> wonder more about IO_EXIT and whether things are not fully committed /
>>>>> properly torn down when the worker thread terminates.  This also
>>>>> dovetails nicely with the fact that the observed write corruption
>>>>> always seems to be in the latter portions of the page, never at the
>>>>> beginning of the page, also indicating rapid / unclean termination of
>>>>> the writer process.
>>>>>
>>>>> Will keep digging...
>>>>
>>>> This is useful. If the workers are exiting, they will try and process
>>>> work that is still pending. And it obviously does, or the process would
>>>> hang on exit or ring exit. But it'll also cancel said work, which
>>>> obviously did not happen for the old kthread scheme, as there was no way
>>>> to do that. So you'd just wait for it. Hence maybe what's happening here
>>>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>>>> finish? It's just assuming that previously submitted writes will finish
>>>> if the task is killed?
>>>
>>> It's entirely possible.  What is the correct way to wait for pending
>>> writes via the liburing API?  MariaDB uses liburing under the hood,
>>> and if I know the call(s) to look for I can make sure it's properly
>>> handling task exit.
>>
>> I'd expect the task to wait and verify the results of pending requests
>> before doing io_uring_queue_exit(). But I'm not familiar with the code
>> base, maybe the task just exits? Closing the io_uring fd from an exiting
>> task would do the same.
>>
>> I tried the below patch when running mtr here, but don't see any of them
>> trigger. You can try that, that'll tell you if we ever run cancelations
>> on pending io-wq work. If that triggers, I can try and cook up something
>> that would figure out where that is coming from.
> 
> Doesn't trigger on my end either.  At least we know that's not the
> problem now.

Indeed

>> But since you said you're seeing exits on signal 9, that would seem to
>> indicate that someone ran SIGKILL with potentially pending IO.
> 
> Just to make sure we're talking about the same thing, when I refer to
> signal 9 I am referring to instrumentation I added to the
> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
> assume the SIGKILL is not expected, i.e. not coming from another
> location in the io_uring kernel code?

ksig.sig is set to signr, which is the number of the signal. So that
would seem to indicate that _someone_ is sending SIGKILL. But at the
same time, you don't see any of the cancel-on-exit paths triggering.
Puzzling! Neither io_uring nor io-wq sends any signals.

In your instrumentation, are you checking where the signal is coming
from? Is it being dequeued as an actual signal, or is it some other
condition in get_signal() that ends up setting it to SIGKILL?

I don't expect the below to do anything as it _seems_ it's correct
as-is, but...

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index aa17e62f3754..bfec1e95b362 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -244,16 +244,17 @@ static void do_signal(struct task_struct *tsk)
 {
 	sigset_t *oldset = sigmask_to_save();
 	struct ksignal ksig = { .sig = 0 };
+	bool got_signal;
 	int ret;
 
 	BUG_ON(tsk != current);
 
-	get_signal(&ksig);
+	got_signal = get_signal(&ksig);
 
 	/* Is there any syscall restart business here ? */
 	check_syscall_restart(tsk->thread.regs, &ksig.ka, ksig.sig > 0);
 
-	if (ksig.sig <= 0) {
+	if (!got_signal) {
 		/* No signal to deliver -- put the saved sigmask back */
 		restore_saved_sigmask();
 		set_trap_norestart(tsk->thread.regs);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:52                     ` Jens Axboe
@ 2023-11-08  0:02                       ` Timothy Pearson
  2023-11-08  0:09                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  0:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 5:52:54 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> But since you said you're seeing exits on signal 9, that would seem to
>>> indicate that someone ran SIGKILL with potentially pending IO.
>> 
>> Just to make sure we're talking about the same thing, when I refer to
>> signal 9 I am referring to instrumentation I added to the
>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>> assume the SIGKILL is not expected, i.e. not coming from another
>> location in the io_uring kernel code?
> 
> ksig.sig is set to signr, which is the number of the signal. So that
> would seem to indicate that _someone_ is sending SIGKILL. But at the
> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
> io_uring or io-wq doesn't send any signals.
> 
> In your instrumentation, are you checking where the signal is coming
> from? Is it being dequeued as an actual signal, or is it some other
> condition in get_signal() that ends up setting it to SIGKILL?

I still need to check this in more detail.  What I do know at the moment is that the kill seems to be related to the entire workqueue going down, i.e. it is sent to the workers of a given workqueue after io_wq_put_and_exit() and io_wq_exit_workers() are called for that workqueue.  Not sure if that helps any, will keep digging...

> I don't expect the below to do anything as it _seems_ it's correct
> as-is, but...

<snip>

No change.  As you said, it's correct as-is. :)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:02                       ` Timothy Pearson
@ 2023-11-08  0:09                         ` Jens Axboe
  2023-11-08  3:27                           ` Timothy Pearson
  2023-11-08  4:00                           ` Timothy Pearson
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08  0:09 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 5:02 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>
>>> Just to make sure we're talking about the same thing, when I refer to
>>> signal 9 I am referring to instrumentation I added to the
>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>> assume the SIGKILL is not expected, i.e. not coming from another
>>> location in the io_uring kernel code?
>>
>> ksig.sig is set to signr, which is the number of the signal. So that
>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>> io_uring or io-wq doesn't send any signals.
>>
>> In your instrumentation, are you checking where the signal is coming
>> from? Is it being dequeued as an actual signal, or is it some other
>> condition in get_signal() that ends up setting it to SIGKILL?
> 
> I still need to check this in more detail.  What I do know at the
> moment is that the kill seems to be related to the entire workqueue
> going down, i.e. it is sent to the workers of a given workqueue after
> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
> workqueue.  Not sure if that helps any, will keep digging...

Another option is that it's doing exec() with pending IO, which will
cancel it. But I'm also assuming that mtr/friends will check result
values of writes and would've caught that there.
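
For completeness, "checking result values" on the liburing side would look
roughly like the sketch below: reap every outstanding CQE and verify res
before tearing the ring down.  This is a generic illustration of the expected
pattern, not MariaDB's actual code; the function name and the inflight counter
are assumptions:

	#include <liburing.h>
	#include <stdio.h>

	/* Reap and verify 'inflight' completions before io_uring_queue_exit(). */
	static int drain_and_check(struct io_uring *ring, unsigned inflight)
	{
		struct io_uring_cqe *cqe;
		int ret, errors = 0;

		while (inflight--) {
			ret = io_uring_wait_cqe(ring, &cqe);
			if (ret < 0)
				return ret;
			/* res < 0 is an error (e.g. -ECANCELED); a short write
			 * shows up as res smaller than the requested length. */
			if (cqe->res < 0) {
				fprintf(stderr, "write failed: %d\n", cqe->res);
				errors++;
			}
			io_uring_cqe_seen(ring, cqe);
		}
		return errors;
	}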

>> I don't expect the below to do anything as it _seems_ it's correct
>> as-is, but...
> 
> <snip>
> 
> No change.  As you said, it's correct as-is. :)

I thought so, but try and not take anything for granted at this point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:09                         ` Jens Axboe
@ 2023-11-08  3:27                           ` Timothy Pearson
  2023-11-08  3:30                             ` Timothy Pearson
  2023-11-08  4:00                           ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  3:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 6:09:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>
>>>> Just to make sure we're talking about the same thing, when I refer to
>>>> signal 9 I am referring to instrumentation I added to the
>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>> location in the io_uring kernel code?
>>>
>>> ksig.sig is set to signr, which is the number of the signal. So that
>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>> io_uring or io-wq doesn't send any signals.
>>>
>>> In your instrumentation, are you checking where the signal is coming
>>> from? Is it being dequeued as an actual signal, or is it some other
>>> condition in get_signal() that ends up setting it to SIGKILL?
>> 
>> I still need to check this in more detail.  What I do know at the
>> moment is that the kill seems to be related to the entire workqueue
>> going down, i.e. it is sent to the workers of a given workqueue after
>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>> workqueue.  Not sure if that helps any, will keep digging...
> 
> Another option is that it's doing exec() with pending IO, which will
> cancel it. But I'm also assuming that mtr/friends will check result
> values of writes and would've caught that there.

So I ran it a few more times and we do actually hit the WARN_ON_ONCE(do_kill) in io_worker_handle_work().  It's possible we hit it much earlier in the boot process and I simply missed it, since it's only printed once per boot.  If you have a potential patch or avenue of testing based on that information, I'm happy to try it.

The PID and code of the sending process are both 0 (i.e. SI_USER), assuming get_signal() is actually populating those fields.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  3:27                           ` Timothy Pearson
@ 2023-11-08  3:30                             ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  3:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineeringinc.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 9:27:13 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 6:09:39 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>>
>>>>> Just to make sure we're talking about the same thing, when I refer to
>>>>> signal 9 I am referring to instrumentation I added to the
>>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>>> location in the io_uring kernel code?
>>>>
>>>> ksig.sig is set to signr, which is the number of the signal. So that
>>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>>> io_uring or io-wq doesn't send any signals.
>>>>
>>>> In your instrumentation, are you checking where the signal is coming
>>>> from? Is it being dequeued as an actual signal, or is it some other
>>>> condition in get_signal() that ends up setting it to SIGKILL?
>>> 
>>> I still need to check this in more detail.  What I do know at the
>>> moment is that the kill seems to be related to the entire workqueue
>>> going down, i.e. it is sent to the workers of a given workqueue after
>>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>>> workqueue.  Not sure if that helps any, will keep digging...
>> 
>> Another option is that it's doing exec() with pending IO, which will
>> cancel it. But I'm also assuming that mtr/friends will check result
>> values of writes and would've caught that there.
> 
> So I ran it a few more times and we do actually hit the WARN_ON_ONCE(do_kill) in
> io_worker_handle_work().  It's possible we hit it much earlier in the boot
> process before and I simply missed it, since it's only printed once per boot.
> If you have a potential patch or avenue of testing based on that information
> I'm happy to try it.

Actually, I take that back.  I had other debugging in place that tripped that condition.  Investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:09                         ` Jens Axboe
  2023-11-08  3:27                           ` Timothy Pearson
@ 2023-11-08  4:00                           ` Timothy Pearson
  2023-11-08 15:10                             ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  4:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 6:09:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>
>>>> Just to make sure we're talking about the same thing, when I refer to
>>>> signal 9 I am referring to instrumentation I added to the
>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>> location in the io_uring kernel code?
>>>
>>> ksig.sig is set to signr, which is the number of the signal. So that
>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>> io_uring or io-wq doesn't send any signals.
>>>
>>> In your instrumentation, are you checking where the signal is coming
>>> from? Is it being dequeued as an actual signal, or is it some other
>>> condition in get_signal() that ends up setting it to SIGKILL?
>> 
>> I still need to check this in more detail.  What I do know at the
>> moment is that the kill seems to be related to the entire workqueue
>> going down, i.e. it is sent to the workers of a given workqueue after
>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>> workqueue.  Not sure if that helps any, will keep digging...
> 
> Another option is that it's doing exec() with pending IO, which will
> cancel it. But I'm also assuming that mtr/friends will check result
> values of writes and would've caught that there.

Here's something potentially more useful, the stack trace of what's calling the workqueue exit function:

[   35.074923] Call Trace:
[   35.074939] [c000000007183890] [c0000000005c0804] io_wq_exit_workers+0x48/0x214 (unreliable)
[   35.074994] [c000000007183920] [c0000000005c0410] io_wq_put+0xa0/0x27c
[   35.075034] [c0000000071839f0] [c0000000005a92a8] io_uring_clean_tctx+0x98/0xe0
[   35.075082] [c000000007183a30] [c0000000005bdc78] __io_uring_files_cancel+0x4f8/0x580
[   35.075129] [c000000007183b20] [c000000000153fc8] do_exit+0x1c8/0xce0
[   35.075169] [c000000007183bf0] [c000000000154c58] do_group_exit+0x108/0x110
[   35.075209] [c000000007183c30] [c00000000016ac6c] get_signal+0xbfc/0xce0
[   35.075251] [c000000007183d10] [c00000000001ed50] do_notify_resume+0xd0/0x420
[   35.075298] [c000000007183dc0] [c00000000002dbbc] syscall_exit_prepare+0x15c/0x360
[   35.075348] [c000000007183e10] [c00000000000cf74] system_call_vectored_common+0xf4/0x260

This call is associated with the mariadb userspace PID:

[   35.074242] CPU: 12 PID: 1260 Comm: mariadbd

To me, it almost looks like the mariadb I/O worker thread is getting terminated via signal and that this termination is taking out the write before it can complete.  Is that even possible?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  4:00                           ` Timothy Pearson
@ 2023-11-08 15:10                             ` Jens Axboe
  2023-11-08 15:14                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 15:10 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 9:00 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 6:09:39 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>>
>>>>> Just to make sure we're talking about the same thing, when I refer to
>>>>> signal 9 I am referring to instrumentation I added to the
>>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>>> location in the io_uring kernel code?
>>>>
>>>> ksig.sig is set to signr, which is the number of the signal. So that
>>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>>> io_uring or io-wq doesn't send any signals.
>>>>
>>>> In your instrumentation, are you checking where the signal is coming
>>>> from? Is it being dequeued as an actual signal, or is it some other
>>>> condition in get_signal() that ends up setting it to SIGKILL?
>>>
>>> I still need to check this in more detail.  What I do know at the
>>> moment is that the kill seems to be related to the entire workqueue
>>> going down, i.e. it is sent to the workers of a given workqueue after
>>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>>> workqueue.  Not sure if that helps any, will keep digging...
>>
>> Another option is that it's doing exec() with pending IO, which will
>> cancel it. But I'm also assuming that mtr/friends will check result
>> values of writes and would've caught that there.
> 
> Here's something potentially more useful, the stack trace of what's
> calling the workqueue exit function:
> 
> [   35.074923] Call Trace:
> [   35.074939] [c000000007183890] [c0000000005c0804] io_wq_exit_workers+0x48/0x214 (unreliable)
> [   35.074994] [c000000007183920] [c0000000005c0410] io_wq_put+0xa0/0x27c
> [   35.075034] [c0000000071839f0] [c0000000005a92a8] io_uring_clean_tctx+0x98/0xe0
> [   35.075082] [c000000007183a30] [c0000000005bdc78] __io_uring_files_cancel+0x4f8/0x580
> [   35.075129] [c000000007183b20] [c000000000153fc8] do_exit+0x1c8/0xce0
> [   35.075169] [c000000007183bf0] [c000000000154c58] do_group_exit+0x108/0x110
> [   35.075209] [c000000007183c30] [c00000000016ac6c] get_signal+0xbfc/0xce0
> [   35.075251] [c000000007183d10] [c00000000001ed50] do_notify_resume+0xd0/0x420
> [   35.075298] [c000000007183dc0] [c00000000002dbbc] syscall_exit_prepare+0x15c/0x360
> [   35.075348] [c000000007183e10] [c00000000000cf74] system_call_vectored_common+0xf4/0x260

The task owning (or using) the ring got sent a signal, and it's now
exiting, which entails canceling pending IO operations too.

> This call is associated with the mariadb userspace PID:
> 
> [   35.074242] CPU: 12 PID: 1260 Comm: mariadbd
> 
> To me, it almost looks like the mariadb I/O worker thread is getting
> terminated via signal and that this termination is taking out the
> write before it can complete.  Is that even possible?

If a write has been started, then it should finish. For storage IO,
there's no way to cancel a write that's already in progress. But async
writes queue up, and if an io-wq worker hasn't retrieved it yet, it will
be found and canceled, and hence never make it to stable media. But this
should also have triggered the cancelation WARN_ON() stuff I added in
that debug patch, but not sure if you are running with that as well.

Might be useful to add some debugging around where signals are sent,
rather than retrieved. But seems like we'd already know it's SIGKILL per
previous debugging.
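
As a sketch of what that sender-side debugging could look like (illustrative
only -- the hook point in do_send_sig_info() in kernel/signal.c and the comm
match are assumptions, not something tested here):

	/* debug only: log SIGKILLs aimed at the mariadb processes */
	if (sig == SIGKILL && !strncmp(p->comm, "mariadb", 7))
		pr_info("SIGKILL: %s[%d] -> %s[%d]\n",
			current->comm, task_pid_nr(current),
			p->comm, task_pid_nr(p));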

It could also be a task that has pending IO and is doing exec() (and
friends), this would also cancel inflight IO.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 15:10                             ` Jens Axboe
@ 2023-11-08 15:14                               ` Jens Axboe
  2023-11-08 17:10                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 15:14 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 8:10 AM, Jens Axboe wrote:
> It could also be a task that has pending IO and is doing exec() (and
> friends), this would also cancel inflight IO.

If this is the case, then you could try with this one to just disable
that and see if the corruption goes away:

diff --git a/fs/exec.c b/fs/exec.c
index 4aa19b24f281..7359a85b96ee 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,7 +1266,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
+#if 0
 	io_uring_task_cancel();
+#endif
 
 	/* Ensure the files table is not shared. */
 	retval = unshare_files();

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 15:14                               ` Jens Axboe
@ 2023-11-08 17:10                                 ` Timothy Pearson
  2023-11-08 17:26                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 17:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 9:14:55 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 8:10 AM, Jens Axboe wrote:
>> It could also be a task that has pending IO and is doing exec() (and
>> friends), this would also cancel inflight IO.
> 
> If this is the case, then you could try with this one to just disable
> that and see if the corruption goes away:

Unfortunately that had no effect on the corruption.  I've also traced the signal generation into the get_signal() call, which is apparently sending SIGKILL when the thread group is marked for termination -- this is in turn why the PID fields etc. are all zero.

Investigation continues.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:10                                 ` Timothy Pearson
@ 2023-11-08 17:26                                   ` Jens Axboe
  2023-11-08 17:40                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:26 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:10 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>> It could also be a task that has pending IO and is doing exec() (and
>>> friends), this would also cancel inflight IO.
>>
>> If this is the case, then you could try with this one to just disable
>> that and see if the corruption goes away:
> 
> Unfortunately that had no effect on the corruption.  I've also traced
> the signal generation into the get_signal() call, which is apparently
> sending SIGKILL when the thread group is marked for termination --
> this is in turn why the PID fields etc. are all zero.

That's good news though, because I'm continually pondering why powerpc
is different here.

> Investigation continues.

If it's not exec, then it has to be a signal. I'm assuming you're
hitting this in get_signal():

		/* Has this task already been marked for death? */
		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
		     signal->group_exec_task) {
			clear_siginfo(&ksig->info);
			ksig->info.si_signo = signr = SIGKILL;
			sigdelset(&current->pending.signal, SIGKILL);
			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
				&sighand->action[SIGKILL - 1]);
			recalc_sigpending();
			goto fatal;
		}

which is triggered either by exec (which we've verified it is not) or by
a fatal signal, so I don't see anything other than a signal sent to
mtr/mariadb for exit.

Does this trigger? Doesn't necessarily indicate a bug as it would be
valid, but if it does trigger, perhaps io-wq has unstarted requests at
this point and they get canceled and hence never written. If this does
trigger, maybe try and do your sleep trick there too and see if that
gets rid of it.


diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..53e4c3324672 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1011,6 +1013,8 @@ do_group_exit(int exit_code)
 		else if (sig->group_exec_task)
 			exit_code = 0;
 		else {
+			if (current->io_uring)
+				WARN_ON(!strncmp(current->comm, "mariadbd", 8));
 			sig->group_exit_code = exit_code;
 			sig->flags = SIGNAL_GROUP_EXIT;
 			zap_other_threads(current);
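
For concreteness, the sleep trick in that spot could be as simple as a
debug-only delay next to the WARN_ON above (a sketch only -- mdelay() and the
exact placement are assumptions, and <linux/delay.h> would be needed):

		else {
			if (current->io_uring)
				WARN_ON(!strncmp(current->comm, "mariadbd", 8));
			mdelay(1);	/* debug: perturb timing before the group exit */
			sig->group_exit_code = exit_code;
			sig->flags = SIGNAL_GROUP_EXIT;
			zap_other_threads(current);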

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:26                                   ` Jens Axboe
@ 2023-11-08 17:40                                     ` Timothy Pearson
  2023-11-08 17:49                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 17:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 11:26:53 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>> It could also be a task that has pending IO and is doing exec() (and
>>>> friends), this would also cancel inflight IO.
>>>
>>> If this is the case, then you could try with this one to just disable
>>> that and see if the corruption goes away:
>> 
>> Unfortunately that had no effect on the corruption.  I've also traced
>> the signal generation into the get_signal() call, which is apparently
>> sending SIGKILL when the thread group is marked for termination --
>> this is in turn why the PID fields etc. are all zero.
> 
> That's good news though, because I'm continually pondering why powerpc
> is different here.
> 
>> Investigation continues.
> 
> If it's not exec, then it has to be a signal. I'm assuming you're
> hitting this in get_signal():
> 
>		/* Has this task already been marked for death? */
>		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>		     signal->group_exec_task) {
>			clear_siginfo(&ksig->info);
>			ksig->info.si_signo = signr = SIGKILL;
>			sigdelset(&current->pending.signal, SIGKILL);
>			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>				&sighand->action[SIGKILL - 1]);
>			recalc_sigpending();
>			goto fatal;
>		}
> 
> which is either exec (which we verified it is not), so I don't see
> anything other than this being a signal sent to mtr/mariadb for exit.
> 
> Does this trigger? Doesn't necessarily indicate a bug as it would be
> valid, but if it does trigger, perhaps io-wq has unstarted requests at
> this point and they get canceled and hence never written. If this does
> trigger, maybe try and do your sleep trick there too and see if that
> gets rid of it.

Yes, it does indeed trigger.  Is there a way to directly check for the unstarted requests?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:40                                     ` Timothy Pearson
@ 2023-11-08 17:49                                       ` Jens Axboe
  2023-11-08 17:57                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:49 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:40 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>> friends), this would also cancel inflight IO.
>>>>
>>>> If this is the case, then you could try with this one to just disable
>>>> that and see if the corruption goes away:
>>>
>>> Unfortunately that had no effect on the corruption.  I've also traced
>>> the signal generation into the get_signal() call, which is apparently
>>> sending SIGKILL when the thread group is marked for termination --
>>> this is in turn why the PID fields etc. are all zero.
>>
>> That's good news though, because I'm continually pondering why powerpc
>> is different here.
>>
>>> Investigation continues.
>>
>> If it's not exec, then it has to be a signal. I'm assuming you're
>> hitting this in get_signal():
>>
>> 		/* Has this task already been marked for death? */
>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>> 		     signal->group_exec_task) {
>> 			clear_siginfo(&ksig->info);
>> 			ksig->info.si_signo = signr = SIGKILL;
>> 			sigdelset(&current->pending.signal, SIGKILL);
>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>> 				&sighand->action[SIGKILL - 1]);
>> 			recalc_sigpending();
>> 			goto fatal;
>> 		}
>>
>> which is either exec (which we verified it is not), so I don't see
>> anything other than this being a signal sent to mtr/mariadb for exit.
>>
>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>> this point and they get canceled and hence never written. If this does
>> trigger, maybe try and do your sleep trick there too and see if that
>> gets rid of it.
> 
> Yes, it does indeed trigger.  Is there a way to directly check for the
> unstarted requests?

Let me hack up a debug patch for this, give me a minute.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:49                                       ` Jens Axboe
@ 2023-11-08 17:57                                         ` Jens Axboe
  2023-11-08 18:36                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:49 AM, Jens Axboe wrote:
> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>> friends), this would also cancel inflight IO.
>>>>>
>>>>> If this is the case, then you could try with this one to just disable
>>>>> that and see if the corruption goes away:
>>>>
>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>> the signal generation into the get_signal() call, which is apparently
>>>> sending SIGKILL when the thread group is marked for termination --
>>>> this is in turn why the PID fields etc. are all zero.
>>>
>>> That's good news though, because I'm continually pondering why powerpc
>>> is different here.
>>>
>>>> Investigation continues.
>>>
>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>> hitting this in get_signal():
>>>
>>> 		/* Has this task already been marked for death? */
>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>> 		     signal->group_exec_task) {
>>> 			clear_siginfo(&ksig->info);
>>> 			ksig->info.si_signo = signr = SIGKILL;
>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>> 				&sighand->action[SIGKILL - 1]);
>>> 			recalc_sigpending();
>>> 			goto fatal;
>>> 		}
>>>
>>> which is either exec (which we verified it is not), so I don't see
>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>
>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>> this point and they get canceled and hence never written. If this does
>>> trigger, maybe try and do your sleep trick there too and see if that
>>> gets rid of it.
>>
>> Yes, it does indeed trigger.  Is there a way to directly check for the
>> unstarted requests?
> 
> Let me hack up a debug patch for this, give me a minute.

This should do it - whenever this condition hits, you should see
something ala:

[   97.960877] io_wq_dump: work_items=0, cur=0, next=0

in dmesg. work_items is the number of work items we found that haven't
been scheduled yet. cur is what a worker is currently processing, and
next is basically a way for cancel to find a work item before it gets
assigned. work_items and next may get canceled; cur should always finish
for storage IO, since signals don't interrupt it.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..643b8e9de518 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 
+	WARN_ON_ONCE(do_kill);
+
 	do {
 		struct io_wq_work *work;
 
@@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
+		WARN_ON_ONCE(1);
 		work->flags |= IO_WQ_WORK_CANCEL;
 		wq->do_work(work);
 		work = wq->free_work(work);
@@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	 */
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 	    (work->flags & IO_WQ_WORK_CANCEL)) {
+		WARN_ON_ONCE(1);
 		io_run_cancel(work, wq);
 		return;
 	}
@@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
 	return 0;
 }
 
+struct worker_lookup {
+	int cur_work;
+	int next_work;
+};
+
+static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
+{
+	struct worker_lookup *l = data;
+
+	raw_spin_lock(&worker->lock);
+	if (worker->cur_work)
+		l->cur_work++;
+	if (worker->next_work)
+		l->next_work++;
+	raw_spin_unlock(&worker->lock);
+	return false;
+}
+
+void io_wq_dump(struct io_uring_task *tctx)
+{
+	struct io_wq_work_node *node, *prev;
+	struct io_wq *wq = tctx->io_wq;
+	struct worker_lookup l = { };
+	int i, work_items;
+
+	if (!wq) {
+		printk("%s: no wq\n", __FUNCTION__);
+		return;
+	}
+
+	work_items = 0;
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
+
+		raw_spin_lock(&acct->lock);
+		wq_list_for_each(node, prev, &acct->work_list)
+			work_items++;
+		raw_spin_unlock(&acct->lock);
+	}
+
+	rcu_read_lock();
+	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
+	rcu_read_unlock();
+
+	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
+							l.cur_work, l.next_work);
+}
+
 static __init int io_wq_init(void)
 {
 	int ret;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ed254076c723..c0bd35e5429a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
 
 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
 	if (work->flags & IO_WQ_WORK_CANCEL) {
+		WARN_ON_ONCE(1);
 fail:
+		WARN_ON_ONCE(1);
 		io_req_task_queue_fail(req, err);
 		return;
 	}
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..250ae820340c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
 	do_exit((error_code&0xff)<<8);
 }
 
+void io_wq_dump(struct io_uring_task *);
+
 /*
  * Take down every thread in the group.  This is called by fatal signals
  * as well as by sys_exit_group (below).
@@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
 		else if (sig->group_exec_task)
 			exit_code = 0;
 		else {
+			if (!strncmp(current->comm, "mariadbd", 8) &&
+			    current->io_uring)
+				io_wq_dump(current->io_uring);
 			sig->group_exit_code = exit_code;
 			sig->flags = SIGNAL_GROUP_EXIT;
 			zap_other_threads(current);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:57                                         ` Jens Axboe
@ 2023-11-08 18:36                                           ` Timothy Pearson
  2023-11-08 18:51                                             ` Timothy Pearson
  2023-11-08 19:06                                             ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 11:57:59 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 10:49 AM, Jens Axboe wrote:
>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>> friends), this would also cancel inflight IO.
>>>>>>
>>>>>> If this is the case, then you could try with this one to just disable
>>>>>> that and see if the corruption goes away:
>>>>>
>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>> the signal generation into the get_signal() call, which is apparently
>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>> this is in turn why the PID fields etc. are all zero.
>>>>
>>>> That's good news though, because I'm continually pondering why powerpc
>>>> is different here.
>>>>
>>>>> Investigation continues.
>>>>
>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>> hitting this in get_signal():
>>>>
>>>> 		/* Has this task already been marked for death? */
>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>> 		     signal->group_exec_task) {
>>>> 			clear_siginfo(&ksig->info);
>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>> 				&sighand->action[SIGKILL - 1]);
>>>> 			recalc_sigpending();
>>>> 			goto fatal;
>>>> 		}
>>>>
>>>> which is either exec (which we verified it is not), so I don't see
>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>
>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>> this point and they get canceled and hence never written. If this does
>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>> gets rid of it.
>>>
>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>> unstarted requests?
>> 
>> Let me hack up a debug patch for this, give me a minute.
> 
> This should do it - whenever this condition hits, you should see
> something ala:
> 
> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
> 
> in dmesg. work_items is the number of work items we found that haven't
> been scheduled yet. cur is what a worker is currently processing, and
> next is basically a way for cancel to find a work item before it gets
> assigned. work_items and next may get canceled, work_items should always
> finish for storage IO, since signals don't interrupt them.
> 
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..643b8e9de518 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
> 	struct io_wq *wq = worker->wq;
> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
> 
> +	WARN_ON_ONCE(do_kill);
> +
> 	do {
> 		struct io_wq_work *work;
> 
> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void
> *data)
> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
> {
> 	do {
> +		WARN_ON_ONCE(1);
> 		work->flags |= IO_WQ_WORK_CANCEL;
> 		wq->do_work(work);
> 		work = wq->free_work(work);
> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work
> *work)
> 	 */
> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
> +		WARN_ON_ONCE(1);
> 		io_run_cancel(work, wq);
> 		return;
> 	}
> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
> 	return 0;
> }
> 
> +struct worker_lookup {
> +	int cur_work;
> +	int next_work;
> +};
> +
> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
> +{
> +	struct worker_lookup *l = data;
> +
> +	raw_spin_lock(&worker->lock);
> +	if (worker->cur_work)
> +		l->cur_work++;
> +	if (worker->next_work)
> +		l->next_work++;
> +	raw_spin_unlock(&worker->lock);
> +	return false;
> +}
> +
> +void io_wq_dump(struct io_uring_task *tctx)
> +{
> +	struct io_wq_work_node *node, *prev;
> +	struct io_wq *wq = tctx->io_wq;
> +	struct worker_lookup l = { };
> +	int i, work_items;
> +
> +	if (!wq) {
> +		printk("%s: no wq\n", __FUNCTION__);
> +		return;
> +	}
> +
> +	work_items = 0;
> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
> +
> +		raw_spin_lock(&acct->lock);
> +		wq_list_for_each(node, prev, &acct->work_list)
> +			work_items++;
> +		raw_spin_unlock(&acct->lock);
> +	}
> +
> +	rcu_read_lock();
> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
> +	rcu_read_unlock();
> +
> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
> +							l.cur_work, l.next_work);
> +}
> +
> static __init int io_wq_init(void)
> {
> 	int ret;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index ed254076c723..c0bd35e5429a 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
> 
> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
> 	if (work->flags & IO_WQ_WORK_CANCEL) {
> +		WARN_ON_ONCE(1);
> fail:
> +		WARN_ON_ONCE(1);
> 		io_req_task_queue_fail(req, err);
> 		return;
> 	}
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ee9f43bed49a..250ae820340c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
> 	do_exit((error_code&0xff)<<8);
> }
> 
> +void io_wq_dump(struct io_uring_task *);
> +
> /*
>  * Take down every thread in the group.  This is called by fatal signals
>  * as well as by sys_exit_group (below).
> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
> 		else if (sig->group_exec_task)
> 			exit_code = 0;
> 		else {
> +			if (!strncmp(current->comm, "mariadbd", 8) &&
> +			    current->io_uring)
> +				io_wq_dump(current->io_uring);
> 			sig->group_exit_code = exit_code;
> 			sig->flags = SIGNAL_GROUP_EXIT;
> 			zap_other_threads(current);

Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit of a red herring.

I have been giving some thought to the CPU pinning of the workers; one thing that may have been overlooked is that pinning could force-serialize worker operations.  Did you just have to pin the io workers, or did the workqueue also need to be pinned for the corruption to disappear?
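
For reference, the userspace-visible way to restrict a ring's io-wq workers is
IORING_REGISTER_IOWQ_AFF, exposed by liburing as io_uring_register_iowq_aff().
A minimal sketch (illustrative only, not what was used in the earlier pinning
experiment) confining one ring's workers to CPU 0:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <liburing.h>

	/* Restrict this ring's io-wq workers to CPU 0. */
	static int pin_iowq_to_cpu0(struct io_uring *ring)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(0, &mask);
		return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
	}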

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:36                                           ` Timothy Pearson
@ 2023-11-08 18:51                                             ` Timothy Pearson
  2023-11-08 19:08                                               ` Jens Axboe
  2023-11-08 19:06                                             ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 18:51 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 12:36:01 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>
>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>> that and see if the corruption goes away:
>>>>>>
>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>
>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>> is different here.
>>>>>
>>>>>> Investigation continues.
>>>>>
>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>> hitting this in get_signal():
>>>>>
>>>>> 		/* Has this task already been marked for death? */
>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>> 		     signal->group_exec_task) {
>>>>> 			clear_siginfo(&ksig->info);
>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>> 			recalc_sigpending();
>>>>> 			goto fatal;
>>>>> 		}
>>>>>
>>>>> which is either exec (which we verified it is not), so I don't see
>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>
>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>> this point and they get canceled and hence never written. If this does
>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>> gets rid of it.
>>>>
>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>> unstarted requests?
>>> 
>>> Let me hack up a debug patch for this, give me a minute.
>> 
>> This should do it - whenever this condition hits, you should see
>> something ala:
>> 
>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>> 
>> in dmesg. work_items is the number of work items we found that haven't
>> been scheduled yet. cur is what a worker is currently processing, and
>> next is basically a way for cancel to find a work item before it gets
>> assigned. work_items and next may get canceled, work_items should always
>> finish for storage IO, since signals don't interrupt them.
>> 
>> 
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..643b8e9de518 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>> 	struct io_wq *wq = worker->wq;
>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>> 
>> +	WARN_ON_ONCE(do_kill);
>> +
>> 	do {
>> 		struct io_wq_work *work;
>> 
>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void
>> *data)
>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>> {
>> 	do {
>> +		WARN_ON_ONCE(1);
>> 		work->flags |= IO_WQ_WORK_CANCEL;
>> 		wq->do_work(work);
>> 		work = wq->free_work(work);
>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work
>> *work)
>> 	 */
>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>> +		WARN_ON_ONCE(1);
>> 		io_run_cancel(work, wq);
>> 		return;
>> 	}
>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>> 	return 0;
>> }
>> 
>> +struct worker_lookup {
>> +	int cur_work;
>> +	int next_work;
>> +};
>> +
>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>> +{
>> +	struct worker_lookup *l = data;
>> +
>> +	raw_spin_lock(&worker->lock);
>> +	if (worker->cur_work)
>> +		l->cur_work++;
>> +	if (worker->next_work)
>> +		l->next_work++;
>> +	raw_spin_unlock(&worker->lock);
>> +	return false;
>> +}
>> +
>> +void io_wq_dump(struct io_uring_task *tctx)
>> +{
>> +	struct io_wq_work_node *node, *prev;
>> +	struct io_wq *wq = tctx->io_wq;
>> +	struct worker_lookup l = { };
>> +	int i, work_items;
>> +
>> +	if (!wq) {
>> +		printk("%s: no wq\n", __FUNCTION__);
>> +		return;
>> +	}
>> +
>> +	work_items = 0;
>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>> +
>> +		raw_spin_lock(&acct->lock);
>> +		wq_list_for_each(node, prev, &acct->work_list)
>> +			work_items++;
>> +		raw_spin_unlock(&acct->lock);
>> +	}
>> +
>> +	rcu_read_lock();
>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>> +	rcu_read_unlock();
>> +
>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>> +							l.cur_work, l.next_work);
>> +}
>> +
>> static __init int io_wq_init(void)
>> {
>> 	int ret;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index ed254076c723..c0bd35e5429a 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>> 
>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>> +		WARN_ON_ONCE(1);
>> fail:
>> +		WARN_ON_ONCE(1);
>> 		io_req_task_queue_fail(req, err);
>> 		return;
>> 	}
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index ee9f43bed49a..250ae820340c 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> 	do_exit((error_code&0xff)<<8);
>> }
>> 
>> +void io_wq_dump(struct io_uring_task *);
>> +
>> /*
>>  * Take down every thread in the group.  This is called by fatal signals
>>  * as well as by sys_exit_group (below).
>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>> 		else if (sig->group_exec_task)
>> 			exit_code = 0;
>> 		else {
>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>> +			    current->io_uring)
>> +				io_wq_dump(current->io_uring);
>> 			sig->group_exit_code = exit_code;
>> 			sig->flags = SIGNAL_GROUP_EXIT;
>> 			zap_other_threads(current);
> 
> Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit
> of a red herring.

Another data point on this...the tests run in a loop, approximately 2 seconds per test run.  For runs where things are *not* corrupted, I do not see the io_wq_dump message printed.  For runs that do show corruption, I see a batch of io_wq_dump: work_items=0, cur=0, next=0 messages.

I wonder if what we're actually seeing here is the corruption being detected by mariadb, and it self-terminating in userspace, hence the io_uring system seeing the sudden termination of the application and associated workers.
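
One way I can think of to check that theory is to log where the SIGKILL is actually generated.  A rough debug hack, purely illustrative and written against a recent kernel/signal.c (where the send path is __send_signal_locked(); older trees call it __send_signal(), and the sig/t parameter names below are assumed from that function), would be to drop something like this near the top of that function:

	/* debug only: who is generating SIGKILL for mariadbd? */
	if (sig == SIGKILL && !strncmp(t->comm, "mariadbd", 8))
		pr_info("SIGKILL for %s/%d from %s/%d\n",
			t->comm, task_pid_nr(t),
			current->comm, task_pid_nr(current));

An external kill (from mtr, or from mariadbd signalling itself via kill()) should show up there, while the internal group-exit path that just marks the other threads with SIGKILL pending would not, so it should at least separate the two cases.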

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:36                                           ` Timothy Pearson
  2023-11-08 18:51                                             ` Timothy Pearson
@ 2023-11-08 19:06                                             ` Jens Axboe
  2023-11-08 22:05                                               ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 19:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 11:36 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>
>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>> that and see if the corruption goes away:
>>>>>>
>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>
>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>> is different here.
>>>>>
>>>>>> Investigation continues.
>>>>>
>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>> hitting this in get_signal():
>>>>>
>>>>> 		/* Has this task already been marked for death? */
>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>> 		     signal->group_exec_task) {
>>>>> 			clear_siginfo(&ksig->info);
>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>> 			recalc_sigpending();
>>>>> 			goto fatal;
>>>>> 		}
>>>>>
>>>>> which is either exec (which we verified it is not), so I don't see
>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>
>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>> this point and they get canceled and hence never written. If this does
>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>> gets rid of it.
>>>>
>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>> unstarted requests?
>>>
>>> Let me hack up a debug patch for this, give me a minute.
>>
>> This should do it - whenever this condition hits, you should see
>> something ala:
>>
>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>>
>> in dmesg. work_items is the number of work items we found that haven't
>> been scheduled yet. cur is what a worker is currently processing, and
>> next is basically a way for cancel to find a work item before it gets
>> assigned. work_items and next may get canceled, work_items should always
>> finish for storage IO, since signals don't interrupt them.
>>
>>
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..643b8e9de518 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>> 	struct io_wq *wq = worker->wq;
>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>>
>> +	WARN_ON_ONCE(do_kill);
>> +
>> 	do {
>> 		struct io_wq_work *work;
>>
>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>> {
>> 	do {
>> +		WARN_ON_ONCE(1);
>> 		work->flags |= IO_WQ_WORK_CANCEL;
>> 		wq->do_work(work);
>> 		work = wq->free_work(work);
>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
>> 	 */
>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>> +		WARN_ON_ONCE(1);
>> 		io_run_cancel(work, wq);
>> 		return;
>> 	}
>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>> 	return 0;
>> }
>>
>> +struct worker_lookup {
>> +	int cur_work;
>> +	int next_work;
>> +};
>> +
>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>> +{
>> +	struct worker_lookup *l = data;
>> +
>> +	raw_spin_lock(&worker->lock);
>> +	if (worker->cur_work)
>> +		l->cur_work++;
>> +	if (worker->next_work)
>> +		l->next_work++;
>> +	raw_spin_unlock(&worker->lock);
>> +	return false;
>> +}
>> +
>> +void io_wq_dump(struct io_uring_task *tctx)
>> +{
>> +	struct io_wq_work_node *node, *prev;
>> +	struct io_wq *wq = tctx->io_wq;
>> +	struct worker_lookup l = { };
>> +	int i, work_items;
>> +
>> +	if (!wq) {
>> +		printk("%s: no wq\n", __FUNCTION__);
>> +		return;
>> +	}
>> +
>> +	work_items = 0;
>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>> +
>> +		raw_spin_lock(&acct->lock);
>> +		wq_list_for_each(node, prev, &acct->work_list)
>> +			work_items++;
>> +		raw_spin_unlock(&acct->lock);
>> +	}
>> +
>> +	rcu_read_lock();
>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>> +	rcu_read_unlock();
>> +
>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>> +							l.cur_work, l.next_work);
>> +}
>> +
>> static __init int io_wq_init(void)
>> {
>> 	int ret;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index ed254076c723..c0bd35e5429a 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>>
>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>> +		WARN_ON_ONCE(1);
>> fail:
>> +		WARN_ON_ONCE(1);
>> 		io_req_task_queue_fail(req, err);
>> 		return;
>> 	}
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index ee9f43bed49a..250ae820340c 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> 	do_exit((error_code&0xff)<<8);
>> }
>>
>> +void io_wq_dump(struct io_uring_task *);
>> +
>> /*
>>  * Take down every thread in the group.  This is called by fatal signals
>>  * as well as by sys_exit_group (below).
>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>> 		else if (sig->group_exec_task)
>> 			exit_code = 0;
>> 		else {
>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>> +			    current->io_uring)
>> +				io_wq_dump(current->io_uring);
>> 			sig->group_exit_code = exit_code;
>> 			sig->flags = SIGNAL_GROUP_EXIT;
>> 			zap_other_threads(current);
> 
> Unfortunately it's only returning work_items=0, cur=0, next=0, so that
> was a bit of a red herring.

Well that's probably a good thing, as it also didn't make a lot of sense
:-)

> I have been giving some thought to the CPU pinning of the workers, and
> one thing that may have been overlooked is that this could potentially
> force-serialize worker operations.  Did you just have to pin the io
> workers or did the workqueue also need to be pinned for the corruption
> to disappear?

Not sure I follow, the workers ARE the workqueue. For the pinning, I
just made sure that the workers are on the same CPU. I honestly don't
remember all the details there outside of what I can read back from the
emails I sent, it's been a while. My suspicion back then was that it was
some weird ppc cache aliasing effect with the copy into kernel memory
happening on cpu X, and then we immediately punt it to cpu Y for
processing.

I didn't see the corruption happening if I just forced the requests to
complete inline (eg always on the CPU being submitted, no punt to
io-wq), and I didn't see it if I ensured that the io-wq worker was
running on the same CPU as the submitter.
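
As an aside, if it's useful for narrowing this down on your end: the
"worker on the same CPU as the submitter" condition can be approximated
from userspace by restricting the ring's io-wq affinity, no kernel patch
needed.  A minimal sketch, assuming liburing 2.1+ and a test program that
is itself pinned to a single CPU (untested, just to show the knob):

#define _GNU_SOURCE
#include <sched.h>
#include <liburing.h>

/* Restrict this ring's io-wq workers to the CPU the submitter runs on. */
static int pin_iowq_to_current_cpu(struct io_uring *ring)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(sched_getcpu(), &mask);
	/* thin wrapper around io_uring_register(IORING_REGISTER_IOWQ_AFF) */
	return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
}

That only reproduces what the kernel-side pinning did if the submitting
task actually stays on that CPU, so it's a reproduction aid rather than a
fix, but it may make it easier to flip the behaviour on and off without
rebuilding the kernel.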

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:51                                             ` Timothy Pearson
@ 2023-11-08 19:08                                               ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 19:08 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 11:51 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 12:36:01 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>>
>>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>>> that and see if the corruption goes away:
>>>>>>>
>>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>>
>>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>>> is different here.
>>>>>>
>>>>>>> Investigation continues.
>>>>>>
>>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>>> hitting this in get_signal():
>>>>>>
>>>>>> 		/* Has this task already been marked for death? */
>>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>>> 		     signal->group_exec_task) {
>>>>>> 			clear_siginfo(&ksig->info);
>>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>>> 			recalc_sigpending();
>>>>>> 			goto fatal;
>>>>>> 		}
>>>>>>
>>>>>> which is either exec (which we verified it is not), so I don't see
>>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>>
>>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>>> this point and they get canceled and hence never written. If this does
>>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>>> gets rid of it.
>>>>>
>>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>>> unstarted requests?
>>>>
>>>> Let me hack up a debug patch for this, give me a minute.
>>>
>>> This should do it - whenever this condition hits, you should see
>>> something ala:
>>>
>>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>>>
>>> in dmesg. work_items is the number of work items we found that haven't
>>> been scheduled yet. cur is what a worker is currently processing, and
>>> next is basically a way for cancel to find a work item before it gets
>>> assigned. work_items and next may get canceled, work_items should always
>>> finish for storage IO, since signals don't interrupt them.
>>>
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..643b8e9de518 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>>> 	struct io_wq *wq = worker->wq;
>>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>>>
>>> +	WARN_ON_ONCE(do_kill);
>>> +
>>> 	do {
>>> 		struct io_wq_work *work;
>>>
>>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>>> {
>>> 	do {
>>> +		WARN_ON_ONCE(1);
>>> 		work->flags |= IO_WQ_WORK_CANCEL;
>>> 		wq->do_work(work);
>>> 		work = wq->free_work(work);
>>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
>>> 	 */
>>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>>> +		WARN_ON_ONCE(1);
>>> 		io_run_cancel(work, wq);
>>> 		return;
>>> 	}
>>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>>> 	return 0;
>>> }
>>>
>>> +struct worker_lookup {
>>> +	int cur_work;
>>> +	int next_work;
>>> +};
>>> +
>>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>>> +{
>>> +	struct worker_lookup *l = data;
>>> +
>>> +	raw_spin_lock(&worker->lock);
>>> +	if (worker->cur_work)
>>> +		l->cur_work++;
>>> +	if (worker->next_work)
>>> +		l->next_work++;
>>> +	raw_spin_unlock(&worker->lock);
>>> +	return false;
>>> +}
>>> +
>>> +void io_wq_dump(struct io_uring_task *tctx)
>>> +{
>>> +	struct io_wq_work_node *node, *prev;
>>> +	struct io_wq *wq = tctx->io_wq;
>>> +	struct worker_lookup l = { };
>>> +	int i, work_items;
>>> +
>>> +	if (!wq) {
>>> +		printk("%s: no wq\n", __FUNCTION__);
>>> +		return;
>>> +	}
>>> +
>>> +	work_items = 0;
>>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>>> +
>>> +		raw_spin_lock(&acct->lock);
>>> +		wq_list_for_each(node, prev, &acct->work_list)
>>> +			work_items++;
>>> +		raw_spin_unlock(&acct->lock);
>>> +	}
>>> +
>>> +	rcu_read_lock();
>>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>>> +	rcu_read_unlock();
>>> +
>>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>>> +							l.cur_work, l.next_work);
>>> +}
>>> +
>>> static __init int io_wq_init(void)
>>> {
>>> 	int ret;
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index ed254076c723..c0bd35e5429a 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>>>
>>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>>> +		WARN_ON_ONCE(1);
>>> fail:
>>> +		WARN_ON_ONCE(1);
>>> 		io_req_task_queue_fail(req, err);
>>> 		return;
>>> 	}
>>> diff --git a/kernel/exit.c b/kernel/exit.c
>>> index ee9f43bed49a..250ae820340c 100644
>>> --- a/kernel/exit.c
>>> +++ b/kernel/exit.c
>>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>>> 	do_exit((error_code&0xff)<<8);
>>> }
>>>
>>> +void io_wq_dump(struct io_uring_task *);
>>> +
>>> /*
>>>  * Take down every thread in the group.  This is called by fatal signals
>>>  * as well as by sys_exit_group (below).
>>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>>> 		else if (sig->group_exec_task)
>>> 			exit_code = 0;
>>> 		else {
>>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>>> +			    current->io_uring)
>>> +				io_wq_dump(current->io_uring);
>>> 			sig->group_exit_code = exit_code;
>>> 			sig->flags = SIGNAL_GROUP_EXIT;
>>> 			zap_other_threads(current);
>>
>> Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit
>> of a red herring.
> 
> Another data point on this...the tests run in a loop, approximately 2
> seconds per test run.  For runs where things are *not* corrupted, I do
> not see the io_wq_dump message printed.  For runs that do show
> corruption, I see a batch of io_wq_dump: work_items=0, cur=0, next=0
> messages.
> 
> I wonder if what we're actually seeing here is the corruption being
> detected by mariadb, and it self-terminating in userspace, hence the
> io_uring system seeing the sudden termination of the application and
> associated workers.

I ran it here on an x86-64 box, and I see the dump for every loop of
mtr. They are all zeroes here too, but I do see it for every one. I was
just assuming that SIGKILL is perhaps how it shuts a loop down, but this
is pure speculation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 19:06                                             ` Jens Axboe
@ 2023-11-08 22:05                                               ` Jens Axboe
  2023-11-08 22:15                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 22:05 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 12:06 PM, Jens Axboe wrote:
>> I have been giving some thought to the CPU pinning of the workers, and
>> one thing that may have been overlooked is that this could potentially
>> force-serialize worker operations.  Did you just have to pin the io
>> workers or did the workqueue also need to be pinned for the corruption
>> to disappear?
> 
> Not sure I follow, the workers ARE the workqueue. For the pinning, I
> just made sure that the workers are on the same CPU. I honestly don't
> remember all the details there outside of what I can read back from the
> emails I sent, it's been a while. My suspicion back then was that it was
> some weird ppc cache aliasing effect with the copy into kernel memory
> happening on cpu X, and then we immediately punt it to cpu Y for
> processing.

You could try something like this, though I still need to verify that we
never end up running it on the wrong CPU. But may be worth a shot, for
debug purposes.

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d3009d56af0b..3fc9912f6306 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -21,6 +21,7 @@ struct io_wq_work {
 	unsigned flags;
 	/* place it here instead of io_kiocb as it fills padding and saves 4B */
 	int cancel_seq;
+	int cpu;
 };
 
 struct io_fixed_file {
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..bdaae56c6517 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -59,6 +59,7 @@ struct io_worker {
 	unsigned long create_state;
 	struct callback_head create_work;
 	int create_index;
+	int last_cpu;
 
 	union {
 		struct rcu_head rcu;
@@ -286,6 +287,10 @@ static bool io_wq_activate_free_worker(struct io_wq *wq,
 			io_worker_release(worker);
 			continue;
 		}
+		if (worker->last_cpu != raw_smp_processor_id()) {
+			io_worker_release(worker);
+			continue;
+		}
 		/*
 		 * If the worker is already running, it's either already
 		 * starting work or finishing work. In either case, if it does
@@ -581,6 +586,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 		} else {
 			break;
 		}
+		if (work->cpu != worker->last_cpu)
+			printk("work cpu %d, me %d\n", work->cpu, worker->last_cpu);
 		io_assign_current_work(worker, work);
 		__set_current_state(TASK_RUNNING);
 
@@ -639,6 +646,7 @@ static int io_wq_worker(void *data)
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
+		worker->last_cpu = raw_smp_processor_id();
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		/*
@@ -664,7 +672,10 @@ static int io_wq_worker(void *data)
 		raw_spin_unlock(&wq->lock);
 		if (io_run_task_work())
 			continue;
+		worker->last_cpu = raw_smp_processor_id();
 		ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
+		if (worker->last_cpu != raw_smp_processor_id())
+			printk("was on %d, now %d\n", worker->last_cpu, raw_smp_processor_id());
 		if (signal_pending(current)) {
 			struct ksignal ksig;
 
@@ -725,9 +736,17 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 static void io_init_new_worker(struct io_wq *wq, struct io_worker *worker,
 			       struct task_struct *tsk)
 {
+	cpumask_var_t new_mask;
+
 	tsk->worker_private = worker;
 	worker->task = tsk;
-	set_cpus_allowed_ptr(tsk, wq->cpu_mask);
+
+	if (alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		cpumask_clear(new_mask);
+		cpumask_set_cpu(worker->last_cpu, new_mask);
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+	}
 
 	raw_spin_lock(&wq->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
@@ -835,6 +854,7 @@ static bool create_io_worker(struct io_wq *wq, int index)
 	refcount_set(&worker->ref, 1);
 	worker->wq = wq;
 	raw_spin_lock_init(&worker->lock);
+	worker->last_cpu = raw_smp_processor_id();
 	init_completion(&worker->ref_done);
 
 	if (index == IO_WQ_ACCT_BOUND)
@@ -928,6 +948,8 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	unsigned work_flags = work->flags;
 	bool do_create;
 
+	work->cpu = raw_smp_processor_id();
+
 	/*
 	 * If io-wq is exiting for this task, or if the request has explicitly
 	 * been marked as one that should not get executed, cancel it here.

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:05                                               ` Jens Axboe
@ 2023-11-08 22:15                                                 ` Timothy Pearson
  2023-11-08 22:18                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 22:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 4:05:50 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>> I have been giving some thought to the CPU pinning of the workers, and
>>> one thing that may have been overlooked is that this could potentially
>>> force-serialize worker operations.  Did you just have to pin the io
>>> workers or did the workqueue also need to be pinned for the corruption
>>> to disappear?
>> 
>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>> just made sure that the workers are on the same CPU. I honestly don't
>> remember all the details there outside of what I can read back from the
>> emails I sent, it's been a while. My suspicion back then was that it was
>> some weird ppc cache aliasing effect with the copy into kernel memory
>> happening on cpu X, and then we immediately punt it to cpu Y for
>> processing.
> 
> You could try something like this, though I still need to verify that we
> never end up running it on the wrong CPU. But may be worth a shot, for
> debug purposes.

Getting a bunch of:

[   25.268005] work cpu 5, me 23
[   25.269217] work cpu 5, me 23
[   25.269804] work cpu 5, me 23
[   25.269914] work cpu 5, me 23
[   25.306967] work cpu 22, me 41
[   25.308821] work cpu 22, me 41
[   25.310461] work cpu 22, me 39
[   25.310975] work cpu 22, me 39
[   25.310995] work cpu 39, me 41
[   25.311001] work cpu 39, me 22
[   25.313916] work cpu 39, me 22
[   25.320227] work cpu 39, me 22
[   25.328760] work cpu 33, me 39
[   25.331756] work cpu 33, me 39
[   28.159191] work cpu 27, me 39
[   28.160456] work cpu 27, me 39
[   28.160511] work cpu 27, me 39
[   28.160998] work cpu 39, me 27
[   28.162042] work cpu 39, me 27
[   34.422210] work cpu 0, me 17
[   34.423543] work cpu 0, me 17
[   38.144123] work cpu 14, me 8
[   38.144284] work cpu 14, me 8
[   38.146765] work cpu 14, me 8
[   38.146884] work cpu 14, me 8
[   38.148067] work cpu 14, me 8
[   38.154664] work cpu 14, me 8
[   38.159555] work cpu 35, me 14
[   38.159752] work cpu 35, me 14
[   38.161737] work cpu 35, me 14
[   38.162896] work cpu 35, me 8
[   38.164669] work cpu 35, me 14
[   38.167428] work cpu 35, me 14
[   38.167549] work cpu 35, me 14
[   38.169941] work cpu 35, me 14
[   38.170002] work cpu 35, me 14
[   38.175143] work cpu 35, me 14
[   41.859077] work cpu 5, me 28
[   41.859918] work cpu 5, me 28
[   41.860819] work cpu 28, me 5
[   41.862670] work cpu 28, me 5
[   41.864077] work cpu 28, me 5
[   41.867216] work cpu 28, me 5
[   41.867375] work cpu 28, me 5
[   41.873102] work cpu 28, me 5
[   41.879452] work cpu 28, me 5
[   49.012472] work cpu 5, me 26
[   49.014336] work cpu 5, me 26
[   49.015319] work cpu 5, me 26
[   62.911760] work cpu 18, me 0
[   62.913764] work cpu 18, me 0
[   62.915512] work cpu 22, me 18
[   62.917665] work cpu 22, me 18
[   62.917959] work cpu 22, me 0
[   70.677984] work cpu 20, me 2
[   70.679304] work cpu 20, me 2
[   70.679375] work cpu 20, me 2
[   70.682036] work cpu 20, me 2
[   70.684209] work cpu 2, me 20
[   70.685511] work cpu 2, me 20
[   70.686503] work cpu 2, me 20
[   74.446195] work cpu 7, me 2
[   74.446473] work cpu 7, me 2
[   74.448095] work cpu 7, me 2
[   74.457561] work cpu 7, me 2
[   74.471382] work cpu 25, me 7
[   74.471970] work cpu 25, me 7
[   74.474480] work cpu 25, me 7
[   77.486199] work cpu 33, me 3
[   77.488119] work cpu 33, me 3
[   77.551875] work cpu 33, me 5
[   77.553058] work cpu 5, me 33
[   77.554057] work cpu 5, me 33
[   77.555447] work cpu 5, me 33
[   87.688015] work cpu 11, me 26
[   87.738868] work cpu 2, me 23
[   87.740051] work cpu 2, me 23
[   87.740298] work cpu 23, me 2
[   87.741811] work cpu 23, me 2
[   90.848175] work cpu 34, me 1
[   90.848866] work cpu 34, me 1
[   90.850100] work cpu 34, me 1
[   90.852995] work cpu 34, me 1
[   94.206858] work cpu 26, me 32
[   94.207193] work cpu 26, me 32
[   94.212528] work cpu 38, me 26
[   94.213082] work cpu 38, me 26
[  101.536345] work cpu 36, me 7
[  101.537108] work cpu 36, me 7
[  101.538528] work cpu 36, me 7
[  101.538710] work cpu 36, me 7
[  101.539628] work cpu 36, me 7
[  101.541145] work cpu 36, me 7
[  101.543162] work cpu 36, me 7
[  101.543241] work cpu 36, me 7
[  101.545231] work cpu 26, me 7
[  101.545292] work cpu 26, me 7
[  101.546434] work cpu 26, me 7
[  101.547094] work cpu 26, me 7
[  101.547408] work cpu 26, me 7
[  101.548091] work cpu 26, me 36
[  104.732140] work cpu 4, me 26
[  104.732518] work cpu 4, me 26
[  104.734766] work cpu 4, me 26
[  104.734877] work cpu 4, me 26
[  104.736113] work cpu 4, me 26
[  104.736401] work cpu 4, me 26
[  107.962200] work cpu 3, me 4
[  107.962338] work cpu 3, me 4
[  107.963646] work cpu 3, me 4
[  107.963752] work cpu 3, me 4
[  107.965676] work cpu 3, me 4
[  107.966404] work cpu 3, me 4
[  107.967589] work cpu 3, me 4
[  107.967638] work cpu 3, me 4
[  107.970517] work cpu 13, me 3
[  107.973168] work cpu 13, me 4
[  107.975058] work cpu 13, me 4
[  107.975133] work cpu 13, me 4
[  107.977069] work cpu 13, me 3
[  107.977617] work cpu 13, me 4
[  107.979458] work cpu 13, me 4
[  107.980097] work cpu 13, me 4
[  107.980746] work cpu 13, me 4
[  107.982719] work cpu 13, me 4
[  107.983298] work cpu 13, me 4
[  107.984179] work cpu 13, me 3
[  107.986206] work cpu 13, me 4
[  107.987192] work cpu 27, me 13
[  107.988974] work cpu 27, me 13
[  107.990738] work cpu 27, me 13
[  107.991316] work cpu 27, me 13
[  107.991608] work cpu 27, me 4
[  107.992647] work cpu 27, me 3
[  107.993783] work cpu 27, me 3
[  107.995322] work cpu 27, me 4
[  107.995893] work cpu 27, me 3
[  107.996312] work cpu 27, me 13
[  111.608060] work cpu 2, me 27
[  125.790232] work cpu 37, me 1
[  125.790587] work cpu 37, me 1
[  125.791879] work cpu 37, me 1
[  125.794301] work cpu 37, me 1
[  132.545885] work cpu 22, me 37
[  132.548194] work cpu 22, me 37
[  132.549514] work cpu 22, me 37
[  132.550110] work cpu 22, me 37
[  132.550695] work cpu 22, me 37
[  132.551730] work cpu 22, me 37
[  132.552136] work cpu 22, me 37
[  132.564102] work cpu 31, me 22
[  132.564324] work cpu 31, me 22
[  138.957170] work cpu 29, me 39
[  138.958278] work cpu 29, me 39
[  138.959922] work cpu 29, me 39
[  138.961734] work cpu 29, me 39
[  138.963112] work cpu 37, me 29
[  138.963235] work cpu 37, me 29
[  138.965017] work cpu 37, me 29
[  138.965071] work cpu 37, me 39
[  138.967289] work cpu 37, me 29
[  138.967366] work cpu 37, me 29
[  138.970359] work cpu 37, me 29
[  138.970420] work cpu 37, me 29
[  138.972597] work cpu 29, me 37
[  138.973354] work cpu 29, me 37
[  138.973656] work cpu 29, me 37
[  138.974402] work cpu 29, me 37
[  138.975024] work cpu 29, me 37
[  153.025008] work cpu 32, me 10
[  153.026528] work cpu 32, me 10
[  153.027503] work cpu 32, me 10
[  153.029120] work cpu 32, me 10

They happen both for corrupt and non-corrupt runs.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:15                                                 ` Timothy Pearson
@ 2023-11-08 22:18                                                   ` Jens Axboe
  2023-11-08 22:28                                                     ` Timothy Pearson
  2023-11-08 23:58                                                     ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 22:18 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 3:15 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 4:05:50 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>>> I have been giving some thought to the CPU pinning of the workers, and
>>>> one thing that may have been overlooked is that this could potentially
>>>> force-serialize worker operations.  Did you just have to pin the io
>>>> workers or did the workqueue also need to be pinned for the corruption
>>>> to disappear?
>>>
>>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>>> just made sure that the workers are on the same CPU. I honestly don't
>>> remember all the details there outside of what I can read back from the
>>> emails I sent, it's been a while. My suspicion back then was that it was
>>> some weird ppc cache aliasing effect with the copy into kernel memory
>>> happening on cpu X, and then we immediately punt it to cpu Y for
>>> processing.
>>
>> You could try something like this, though I still need to verify that we
>> never end up running it on the wrong CPU. But may be worth a shot, for
>> debug purposes.
> 
> Getting a bunch of:
> 
> [   25.268005] work cpu 5, me 23
> [   25.269217] work cpu 5, me 23
> [   25.269804] work cpu 5, me 23

[snip]

> [  153.029120] work cpu 32, me 10
> 
> They happen both for corrupt and non-corrupt runs.

OK, let me actually test this thing and see if I can make it solid
first...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:18                                                   ` Jens Axboe
@ 2023-11-08 22:28                                                     ` Timothy Pearson
  2023-11-08 23:58                                                     ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 22:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 4:18:20 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 3:15 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 4:05:50 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>>>> I have been giving some thought to the CPU pinning of the workers, and
>>>>> one thing that may have been overlooked is that this could potentially
>>>>> force-serialize worker operations.  Did you just have to pin the io
>>>>> workers or did the workqueue also need to be pinned for the corruption
>>>>> to disappear?
>>>>
>>>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>>>> just made sure that the workers are on the same CPU. I honestly don't
>>>> remember all the details there outside of what I can read back from the
>>>> emails I sent, it's been a while. My suspicion back then was that it was
>>>> some weird ppc cache aliasing effect with the copy into kernel memory
>>>> happening on cpu X, and then we immediately punt it to cpu Y for
>>>> processing.
>>>
>>> You could try something like this, though I still need to verify that we
>>> never end up running it on the wrong CPU. But may be worth a shot, for
>>> debug purposes.
>> 
>> Getting a bunch of:
>> 
>> [   25.268005] work cpu 5, me 23
>> [   25.269217] work cpu 5, me 23
>> [   25.269804] work cpu 5, me 23
> 
> [snip]
> 
>> [  153.029120] work cpu 32, me 10
>> 
>> They happen both for corrupt and non-corrupt runs.
> 
> OK, let me actually test this thing and see if I can make it solid
> first...

No problem.  I'll test as soon as you have it stable on your side.

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:18                                                   ` Jens Axboe
  2023-11-08 22:28                                                     ` Timothy Pearson
@ 2023-11-08 23:58                                                     ` Jens Axboe
  2023-11-09 15:12                                                       ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 23:58 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 3:18 PM, Jens Axboe wrote:
> OK, let me actually test this thing and see if I can make it solid
> first...

Here's a suitable hack - it just creates a new io worker for each item,
ensuring that that worker is run on the same CPU.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..3a14bc35f9ba 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -921,6 +921,67 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 	return work == data;
 }
 
+struct work_data {
+	struct io_wq_work *work;
+	struct io_wq *wq;
+	int cpu;
+};
+
+static int io_wq_single_worker(void *data)
+{
+	struct work_data *wd = data;
+	struct io_wq_work *work = wd->work;
+	struct io_wq *wq = wd->wq;
+
+	WARN_ON_ONCE(wd->cpu != raw_smp_processor_id());
+	kfree(wd);
+	do {
+		if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
+			work->flags |= IO_WQ_WORK_CANCEL;
+		wq->do_work(work);
+		work = wq->free_work(work);
+	} while (work);
+
+	do_exit(0);
+	return 0;
+}
+
+static int io_wq_create_single(struct io_wq *wq, struct io_wq_work *work)
+{
+	struct task_struct *tsk;
+	struct work_data *wd;
+	cpumask_var_t new_mask;
+
+	wd = kmalloc(sizeof(*wd), GFP_NOIO);
+	if (!wd)
+		return false;
+
+	if (!alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		kfree(wd);
+		return false;
+	}
+
+	wd->work = work;
+	wd->cpu = raw_smp_processor_id();
+	wd->wq = wq;
+
+	cpumask_clear(new_mask);
+	cpumask_set_cpu(wd->cpu, new_mask);
+
+	tsk = create_io_thread(io_wq_single_worker, wd, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		tsk->worker_private = tsk;
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+		wake_up_new_task(tsk);
+		return true;
+	}
+
+	free_cpumask_var(new_mask);
+	kfree(wd);
+	return false;
+}
+
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
 	struct io_wq_acct *acct = io_work_get_acct(wq, work);
@@ -938,6 +999,10 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 		return;
 	}
 
+	if (io_wq_create_single(wq, work))
+		return;
+
+	WARN_ON_ONCE(1);
 	raw_spin_lock(&acct->lock);
 	io_wq_insert_work(wq, work);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 23:58                                                     ` Jens Axboe
@ 2023-11-09 15:12                                                       ` Jens Axboe
  2023-11-09 17:00                                                         ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 15:12 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 4:58 PM, Jens Axboe wrote:
> On 11/8/23 3:18 PM, Jens Axboe wrote:
>> OK, let me actually test this thing and see if I can make it solid
>> first...
> 
> Here's a suitable hack - it just creates a new io worker for each item,
> ensuring that that worker is run on the same CPU.

Turns out I didn't send you the current one, this one was older and
untested. Please try this one instead.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..41b4e281db8c 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -178,6 +178,8 @@ bool io_wq_worker_stopped(void)
 {
 	struct io_worker *worker = current->worker_private;
 
+	if (current->worker_private == current)
+		return false;
 	if (WARN_ON_ONCE(!io_wq_current_is_worker()))
 		return true;
 
@@ -693,7 +695,7 @@ void io_wq_worker_running(struct task_struct *tsk)
 {
 	struct io_worker *worker = tsk->worker_private;
 
-	if (!worker)
+	if (!worker || tsk->worker_private == tsk)
 		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
@@ -711,7 +713,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 {
 	struct io_worker *worker = tsk->worker_private;
 
-	if (!worker)
+	if (!worker || tsk->worker_private == tsk)
 		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
@@ -921,6 +923,67 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 	return work == data;
 }
 
+struct work_data {
+	struct io_wq_work *work;
+	struct io_wq *wq;
+	int cpu;
+};
+
+static int io_wq_single_worker(void *data)
+{
+	struct work_data *wd = data;
+	struct io_wq_work *work = wd->work;
+	struct io_wq *wq = wd->wq;
+
+	WARN_ON_ONCE(wd->cpu != raw_smp_processor_id());
+	kfree(wd);
+	do {
+		if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
+			work->flags |= IO_WQ_WORK_CANCEL;
+		wq->do_work(work);
+		work = wq->free_work(work);
+	} while (work);
+
+	do_exit(0);
+	return 0;
+}
+
+static int io_wq_create_single(struct io_wq *wq, struct io_wq_work *work)
+{
+	struct task_struct *tsk;
+	struct work_data *wd;
+	cpumask_var_t new_mask;
+
+	wd = kmalloc(sizeof(*wd), GFP_NOIO);
+	if (!wd)
+		return false;
+
+	if (!alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		kfree(wd);
+		return false;
+	}
+
+	wd->work = work;
+	wd->cpu = raw_smp_processor_id();
+	wd->wq = wq;
+
+	cpumask_clear(new_mask);
+	cpumask_set_cpu(wd->cpu, new_mask);
+
+	tsk = create_io_thread(io_wq_single_worker, wd, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		tsk->worker_private = tsk;
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+		wake_up_new_task(tsk);
+		return true;
+	}
+
+	free_cpumask_var(new_mask);
+	kfree(wd);
+	return false;
+}
+
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
 	struct io_wq_acct *acct = io_work_get_acct(wq, work);
@@ -938,6 +1001,10 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 		return;
 	}
 
+	if (io_wq_create_single(wq, work))
+		return;
+
+	WARN_ON_ONCE(1);
 	raw_spin_lock(&acct->lock);
 	io_wq_insert_work(wq, work);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 15:12                                                       ` Jens Axboe
@ 2023-11-09 17:00                                                         ` Timothy Pearson
  2023-11-09 17:17                                                           ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 9:12:07 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 4:58 PM, Jens Axboe wrote:
>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>> OK, let me actually test this thing and see if I can make it solid
>>> first...
>> 
>> Here's a suitable hack - it just creates a new io worker for each item,
>> ensuring that that worker is run on the same CPU.
> 
> Turns out I didn't send you the current one, this one was older and
> untested. Please try this one instead.

I'm still seeing an oops with this newer patchset applied:
[   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid: 0)
[   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
[   76.419887] Faulting instruction address: 0xc0000000008374d4
[   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
<snip>
[   76.429125] Call Trace:
[   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150 (unreliable)
[   76.429190] [c000000009a8f870] [c000000000ead8f0] schedule_timeout+0x170/0x1e0
[   76.429238] [c000000009a8f940] [c000000000ea3fc8] io_schedule_timeout+0x68/0xa0
[   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
[   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
[   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
[   76.429415] [c000000009a8fb90] [c008000000514eac] ext4_file_write_iter+0x814/0xdb8 [ext4]
[   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
[   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
[   76.429554] [c000000009a8fdd0] [c000000000818ab4] io_wq_submit_work+0x1c4/0x2c0
[   76.429601] [c000000009a8fe20] [c000000000835578] io_wq_single_worker+0x88/0xb0
[   76.429652] [c000000009a8fe50] [c00000000000df3c] ret_from_kernel_user_thread+0x14/0x1c

If you're not seeing an oops, that's an interesting difference that I can investigate further.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:00                                                         ` Timothy Pearson
@ 2023-11-09 17:17                                                           ` Jens Axboe
  2023-11-09 17:24                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:00 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:12:07 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>> OK, let me actually test this thing and see if I can make it solid
>>>> first...
>>>
>>> Here's a suitable hack - it just creates a new io worker for each item,
>>> ensuring that that worker is run on the same CPU.
>>
>> Turns out I didn't send you the current one, this one was older and
>> untested. Please try this one instead.
> 
> I'm still seeing an oops with this newer patchset applied:
> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid: 0)
> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
> [   76.419887] Faulting instruction address: 0xc0000000008374d4
> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
> <snip>
> [   76.429125] Call Trace:
> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150 (unreliable)
> [   76.429190] [c000000009a8f870] [c000000000ead8f0] schedule_timeout+0x170/0x1e0
> [   76.429238] [c000000009a8f940] [c000000000ea3fc8] io_schedule_timeout+0x68/0xa0
> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
> [   76.429415] [c000000009a8fb90] [c008000000514eac] ext4_file_write_iter+0x814/0xdb8 [ext4]
> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
> [   76.429554] [c000000009a8fdd0] [c000000000818ab4] io_wq_submit_work+0x1c4/0x2c0
> [   76.429601] [c000000009a8fe20] [c000000000835578] io_wq_single_worker+0x88/0xb0
> [   76.429652] [c000000009a8fe50] [c00000000000df3c] ret_from_kernel_user_thread+0x14/0x1c
> 
> If you're not seeing an oops, that's an interesting difference that I
> can investigate further.

Are you sure that's with the latest patch? Because that looks like
something that'd happen with the buggy one I sent out, as the sleep/wake
handling doesn't properly handle the specialized workers.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:17                                                           ` Jens Axboe
@ 2023-11-09 17:24                                                             ` Timothy Pearson
  2023-11-09 17:30                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:17:03 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>> first...
>>>>
>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>> ensuring that that worker is run on the same CPU.
>>>
>>> Turns out I didn't send you the current one, this one was older and
>>> untested. Please try this one instead.
>> 
>> I'm still seeing an oops with this newer patchset applied:
>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>> 0)
>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>> <snip>
>> [   76.429125] Call Trace:
>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>> (unreliable)
>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>> schedule_timeout+0x170/0x1e0
>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>> io_schedule_timeout+0x68/0xa0
>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>> io_wq_submit_work+0x1c4/0x2c0
>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>> io_wq_single_worker+0x88/0xb0
>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>> ret_from_kernel_user_thread+0x14/0x1c
>> 
>> If you're not seeing an oops, that's an interesting difference that I
>> can investigate further.
> 
> Are you sure that's with the latest patch? Because that looks like
> something that'd happen with the buggy one I sent out, as the sleep/wake
> handling doesn't properly handle the specialized workers.
> 
> --
> Jens Axboe

You're right, the new patch wasn't applied; somehow it didn't copy over from my clipboard.  Apologies for the noise.

That said, the new patch is still showing the data corruption.  Maybe the pinning was just introducing the same timing alterations that my udelay() does on specific kernel builds?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:24                                                             ` Timothy Pearson
@ 2023-11-09 17:30                                                               ` Jens Axboe
  2023-11-09 17:36                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:30 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:24 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 11:17:03 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>>> first...
>>>>>
>>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>>> ensuring that that worker is run on the same CPU.
>>>>
>>>> Turns out I didn't send you the current one, this one was older and
>>>> untested. Please try this one instead.
>>>
>>> I'm still seeing an oops with this newer patchset applied:
>>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>>> 0)
>>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>>> <snip>
>>> [   76.429125] Call Trace:
>>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>>> (unreliable)
>>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>>> schedule_timeout+0x170/0x1e0
>>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>>> io_schedule_timeout+0x68/0xa0
>>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>>> io_wq_submit_work+0x1c4/0x2c0
>>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>>> io_wq_single_worker+0x88/0xb0
>>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>>> ret_from_kernel_user_thread+0x14/0x1c
>>>
>>> If you're not seeing an oops, that's an interesting difference that I
>>> can investigate further.
>>
>> Are you sure that's with the latest patch? Because that looks like
>> something that'd happen with the buggy one I sent out, as the sleep/wake
>> handling doesn't properly handle the specialized workers.
>>
>> --
>> Jens Axboe
> 
> You're right, the new patch wasn't applied, somehow it didn't copy
> over from my clipboard.  Apologies for the noise.
> 
> That said, the new patch is still showing the data corruption.  Maybe
> the pinning was just introducing the same timing alterations that my
> udelay() does on specific kernel builds?

Hmm ok, that is odd. How quickly does it trigger for you?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:30                                                               ` Jens Axboe
@ 2023-11-09 17:36                                                                 ` Timothy Pearson
  2023-11-09 17:38                                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:30:31 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:24 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 11:17:03 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>>>> first...
>>>>>>
>>>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>>>> ensuring that that worker is run on the same CPU.
>>>>>
>>>>> Turns out I didn't send you the current one, this one was older and
>>>>> untested. Please try this one instead.
>>>>
>>>> I'm still seeing an oops with this newer patchset applied:
>>>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>>>> 0)
>>>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>>>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>>>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>>>> <snip>
>>>> [   76.429125] Call Trace:
>>>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>>>> (unreliable)
>>>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>>>> schedule_timeout+0x170/0x1e0
>>>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>>>> io_schedule_timeout+0x68/0xa0
>>>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>>>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>>>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>>>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>>>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>>>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>>>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>>>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>>>> io_wq_submit_work+0x1c4/0x2c0
>>>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>>>> io_wq_single_worker+0x88/0xb0
>>>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>>>> ret_from_kernel_user_thread+0x14/0x1c
>>>>
>>>> If you're not seeing an oops, that's an interesting difference that I
>>>> can investigate further.
>>>
>>> Are you sure that's with the latest patch? Because that looks like
>>> something that'd happen with the buggy one I sent out, as the sleep/wake
>>> handling doesn't properly handle the specialized workers.
>>>
>>> --
>>> Jens Axboe
>> 
>> You're right, the new patch wasn't applied, somehow it didn't copy
>> over from my clipboard.  Apologies for the noise.
>> 
>> That said, the new patch is still showing the data corruption.  Maybe
>> the pinning was just introducing the same timing alterations that my
>> udelay() does on specific kernel builds?
> 
> Hmm ok, that is odd. How quickly does it trigger for you?

Almost immediately.  Third pass in this case, which is pretty typical -- without any delays etc. it will normally fail in under 10 iterations.  When I add other delays in e.g. the worker thread it becomes less likely to fail, but still fails.   Only adding the udelay in the one specific (IRQ locked) location allows it to pass all 200 iterations.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:36                                                                 ` Timothy Pearson
@ 2023-11-09 17:38                                                                   ` Jens Axboe
  2023-11-09 17:42                                                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:38 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:36 AM, Timothy Pearson wrote:
>> Hmm ok, that is odd. How quickly does it trigger for you?
> 
> Almost immediately.  Third pass in this case, which is pretty typical
> -- without any delays etc. it will normally fail in under 10
> iterations.  When I add other delays in e.g. the worker thread it
> becomes less likely to fail, but still fails.   Only adding the udelay
> in the one specific (IRQ locked) location allows it to pass all 200
> iterations.

What specific IRQ locked location? Hard to know where you are adding
these things without seeing the patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:38                                                                   ` Jens Axboe
@ 2023-11-09 17:42                                                                     ` Timothy Pearson
  2023-11-09 17:45                                                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:38:10 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>> Hmm ok, that is odd. How quickly does it trigger for you?
>> 
>> Almost immediately.  Third pass in this case, which is pretty typical
>> -- without any delays etc. it will normally fail in under 10
>> iterations.  When I add other delays in e.g. the worker thread it
>> becomes less likely to fail, but still fails.   Only adding the udelay
>> in the one specific (IRQ locked) location allows it to pass all 200
>> iterations.
> 
> What specific IRQ locked location? Hard to know where you are adding
> these things without seeing the patches.
> 
> --
> Jens Axboe

Sorry about that.  Here's the patch that automagically makes things work (bear in mind line numbers etc. are munged since I have a bunch of other debug stuff in here at this point):

 static void io_wqe_dec_running(struct io_worker *worker)
@@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker *worker)
        if (!(worker->flags & IO_WORKER_F_UP))
                return;

+       udelay(1000);
        if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
                atomic_inc(&acct->nr_running);
                atomic_inc(&wqe->wq->worker_refs);

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:42                                                                     ` Timothy Pearson
@ 2023-11-09 17:45                                                                       ` Jens Axboe
  2023-11-09 18:20                                                                         ` tpearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:45 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:42 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 11:38:10 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>>> Hmm ok, that is odd. How quickly does it trigger for you?
>>>
>>> Almost immediately.  Third pass in this case, which is pretty typical
>>> -- without any delays etc. it will normally fail in under 10
>>> iterations.  When I add other delays in e.g. the worker thread it
>>> becomes less likely to fail, but still fails.   Only adding the udelay
>>> in the one specific (IRQ locked) location allows it to pass all 200
>>> iterations.
>>
>> What specific IRQ locked location? Hard to know where you are adding
>> these things without seeing the patches.
>>
>> --
>> Jens Axboe
> 
> Sorry about that.  Here's the patch that automagically makes things
> work (bear in mind line numbers etc. are munged since I have a bunch
> of other debug stuff in here at this point):
> 
>  static void io_wqe_dec_running(struct io_worker *worker)
> @@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker *worker)
>         if (!(worker->flags & IO_WORKER_F_UP))
>                 return;
> 
> +       udelay(1000);
>         if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
>                 atomic_inc(&acct->nr_running);
>                 atomic_inc(&wqe->wq->worker_refs);

Ah I see, you're on older stable? What specifically?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:45                                                                       ` Jens Axboe
@ 2023-11-09 18:20                                                                         ` tpearson
  2023-11-10  3:51                                                                           ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: tpearson @ 2023-11-09 18:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

> On 11/9/23 10:42 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 11:38:10 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>>>> Hmm ok, that is odd. How quickly does it trigger for you?
>>>>
>>>> Almost immediately.  Third pass in this case, which is pretty typical
>>>> -- without any delays etc. it will normally fail in under 10
>>>> iterations.  When I add other delays in e.g. the worker thread it
>>>> becomes less likely to fail, but still fails.   Only adding the udelay
>>>> in the one specific (IRQ locked) location allows it to pass all 200
>>>> iterations.
>>>
>>> What specific IRQ locked location? Hard to know where you are adding
>>> these things without seeing the patches.
>>>
>>> --
>>> Jens Axboe
>>
>> Sorry about that.  Here's the patch that automagically makes things
>> work (bear in mind line numbers etc. are munged since I have a bunch
>> of other debug stuff in here at this point):
>>
>>  static void io_wqe_dec_running(struct io_worker *worker)
>> @@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker
>> *worker)
>>         if (!(worker->flags & IO_WORKER_F_UP))
>>                 return;
>>
>> +       udelay(1000);
>>         if (atomic_dec_and_test(&acct->nr_running) &&
>> io_wqe_run_queue(wqe)) {
>>                 atomic_inc(&acct->nr_running);
>>                 atomic_inc(&wqe->wq->worker_refs);
>
> Ah I see, you're on older stable? What specifically?

I run two test branches, one right after the issues showed up (5.12-rc7+),
and one basically at GIT master (6.6+).  I tend to test on 5.12 just so
that we're not inadvertently compounding issues; once we figure out what
is wrong on 5.12, I can try to forward-port to 6.6.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 18:20                                                                         ` tpearson
@ 2023-11-10  3:51                                                                           ` Jens Axboe
  2023-11-10  4:35                                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-10  3:51 UTC (permalink / raw)
  To: tpearson; +Cc: regressions, Pavel Begunkov

Just to go back to basics, can you try this one? It'll do the exact same
retry that io-wq is doing, just from the same task itself. If this
fails, then something core is wrong. I don't think it will, or we'd see
this on other platforms too of course. If this works, then it validates
that it's some oddity on ppc with punting this operation to a thread off
this main task.

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 64390d4e20c1..1d760570df04 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
-int io_write(struct io_kiocb *req, unsigned int issue_flags)
+static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	struct io_rw_state __s, *s = &__s;
@@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+int io_write(struct io_kiocb *req, unsigned int issue_flags)
+{
+	int ret;
+
+	ret = __io_write(req, issue_flags);
+	if (ret != -EAGAIN)
+		return ret;
+
+	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
+	WARN_ON_ONCE(ret == -EAGAIN);
+	return ret;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  3:51                                                                           ` Jens Axboe
@ 2023-11-10  4:35                                                                             ` Timothy Pearson
  2023-11-10  6:48                                                                               ` Timothy Pearson
  2023-11-10 14:48                                                                               ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-10  4:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 9:51:09 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> Just to go back to basics, can you try this one? It'll do the exact same
> retry that io-wq is doing, just from the same task itself. If this
> fails, then something core is wrong. I don't think it will, or we'd see
> this on other platforms too of course. If this works, then it validates
> that it's some oddity on ppc with punting this operation to a thread off
> this main task.
> 
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 64390d4e20c1..1d760570df04 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
> issue_flags)
> 	return IOU_OK;
> }
> 
> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
> {
> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> 	struct io_rw_state __s, *s = &__s;
> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
> issue_flags)
> 	return ret;
> }
> 
> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	int ret;
> +
> +	ret = __io_write(req, issue_flags);
> +	if (ret != -EAGAIN)
> +		return ret;
> +
> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
> +	WARN_ON_ONCE(ret == -EAGAIN);
> +	return ret;
> +}
> +
> void io_rw_fail(struct io_kiocb *req)
> {
> 	int res;
> 

That does indeed "fix" the corruption issue.

Where is the punting actually taking place?  I can see at least one location but if it's a general issue with the punting process I should probably apply any test mitigations to all locations, and I'm not familiar enough with the codebase to be sure I've got them all...

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  4:35                                                                             ` Timothy Pearson
@ 2023-11-10  6:48                                                                               ` Timothy Pearson
  2023-11-10 14:52                                                                                 ` Jens Axboe
  2023-11-10 14:48                                                                               ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-10  6:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 10:35:08 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:51:09 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> Just to go back to basics, can you try this one? It'll do the exact same
>> retry that io-wq is doing, just from the same task itself. If this
>> fails, then something core is wrong. I don't think it will, or we'd see
>> this on other platforms too of course. If this works, then it validates
>> that it's some oddity on ppc with punting this operation to a thread off
>> this main task.
>> 
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 64390d4e20c1..1d760570df04 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return IOU_OK;
>> }
>> 
>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>> {
>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> 	struct io_rw_state __s, *s = &__s;
>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return ret;
>> }
>> 
>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	int ret;
>> +
>> +	ret = __io_write(req, issue_flags);
>> +	if (ret != -EAGAIN)
>> +		return ret;
>> +
>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>> +	WARN_ON_ONCE(ret == -EAGAIN);
>> +	return ret;
>> +}
>> +
>> void io_rw_fail(struct io_kiocb *req)
>> {
>> 	int res;
>> 
> 
> That does indeed "fix" the corruption issue.
> 
> Where is the punting actually taking place?  I can see at least one location but
> if it's a general issue with the punting process I should probably apply any
> test mitigations to all locations, and I'm not familiar enough with the
> codebase to be sure I've got them all...
> 
> Thanks!

I've been exploring a bunch of other possibilities, and one that has been slowly coalescing is whether we're triggering a bug somewhere else in the kernel.  Now that I know the io_write call is somehow related to this issue, I went back and went over some of the earlier logs, and might have found something.

When I enable KCSAN I sporadically see this type of race:

[ 1549.152381] ==================================================================
[ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
[ 1549.152609]
[ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on cpu 27:
[ 1549.153193]  dd_has_work+0x160/0x1b0
[ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
[ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
[ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
[ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
[ 1549.153556]  __blk_flush_plug+0x2bc/0x360
[ 1549.153622]  blk_finish_plug+0x60/0xa0
[ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
[ 1549.153759]  iomap_dio_rw+0x80/0xf0
[ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
[ 1549.154249]  io_write+0x4bc/0x900
[ 1549.154309]  io_issue_sqe+0x12c/0x5f0
[ 1549.154370]  io_submit_sqes+0xdd4/0x1050
[ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
[ 1549.154499]  system_call_exception+0x354/0x400
[ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
[ 1549.154651]
[ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
[ 1549.154757]  dd_insert_requests+0x81c/0xac0
[ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
[ 1549.154902]  __blk_flush_plug+0x2bc/0x360
[ 1549.154968]  blk_finish_plug+0x60/0xa0
[ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
[ 1549.155100]  iomap_dio_rw+0x80/0xf0
[ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
[ 1549.155563]  io_write+0x4bc/0x900
[ 1549.155606]  io_issue_sqe+0x12c/0x5f0
[ 1549.155648]  io_wq_submit_work+0x2e4/0x490
[ 1549.155692]  io_worker_handle_work+0xbac/0x1020
[ 1549.155745]  io_wq_worker+0x224/0x7b0
[ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
[ 1549.155841]
[ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
[ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
[ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[ 1549.156032] ==================================================================

Notably, the io_write calls are in the chain, and there were exactly two of these races and two test failures in this test run.

io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.

Do you see similar races on x86 with this workload if you enable KCSAN?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  4:35                                                                             ` Timothy Pearson
  2023-11-10  6:48                                                                               ` Timothy Pearson
@ 2023-11-10 14:48                                                                               ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-10 14:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 9:35 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:51:09 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> Just to go back to basics, can you try this one? It'll do the exact same
>> retry that io-wq is doing, just from the same task itself. If this
>> fails, then something core is wrong. I don't think it will, or we'd see
>> this on other platforms too of course. If this works, then it validates
>> that it's some oddity on ppc with punting this operation to a thread off
>> this main task.
>>
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 64390d4e20c1..1d760570df04 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return IOU_OK;
>> }
>>
>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>> {
>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> 	struct io_rw_state __s, *s = &__s;
>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return ret;
>> }
>>
>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	int ret;
>> +
>> +	ret = __io_write(req, issue_flags);
>> +	if (ret != -EAGAIN)
>> +		return ret;
>> +
>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>> +	WARN_ON_ONCE(ret == -EAGAIN);
>> +	return ret;
>> +}
>> +
>> void io_rw_fail(struct io_kiocb *req)
>> {
>> 	int res;
>>
> 
> That does indeed "fix" the corruption issue.
> 
> Where is the punting actually taking place?  I can see at least one
> location but if it's a general issue with the punting process I should
> probably apply any test mitigations to all locations, and I'm not
> familiar enough with the codebase to be sure I've got them all...

Usually io_write() would return -EAGAIN if it cannot perform the
operation nonblocking, in which case we'd ultimately end up in
io_req_task_submit() -> io_queue_iowq() -> io_wq_enqueue() and the
latter would insert it into the pending list for io-wq to process. I
don't think it's a general issue with punting; this happens for reads
too, for example, and it seems things work fine if we just don't punt
writes. The wrong data is being written, which is why I keep suspecting
some page cache or cache aliasing issues here.
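
For illustration, the flow being described is roughly the following
(pseudocode of the control flow only, not the actual io_uring
implementation):

static void issue_write_sketch(struct io_kiocb *req)
{
	/* first attempt runs nonblocking, from the submitting task */
	int ret = io_write(req, IO_URING_F_NONBLOCK);

	if (ret != -EAGAIN)
		return;		/* completed (or failed) inline */

	/*
	 * -EAGAIN: punt to io-wq via io_queue_iowq() -> io_wq_enqueue();
	 * an io-wq worker later runs io_wq_submit_work() -> io_issue_sqe()
	 * -> io_write() again, this time without IO_URING_F_NONBLOCK so
	 * it is allowed to block.
	 */
	io_queue_iowq(req);
}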

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  6:48                                                                               ` Timothy Pearson
@ 2023-11-10 14:52                                                                                 ` Jens Axboe
  2023-11-11 18:42                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-10 14:52 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 11:48 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 10:35:08 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> Just to go back to basics, can you try this one? It'll do the exact same
>>> retry that io-wq is doing, just from the same task itself. If this
>>> fails, then something core is wrong. I don't think it will, or we'd see
>>> this on other platforms too of course. If this works, then it validates
>>> that it's some oddity on ppc with punting this operation to a thread off
>>> this main task.
>>>
>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>> index 64390d4e20c1..1d760570df04 100644
>>> --- a/io_uring/rw.c
>>> +++ b/io_uring/rw.c
>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>> issue_flags)
>>> 	return IOU_OK;
>>> }
>>>
>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> {
>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>> 	struct io_rw_state __s, *s = &__s;
>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>> issue_flags)
>>> 	return ret;
>>> }
>>>
>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = __io_write(req, issue_flags);
>>> +	if (ret != -EAGAIN)
>>> +		return ret;
>>> +
>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>> +	return ret;
>>> +}
>>> +
>>> void io_rw_fail(struct io_kiocb *req)
>>> {
>>> 	int res;
>>>
>>
>> That does indeed "fix" the corruption issue.
>>
>> Where is the punting actually taking place?  I can see at least one location but
>> if it's a general issue with the punting process I should probably apply any
>> test mitigations to all locations, and I'm not familiar enough with the
>> codebase to be sure I've got them all...
>>
>> Thanks!
> 
> I've been exploring a bunch of other possibilities, and one that has
> been slowly coalescing is whether we're triggering a bug somewhere
> else in the kernel.  Now that I know the io_write call is somehow
> related to this issue, I went back and went over some of the earlier
> logs, and might have found something.
> 
> When I enable KCSAN I sporadically see this type of race:
> 
> [ 1549.152381] ==================================================================
> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
> [ 1549.152609]
> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on cpu 27:
> [ 1549.153193]  dd_has_work+0x160/0x1b0
> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
> [ 1549.153622]  blk_finish_plug+0x60/0xa0
> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
> [ 1549.154249]  io_write+0x4bc/0x900
> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
> [ 1549.154499]  system_call_exception+0x354/0x400
> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
> [ 1549.154651]
> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
> [ 1549.154968]  blk_finish_plug+0x60/0xa0
> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
> [ 1549.155563]  io_write+0x4bc/0x900
> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
> [ 1549.155745]  io_wq_worker+0x224/0x7b0
> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
> [ 1549.155841]
> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
> [ 1549.156032] ==================================================================
> 
> Notably, the io_write calls are in the chain, and there were exactly
> two of these races and two test failures in this test run.
> 
> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.

The above race would just cause hung IO in the IO scheduler; it would
not lead to corruption. The io_write() call would be call_write_iter(),
not sure where you get the other one from?

In any case, when I ran this test case last time, I just used /dev/shm/
as the backing store and it still hit. No io scheduler would be
involved there.

> Do you see similar races on x86 with this workload if you enable KCSAN?

I haven't tried.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10 14:52                                                                                 ` Jens Axboe
@ 2023-11-11 18:42                                                                                   ` Timothy Pearson
  2023-11-11 18:58                                                                                     ` Jens Axboe
  2023-11-11 21:57                                                                                     ` Timothy Pearson
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 18:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Friday, November 10, 2023 8:52:05 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Jens Axboe" <axboe@kernel.dk>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>> retry that io-wq is doing, just from the same task itself. If this
>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>> this on other platforms too of course. If this works, then it validates
>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>> this main task.
>>>>
>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>> index 64390d4e20c1..1d760570df04 100644
>>>> --- a/io_uring/rw.c
>>>> +++ b/io_uring/rw.c
>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>> issue_flags)
>>>> 	return IOU_OK;
>>>> }
>>>>
>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> {
>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>> 	struct io_rw_state __s, *s = &__s;
>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>> issue_flags)
>>>> 	return ret;
>>>> }
>>>>
>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	ret = __io_write(req, issue_flags);
>>>> +	if (ret != -EAGAIN)
>>>> +		return ret;
>>>> +
>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> void io_rw_fail(struct io_kiocb *req)
>>>> {
>>>> 	int res;
>>>>
>>>
>>> That does indeed "fix" the corruption issue.
>>>
>>> Where is the punting actually taking place?  I can see at least one location but
>>> if it's a general issue with the punting process I should probably apply any
>>> test mitigations to all locations, and I'm not familiar enough with the
>>> codebase to be sure I've got them all...
>>>
>>> Thanks!
>> 
>> I've been exploring a bunch of other possibilities, and one that has
>> been slowly coalescing is whether we're triggering a bug somewhere
>> else in the kernel.  Now that I know the io_write call is somehow
>> related to this issue, I went back and went over some of the earlier
>> logs, and might have found something.
>> 
>> When I enable KCSAN I sporadically see this type of race:
>> 
>> [ 1549.152381]
>> ==================================================================
>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>> [ 1549.152609]
>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>> cpu 27:
>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>> [ 1549.154249]  io_write+0x4bc/0x900
>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>> [ 1549.154499]  system_call_exception+0x354/0x400
>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>> [ 1549.154651]
>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>> [ 1549.155563]  io_write+0x4bc/0x900
>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>> [ 1549.155841]
>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>> [ 1549.156032]
>> ==================================================================
>> 
>> Notably, the io_write calls are in the chain, and there were exactly
>> two of these races and two test failures in this test run.
>> 
>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
> 
> The above race would just cause hung IO in the IO scheduler; it would
> not lead to corruption. The io_write() call would be call_write_iter(),
> not sure where you get the other one from?
> 
> In any case, when I ran this test case last time, I just used /dev/shm/
> as the backing store and it still hit. No io scheduler would be
> involved there.

Fair enough.  Was grasping at straws a bit that night.

Quick update on-list, it seems MariaDB uses io_uring for write then tries to go back and do a standard synchronous read.  The data is valid on-disk at some point after the read (i.e. after the process exits, the data is confirmed valid on-disk), but the read itself returns corrupt / stale / garbage data.  MariaDB is the only application I've seen that tries to mix io_uring and standard I/O operations on the same file, and this may be playing into the issues observed.

Investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:42                                                                                   ` Timothy Pearson
@ 2023-11-11 18:58                                                                                     ` Jens Axboe
  2023-11-11 19:04                                                                                       ` Timothy Pearson
  2023-11-11 21:57                                                                                     ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 18:58 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 11:42 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Friday, November 10, 2023 8:52:05 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>> this on other platforms too of course. If this works, then it validates
>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>> this main task.
>>>>>
>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>> --- a/io_uring/rw.c
>>>>> +++ b/io_uring/rw.c
>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return IOU_OK;
>>>>> }
>>>>>
>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> {
>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +{
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags);
>>>>> +	if (ret != -EAGAIN)
>>>>> +		return ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>> {
>>>>> 	int res;
>>>>>
>>>>
>>>> That does indeed "fix" the corruption issue.
>>>>
>>>> Where is the punting actually taking place?  I can see at least one location but
>>>> if it's a general issue with the punting process I should probably apply any
>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>> codebase to be sure I've got them all...
>>>>
>>>> Thanks!
>>>
>>> I've been exploring a bunch of other possibilities, and one that has
>>> been slowly coalescing is whether we're triggering a bug somewhere
>>> else in the kernel.  Now that I know the io_write call is somehow
>>> related to this issue, I went back and went over some of the earlier
>>> logs, and might have found something.
>>>
>>> When I enable KCSAN I sporadically see this type of race:
>>>
>>> [ 1549.152381]
>>> ==================================================================
>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>> [ 1549.152609]
>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>> cpu 27:
>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.154249]  io_write+0x4bc/0x900
>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>> [ 1549.154651]
>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.155563]  io_write+0x4bc/0x900
>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>> [ 1549.155841]
>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>> [ 1549.156032]
>>> ==================================================================
>>>
>>> Notably, the io_write calls are in the chain, and there were exactly
>>> two of these races and two test failures in this test run.
>>>
>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>
>> The above race would just cause hung IO in the IO scheduler; it would
>> not lead to corruption. The io_write() call would be call_write_iter(),
>> not sure where you get the other one from?
>>
>> In any case, when I ran this test case last time, I just used /dev/shm/
>> as the backing store and it still hit. No io scheduler would be
>> involved there.
> 
> Fair enough.  Was grasping at straws a bit that night.
> 
> Quick update on-list, it seems MariaDB uses io_uring for write then
> tries to go back and do a standard synchronous read.  The data is
> valid on-disk at some point after the read (i.e. after the process
> exits, the data is confirmed valid on-disk), but the read itself
> returns corrupt / stale / garbage data.  MariaDB is the only
> application I've seen that tries to mix io_uring and standard I/O
> operations on the same file, and this may be playing into the issues
> observed.

Nope, it's fine to mix and match. You obviously cannot issue a read
(sync or otherwise) before the async write has completed and expect sane
results, and I strongly suspect this is what is going on here...

Either that, or a mix of buffered and O_DIRECT, which is also a recipe
for disaster if you expect consistency.
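
To make that ordering rule concrete, here is a minimal liburing sketch
(hypothetical file name and sizes, not taken from MariaDB): the async
write's CQE must be reaped before a synchronous read of the same range
is meaningful.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char wbuf[4096], rbuf[4096];
	int fd, ret;

	fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	memset(wbuf, 0xaa, sizeof(wbuf));

	/* queue an async 4k write at offset 0 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, wbuf, sizeof(wbuf), 0);
	io_uring_submit(&ring);

	/* reap the write's CQE before touching that range again */
	if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res != (int)sizeof(wbuf))
		return 1;
	io_uring_cqe_seen(&ring, cqe);

	/* only now is a plain synchronous read of the range well-defined */
	ret = pread(fd, rbuf, sizeof(rbuf), 0);
	printf("read %d bytes, match=%d\n", ret,
	       ret == (int)sizeof(wbuf) && !memcmp(wbuf, rbuf, sizeof(rbuf)));

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}

If the read is issued before that CQE is reaped, stale data is a legal
outcome rather than a kernel bug.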

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:58                                                                                     ` Jens Axboe
@ 2023-11-11 19:04                                                                                       ` Timothy Pearson
  2023-11-11 19:11                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 19:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 12:58:03 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>> this main task.
>>>>>>
>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>> --- a/io_uring/rw.c
>>>>>> +++ b/io_uring/rw.c
>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>> issue_flags)
>>>>>> 	return IOU_OK;
>>>>>> }
>>>>>>
>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> {
>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>> issue_flags)
>>>>>> 	return ret;
>>>>>> }
>>>>>>
>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +{
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>> +	if (ret != -EAGAIN)
>>>>>> +		return ret;
>>>>>> +
>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>> +	return ret;
>>>>>> +}
>>>>>> +
>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>> {
>>>>>> 	int res;
>>>>>>
>>>>>
>>>>> That does indeed "fix" the corruption issue.
>>>>>
>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>> if it's a general issue with the punting process I should probably apply any
>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>> codebase to be sure I've got them all...
>>>>>
>>>>> Thanks!
>>>>
>>>> I've been exploring a bunch of other possibilities, and one that has
>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>> related to this issue, I went back and went over some of the earlier
>>>> logs, and might have found something.
>>>>
>>>> When I enable KCSAN I sporadically see this type of race:
>>>>
>>>> [ 1549.152381]
>>>> ==================================================================
>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>> [ 1549.152609]
>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>> cpu 27:
>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>> [ 1549.154651]
>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>> [ 1549.155841]
>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>> [ 1549.156032]
>>>> ==================================================================
>>>>
>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>> two of these races and two test failures in this test run.
>>>>
>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>
>>> The above race would just cause hung IO in the IO scheduler; it would
>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>> not sure where you get the other one from?
>>>
>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>> as the backing store and it still hit. No io scheduler would be
>>> involved there.
>> 
>> Fair enough.  Was grasping at straws a bit that night.
>> 
>> Quick update on-list, it seems MariaDB uses io_uring for write then
>> tries to go back and do a standard synchronous read.  The data is
>> valid on-disk at some point after the read (i.e. after the process
>> exits, the data is confirmed valid on-disk), but the read itself
>> returns corrupt / stale / garbage data.  MariaDB is the only
>> application I've seen that tries to mix io_uring and standard I/O
>> operations on the same file, and this may be playing into the issues
>> observed.
> 
> Nope, it's fine to mix and match. You obviously cannot issue a read
> (sync or otherwise) before the async write has completed and expect sane
> results, and I strongly suspect this is what is going on here...

Yep, agreed it's fine, just that most apps don't do that so we're in potentially less-tested territory. :)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:04                                                                                       ` Timothy Pearson
@ 2023-11-11 19:11                                                                                         ` Jens Axboe
  2023-11-11 19:15                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 19:11 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 12:04 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 12:58:03 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>> this main task.
>>>>>>>
>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>> --- a/io_uring/rw.c
>>>>>>> +++ b/io_uring/rw.c
>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>> issue_flags)
>>>>>>> 	return IOU_OK;
>>>>>>> }
>>>>>>>
>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> {
>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>> issue_flags)
>>>>>>> 	return ret;
>>>>>>> }
>>>>>>>
>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> +{
>>>>>>> +	int ret;
>>>>>>> +
>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>> +	if (ret != -EAGAIN)
>>>>>>> +		return ret;
>>>>>>> +
>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>> +	return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>> {
>>>>>>> 	int res;
>>>>>>>
>>>>>>
>>>>>> That does indeed "fix" the corruption issue.
>>>>>>
>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>> codebase to be sure I've got them all...
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>> related to this issue, I went back and went over some of the earlier
>>>>> logs, and might have found something.
>>>>>
>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>
>>>>> [ 1549.152381]
>>>>> ==================================================================
>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>> [ 1549.152609]
>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>> cpu 27:
>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>> [ 1549.154651]
>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>> [ 1549.155841]
>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>> [ 1549.156032]
>>>>> ==================================================================
>>>>>
>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>> two of these races and two test failures in this test run.
>>>>>
>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>
>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>> not sure where you get the other one from?
>>>>
>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>> as the backing store and it still hit. No io scheduler would be
>>>> involved there.
>>>
>>> Fair enough.  Was grasping at straws a bit that night.
>>>
>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>> tries to go back and do a standard synchronous read.  The data is
>>> valid on-disk at some point after the read (i.e. after the process
>>> exits, the data is confirmed valid on-disk), but the read itself
>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>> application I've seen that tries to mix io_uring and standard I/O
>>> operations on the same file, and this may be playing into the issues
>>> observed.
>>
>> Nope, it's fine to mix and match. You obviously cannot issue a read
>> (sync or otherwise) before the async write has completed and expect sane
>> results, and I strongly suspect this is what is going on here...
> 
> Yep, agreed it's fine, just that most apps don't do that so we're in
> potentially less-tested territory. :)

I don't think this is true. If you're doing buffered IO, your
synchronization is the page cache. This is 100% true regardless of
whether you use read/pread or io_uring to do the read, only difference
is the delivery mechanism for the read. But if you do:

threadA				threadB
start via write(2)
				io_uring_enter()
					submit_read()
write(2) completes

or

io_uring_enter()
	submit_write()
				read(2)
	write completes

then it's obviously broken. This has nothing to do with io_uring, which
uses the page cache for IO just like any other buffered IO syscall that
would do IO, sync or async.

If I were to guess, mariadb considers the write stable when it's been
submitted. If the read, sync or async, is submitted right after that,
then it would be completely valid to return stale data as the write
isn't done yet. You're at the mercy of timing at that point, which may
be why this shows up as a regression from 5.10.158 to 5.10.162, as
timing likely changed with the switch from kthread to io-wq native
workers.
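
To make the required ordering concrete, here's a minimal liburing-style
sketch (placeholder file name and sizes, error handling trimmed, and not
taken from mariadb's code): the only safe pattern is to reap the write
CQE before issuing the read.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char wbuf[4096], rbuf[4096];
	int fd;

	memset(wbuf, 0xab, sizeof(wbuf));
	fd = open("testfile", O_RDWR | O_CREAT, 0644);	/* placeholder path */
	io_uring_queue_init(8, &ring, 0);

	/* queue and submit the async write */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, wbuf, sizeof(wbuf), 0);
	io_uring_submit(&ring);

	/* the write is only known complete once its CQE has been reaped */
	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res != (int) sizeof(wbuf))
		fprintf(stderr, "write failed or was short: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	/* only now is a sync read guaranteed to observe the new data */
	pread(fd, rbuf, sizeof(rbuf), 0);

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}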

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:11                                                                                         ` Jens Axboe
@ 2023-11-11 19:15                                                                                           ` Timothy Pearson
  2023-11-11 19:23                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 19:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 1:11:14 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/11/23 12:04 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Saturday, November 11, 2023 12:58:03 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>>> this main task.
>>>>>>>>
>>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>>> --- a/io_uring/rw.c
>>>>>>>> +++ b/io_uring/rw.c
>>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>>> issue_flags)
>>>>>>>> 	return IOU_OK;
>>>>>>>> }
>>>>>>>>
>>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> {
>>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>>> issue_flags)
>>>>>>>> 	return ret;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> +{
>>>>>>>> +	int ret;
>>>>>>>> +
>>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>>> +	if (ret != -EAGAIN)
>>>>>>>> +		return ret;
>>>>>>>> +
>>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>>> {
>>>>>>>> 	int res;
>>>>>>>>
>>>>>>>
>>>>>>> That does indeed "fix" the corruption issue.
>>>>>>>
>>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>>> codebase to be sure I've got them all...
>>>>>>>
>>>>>>> Thanks!
>>>>>>
>>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>>> related to this issue, I went back and went over some of the earlier
>>>>>> logs, and might have found something.
>>>>>>
>>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>>
>>>>>> [ 1549.152381]
>>>>>> ==================================================================
>>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>>> [ 1549.152609]
>>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>>> cpu 27:
>>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>>> [ 1549.154651]
>>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>>> [ 1549.155841]
>>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>>> [ 1549.156032]
>>>>>> ==================================================================
>>>>>>
>>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>>> two of these races and two test failures in this test run.
>>>>>>
>>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>>
>>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>>> not sure where you get the other one from?
>>>>>
>>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>>> as the backing store and it still hit. No io scheduler would be
>>>>> involved there.
>>>>
>>>> Fair enough.  Was grasping at straws a bit that night.
>>>>
>>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>>> tries to go back and do a standard synchronous read.  The data is
>>>> valid on-disk at some point after the read (i.e. after the process
>>>> exits, the data is confirmed valid on-disk), but the read itself
>>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>>> application I've seen that tries to mix io_uring and standard I/O
>>>> operations on the same file, and this may be playing into the issues
>>>> observed.
>>>
>>> Nope, it's fine to mix and match. You obviously cannot issue a read
>>> (sync or otherwise) before the async write has completed and expect sane
>>> results, and I strongly suspect this is what is going on here...
>> 
>> Yep, agreed it's fine, just that most apps don't do that so we're in
>> potentially less-tested territory. :)
> 
> I don't think this is true. If you're doing buffered IO, your
> synchronization is the page cache. This is 100% true regardless of
> whether you use read/pread or io_uring to do the read, only difference
> is the delivery mechanism for the read. But if you do:
> 
> threadA				threadB
> start via write(2)
>				io_uring_enter()
>					submit_read()
> write(2) completes
> 
> or
> 
> io_uring_enter()
>	submit_write()
>				read(2)
>	write completes
> 
> then it's obviously broken.

Agreed.

> This has nothing to do with io_uring, which
> uses the page cache for IO just like any other buffered IO syscall that
> would do IO, sync or async.

Understood.  At this point I'm much more familiar with the io_uring write path than with the read path, having been up and down the former over the past few days.

> If I were to guess, mariadb considers the write stable when it's been
> submitted. If the read, sync or async, is submitted right after that,
> then it would be completely valid to return stale data as the write
> isn't done yet. You're at the mercy of timing at that point, which may
> be why this shows up as a regression from 5.10.158 to 5.10.162, as
> timing likely changed with the switch from kthread to io-wq native
> workers.

That's something I need to figure out, and also why it only seems to hit ppc64 (though that could just be ppc64 being more likely than other arches to trigger the race due to timing or similar).  From what I can tell MariaDB does try to do an fsync() before the read, but if I understand correctly that won't do much if the io_uring writes haven't actually completed first...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:15                                                                                           ` Timothy Pearson
@ 2023-11-11 19:23                                                                                             ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 19:23 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 12:15 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 1:11:14 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/11/23 12:04 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Saturday, November 11, 2023 12:58:03 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>>>> this main task.
>>>>>>>>>
>>>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>>>> --- a/io_uring/rw.c
>>>>>>>>> +++ b/io_uring/rw.c
>>>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>>>> issue_flags)
>>>>>>>>> 	return IOU_OK;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> {
>>>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>>>> issue_flags)
>>>>>>>>> 	return ret;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> +{
>>>>>>>>> +	int ret;
>>>>>>>>> +
>>>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>>>> +	if (ret != -EAGAIN)
>>>>>>>>> +		return ret;
>>>>>>>>> +
>>>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>>>> +	return ret;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>>>> {
>>>>>>>>> 	int res;
>>>>>>>>>
>>>>>>>>
>>>>>>>> That does indeed "fix" the corruption issue.
>>>>>>>>
>>>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>>>> codebase to be sure I've got them all...
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>
>>>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>>>> related to this issue, I went back and went over some of the earlier
>>>>>>> logs, and might have found something.
>>>>>>>
>>>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>>>
>>>>>>> [ 1549.152381]
>>>>>>> ==================================================================
>>>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>>>> [ 1549.152609]
>>>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>>>> cpu 27:
>>>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>>>> [ 1549.154651]
>>>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>>>> [ 1549.155841]
>>>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>>>> [ 1549.156032]
>>>>>>> ==================================================================
>>>>>>>
>>>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>>>> two of these races and two test failures in this test run.
>>>>>>>
>>>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>>>
>>>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>>>> not sure where you get the other one from?
>>>>>>
>>>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>>>> as the backing store and it still hit. No io scheduler would be
>>>>>> involved there.
>>>>>
>>>>> Fair enough.  Was grasping at straws a bit that night.
>>>>>
>>>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>>>> tries to go back and do a standard synchronous read.  The data is
>>>>> valid on-disk at some point after the read (i.e. after the process
>>>>> exits, the data is confirmed valid on-disk), but the read itself
>>>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>>>> application I've seen that tries to mix io_uring and standard I/O
>>>>> operations on the same file, and this may be playing into the issues
>>>>> observed.
>>>>
>>>> Nope, it's fine to mix and match. You obviously cannot issue a read
>>>> (sync or otherwise) before the async write has completed and expect sane
>>>> results, and I strongly suspect this is what is going on here...
>>>
>>> Yep, agreed it's fine, just that most apps don't do that so we're in
>>> potentially less-tested territory. :)
>>
>> I don't think this is true. If you're doing buffered IO, your
>> synchronization is the page cache. This is 100% true regardless of
>> whether you use read/pread or io_uring to do the read, only difference
>> is the delivery mechanism for the read. But if you do:
>>
>> threadA				threadB
>> start via write(2)
>> 				io_uring_enter()
>> 					submit_read()
>> write(2) completes
>>
>> or
>>
>> io_uring_enter()
>> 	submit_write()
>> 				read(2)
>> 	write completes
>>
>> then it's obviously broken.
> 
> Agreed.
> 
>> This has nothing to do with io_uring, which
>> uses the page cache for IO just like any other buffered IO syscall that
>> would do IO, sync or async.
> 
> Understood.  At this point I'm much more familiar with the io_uring
> write path than with the read path, having been up and down the former
> over the past few days.

There's not much difference between them, actually. The write side will
call ->write_iter() to do the write, and the read side ->read_iter().
The only real difference is that the read side has some logic for
buffered reads, where it relies on read-ahead to kick off IO for pages
and get a callback via io_async_buf_func() when the pages are unlocked
(eg IO to them is done).

>> If I were to guess, mariadb considers the write stable when it's been
>> submitted. If the read, sync or async, is submitted right after that,
>> then it would be completely valid to return stale data as the write
>> isn't done yet. You're at the mercy of timing at that point, which may
>> be why this shows up as a regression from 5.10.158 to 5.10.162, as
>> timing likely changed with the switch from kthread to io-wq native
>> workers.
> 
> That's something I need to figure out, and also why it only seems to
> hit ppc64 (though that could just be ppc64 being more likely than
> other arches to trigger the race due to timing or similar).
> From what I can tell MariaDB does try to do an fsync() before the read,
> but if I understand correctly that won't do much if the io_uring
> writes haven't actually completed first...

It could just be timing which is causing it on ppc64, who knows. Maybe
once we fully understand the issue it'll become clear!

The fsync() may be fine as long as the write has actually been started. But
that is not guaranteed if mariadb assumes the write has started simply because
io_uring_enter() has submitted it. IOW, you could see:

threadA				threadB
io_uring_enter()
	submit_write(fd)
		queue io-wq
io_uring_enter() returns
				fsync(fd) <- does nothing
io-wq
	submit_write(fd)
	write completes

It again boils down to when to consider the write completed, and that is
when the CQE is visible for it. No assumptions should be made about the
write start before that, not even that it may have been started after
io_uring_enter() returns. It may very well be started and even complete
at that point, but the application has no way of knowing that. When the
CQE is posted for the write, then it knows for a fact that the write is
done.
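
If the goal is to keep the fsync() without waiting for the write CQE in
between, one way (a sketch below with assumed names and trimmed error
handling; I have no idea if this maps onto what mariadb actually does) is
to link the fsync behind the write with IOSQE_IO_LINK, so it cannot start
until the write has completed:

#include <liburing.h>

/* Sketch: queue a write and an fsync as a linked pair so the fsync cannot
 * run before the write completes. ring/fd/buf are assumed to be set up by
 * the caller. */
static int write_then_fsync(struct io_uring *ring, int fd,
			    const void *buf, unsigned len, __u64 off)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int i, ret;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;	/* fsync below only starts after this write completes */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, fd, 0);

	ret = io_uring_submit(ring);
	if (ret < 0)
		return ret;

	/* the data is only known stable once both CQEs have been reaped */
	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		ret = cqe->res < 0 ? cqe->res : 0;
		io_uring_cqe_seen(ring, cqe);
		if (ret < 0)
			return ret;
	}
	return 0;
}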

-- 
Jens Axboe
		

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:42                                                                                   ` Timothy Pearson
  2023-11-11 18:58                                                                                     ` Jens Axboe
@ 2023-11-11 21:57                                                                                     ` Timothy Pearson
  2023-11-13 17:06                                                                                       ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 21:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 12:42:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Friday, November 10, 2023 8:52:05 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>> this on other platforms too of course. If this works, then it validates
>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>> this main task.
>>>>>
>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>> --- a/io_uring/rw.c
>>>>> +++ b/io_uring/rw.c
>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return IOU_OK;
>>>>> }
>>>>>
>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> {
>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +{
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags);
>>>>> +	if (ret != -EAGAIN)
>>>>> +		return ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>> {
>>>>> 	int res;
>>>>>
>>>>
>>>> That does indeed "fix" the corruption issue.
>>>>
>>>> Where is the punting actually taking place?  I can see at least one location but
>>>> if it's a general issue with the punting process I should probably apply any
>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>> codebase to be sure I've got them all...
>>>>
>>>> Thanks!
>>> 
>>> I've been exploring a bunch of other possibilities, and one that has
>>> been slowly coalescing is whether we're triggering a bug somewhere
>>> else in the kernel.  Now that I know the io_write call is somehow
>>> related to this issue, I went back and went over some of the earlier
>>> logs, and might have found something.
>>> 
>>> When I enable KCSAN I sporadically see this type of race:
>>> 
>>> [ 1549.152381]
>>> ==================================================================
>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>> [ 1549.152609]
>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>> cpu 27:
>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.154249]  io_write+0x4bc/0x900
>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>> [ 1549.154651]
>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.155563]  io_write+0x4bc/0x900
>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>> [ 1549.155841]
>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>> [ 1549.156032]
>>> ==================================================================
>>> 
>>> Notably, the io_write calls are in the chain, and there were exactly
>>> two of these races and two test failures in this test run.
>>> 
>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>> 
>> The above race would just cause hung IO in the IO scheduler, it would
>> not lead to corruption. The io_write() call would be call_write_iter(),
>> not sure where you get the other one from?
>> 
>> In any case, when I ran this test case last time, I just used /dev/shm/
>> as the backing store and it still hit. No io scheduler would be
>> involved there.
> 
> Fair enough.  Was grasping at straws a bit that night.
> 
> Quick update on-list, it seems MariaDB uses io_uring for write then tries to go
> back and do a standard synchronous read.  The data is valid on-disk at some
> point after the read (i.e. after the process exits, the data is confirmed valid
> on-disk), but the read itself returns corrupt / stale / garbage data.  MariaDB
> is the only application I've seen that tries to mix io_uring and standard I/O
> operations on the same file, and this may be playing into the issues observed.
> 
> Investigation continues...

Unfortunately I got led down a rabbit hole here.  With the tests I was running, MariaDB writes the encrypted data separately from the normal un-encrypted page header and checksums, and the internal encrypted data on disk was corrupt while the outer page checksum was still valid.

I have since switched to the main.xa_prepared_binlog_off test, which shows the corruption more easily in the on-disk format.  We are still apparently dealing with a write path issue, which makes more sense given the nature of the corruption observed on production systems.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 21:57                                                                                     ` Timothy Pearson
@ 2023-11-13 17:06                                                                                       ` Timothy Pearson
  2023-11-13 17:39                                                                                         ` Jens Axboe
  2023-11-13 20:47                                                                                         ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 17:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov

----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 3:57:23 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
> MariaDB writes the encrypted data separately from the normal un-encrypted page
> header and checksums, and the internal encrypted data on disk was corrupt while
> the outer page checksum was still valid.
> 
> I have since switched to the main.xa_prepared_binlog_off test, which shows the
> corruption more easily in the on-disk format.  We are still apparently dealing
> with a write path issue, which makes more sense given the nature of the
> corruption observed on production systems.

Quick status update -- after considerable effort applied I've managed to narrow down what is going wrong, but still need to locate the root cause.  The bug is incredibly timing-dependent, therefore it is difficult to instrument the code paths I need without causing it to disappear.

What we're dealing with is a wild write to RAM of some sort, provoked by the exact timing of some of the encryption tests in mariadb.  I've caught the wild write a few times now, it is not in the standard io_uring write path but instead appears to be triggered (somehow) by the io worker punting process.

When the bug is hit, and if all other conditions are exactly correct, *something* (still to be identified) writes 32 bytes of gibberish into one of the mariadb in-RAM database pages at a random offset.  This wild write occurs right before the page is encrypted for write to disk via io_uring.  I have confirmed that the post-encryption data in RAM is written to disk without any additional corruption, and is then read back out from disk into the page verification routine also without any additional corruption.  The page verification routine decrypts the data from disk, thus restoring the decrypted data that contains the wild write data stamped somewhere on it, where we then hit the corruption warning and halt the test run.

Irritatingly, if I try to instrument the data flow in the application right before the encryption routine, the bug disappears (or, more precisely, is masked).  If I had to guess from these symptoms, I'd suspect the application io worker thread is waking up, grabbing wrong context from somewhere, and scribbling some kind of status data into memory, which rarely ends up being on top of one of the in-RAM database pages.  This could be an application issue or a kernel issue, I'm not sure yet, but given the precise timing requirements I'm less and less surprised this is only showing on ppc64 right now.
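
For illustration, the sort of minimal check I mean is sketched below (purely a sketch with placeholder names, not the actual instrumentation or real MariaDB symbols); anything much heavier than this shifts the timing enough to mask the bug:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Sketch only: checksum the page buffer when it is queued for flushing and
 * re-check immediately before the encryption-for-write step; a mismatch
 * means something scribbled on the buffer in that window. The page/page_size
 * names are placeholders, not actual MariaDB buffer pool symbols. */
static uint32_t page_canary(const unsigned char *page, size_t page_size)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < page_size; i++)
		sum = (sum * 31) + page[i];	/* trivial rolling hash, not a real CRC */
	return sum;
}

static void check_canary(const unsigned char *page, size_t page_size,
			 uint32_t expected)
{
	if (page_canary(page, page_size) != expected)
		fprintf(stderr, "page modified between queueing and encryption\n");
}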

As always, investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:06                                                                                       ` Timothy Pearson
@ 2023-11-13 17:39                                                                                         ` Jens Axboe
  2023-11-13 19:02                                                                                           ` Timothy Pearson
  2023-11-13 20:47                                                                                         ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 17:39 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 3:57:23 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>> header and checksums, and the internal encrypted data on disk was corrupt while
>> the outer page checksum was still valid.
>>
>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>> corruption more easily in the on-disk format.  We are still apparently dealing
>> with a write path issue, which makes more sense given the nature of the
>> corruption observed on production systems.
> 
> Quick status update -- after considerable effort applied I've managed
> to narrow down what is going wrong, but still need to locate the root
> cause.  The bug is incredibly timing-dependent, therefore it is
> difficult to instrument the code paths I need without causing it to
> disappear.
> 
> What we're dealing with is a wild write to RAM of some sort, provoked
> by the exact timing of some of the encryption tests in mariadb.  I've
> caught the wild write a few times now, it is not in the standard
> io_uring write path but instead appears to be triggered (somehow) by
> the io worker punting process.
> 
> When the bug is hit, and if all other conditions are exactly correct,
> *something* (still to be identified) writes 32 bytes of gibberish into
> one of the mariadb in-RAM database pages at a random offset.  This
> wild write occurs right before the page is encrypted for write to disk
> via io_uring.  I have confirmed that the post-encryption data in RAM
> is written to disk without any additional corruption, and is then read
> back out from disk into the page verification routine also without any
> additional corruption.  The page verification routine decrypts the
> data from disk, thus restoring the decrypted data that contains the
> wild write data stamped somewhere on it, where we then hit the
> corruption warning and halt the test run.
> 
> Irritatingly, if I try to instrument the data flow in the application
> right before the encryption routine, the bug disappears (or, more
> precisely, is masked).  If I had to guess from these symptoms, I'd
> suspect the application io worker thread is waking up, grabbing wrong
> context from somewhere, and scribbling some kind of status data into
> memory, which rarely ends up being on top of one of the in-RAM
> database pages.  This could be an application issue or a kernel issue,
> I'm not sure yet, but given the precise timing requirements I'm less
> and less surprised this is only showing on ppc64 right now.
> 
> As always, investigation continues...

I wonder if this has to do with copy_thread() on powerpc - so not
necessarily ppc memory ordering related, but just something in the arch
specific copy section.

I took a look back, and the initial change actually forgot ppc. Since
then, there's been an attempt to make this generic:

commit 5bd2e97c868a8a44470950ed01846cab6328e540
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Apr 12 10:18:48 2022 -0500

    fork: Generalize PF_IO_WORKER handling

and later a powerpc change related to that too:

commit eed7c420aac7fde5e5915d2747c3ebbbda225835
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Sat Mar 25 22:29:01 2023 +1000

    powerpc: copy_thread differentiate kthreads and user mode threads

Just stabbing in the dark a bit here as I won't pretend to understand
the finer details of powerpc thread creation, but maybe try with this
and see if it makes any difference.

As you note in your reply, we could very well be corrupting some bytes
somewhere every time. We just only notice quickly when it happens to be
in that specific buffer.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..d4dec2fd091c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 
 	klp_init_thread_info(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 
 		/* Create initial minimum stack frame. */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:39                                                                                         ` Jens Axboe
@ 2023-11-13 19:02                                                                                           ` Timothy Pearson
  2023-11-13 19:29                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 19:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 11:39:30 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>> "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>> the outer page checksum was still valid.
>>>
>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>> with a write path issue, which makes more sense given the nature of the
>>> corruption observed on production systems.
>> 
>> Quick status update -- after considerable effort applied I've managed
>> to narrow down what is going wrong, but still need to locate the root
>> cause.  The bug is incredibly timing-dependent, therefore it is
>> difficult to instrument the code paths I need without causing it to
>> disappear.
>> 
>> What we're dealing with is a wild write to RAM of some sort, provoked
>> by the exact timing of some of the encryption tests in mariadb.  I've
>> caught the wild write a few times now, it is not in the standard
>> io_uring write path but instead appears to be triggered (somehow) by
>> the io worker punting process.
>> 
>> When the bug is hit, and if all other conditions are exactly correct,
>> *something* (still to be identified) writes 32 bytes of gibberish into
>> one of the mariadb in-RAM database pages at a random offset.  This
>> wild write occurs right before the page is encrypted for write to disk
>> via io_uring.  I have confirmed that the post-encryption data in RAM
>> is written to disk without any additional corruption, and is then read
>> back out from disk into the page verification routine also without any
>> additional corruption.  The page verification routine decrypts the
>> data from disk, thus restoring the decrypted data that contains the
>> wild write data stamped somewhere on it, where we then hit the
>> corruption warning and halt the test run.
>> 
>> Irritatingly, if I try to instrument the data flow in the application
>> right before the encryption routine, the bug disappears (or, more
>> precisely, is masked).  If I had to guess from these symptoms, I'd
>> suspect the application io worker thread is waking up, grabbing wrong
>> context from somewhere, and scribbling some kind of status data into
>> memory, which rarely ends up being on top of one of the in-RAM
>> database pages.  This could be an application issue or a kernel issue,
>> I'm not sure yet, but given the precise timing requirements I'm less
>> and less surprised this is only showing on ppc64 right now.
>> 
>> As always, investigation continues...
> 
> I wonder if this has to do with copy_thread() on powerpc - so not
> necessarily ppc memory ordering related, but just something in the arch
> specific copy section.
> 
> I took a look back, and the initial change actually forgot ppc. Since
> then, there's been an attempt to make this generic:
> 
> commit 5bd2e97c868a8a44470950ed01846cab6328e540
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Tue Apr 12 10:18:48 2022 -0500
> 
>    fork: Generalize PF_IO_WORKER handling
> 
> and later a powerpc change related to that too:
> 
> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Sat Mar 25 22:29:01 2023 +1000
> 
>    powerpc: copy_thread differentiate kthreads and user mode threads
> 
> Just stabbing in the dark a bit here as I won't pretend to understand
> the finer details of powerpc thread creation, but maybe try with this
> and see if it makes any difference.
> 
> As you note in your reply, we could very well be corrupting some bytes
> somewhere every time. We just only notice quickly when it happens to be
> in that specific buffer.
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 392404688cec..d4dec2fd091c 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
> kernel_clone_args *args)
> 
> 	klp_init_thread_info(p);
> 
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> 		/* kernel thread */
> 
> 		/* Create initial minimum stack frame. */

Good idea, but didn't work unfortunately.  Any other suggestions welcome while I continue to debug...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 19:02                                                                                           ` Timothy Pearson
@ 2023-11-13 19:29                                                                                             ` Jens Axboe
  2023-11-13 20:58                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 19:29 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 12:02 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 11:39:30 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>> "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>> the outer page checksum was still valid.
>>>>
>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>> with a write path issue, which makes more sense given the nature of the
>>>> corruption observed on production systems.
>>>
>>> Quick status update -- after considerable effort applied I've managed
>>> to narrow down what is going wrong, but still need to locate the root
>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>> difficult to instrument the code paths I need without causing it to
>>> disappear.
>>>
>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>> caught the wild write a few times now, it is not in the standard
>>> io_uring write path but instead appears to be triggered (somehow) by
>>> the io worker punting process.
>>>
>>> When the bug is hit, and if all other conditions are exactly correct,
>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>> one of the mariadb in-RAM database pages at a random offset.  This
>>> wild write occurs right before the page is encrypted for write to disk
>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>> is written to disk without any additional corruption, and is then read
>>> back out from disk into the page verification routine also without any
>>> additional corruption.  The page verification routine decrypts the
>>> data from disk, thus restoring the decrypted data that contains the
>>> wild write data stamped somewhere on it, where we then hit the
>>> corruption warning and halt the test run.
>>>
>>> Irritatingly, if I try to instrument the data flow in the application
>>> right before the encryption routine, the bug disappears (or, more
>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>> suspect the application io worker thread is waking up, grabbing wrong
>>> context from somewhere, and scribbling some kind of status data into
>>> memory, which rarely ends up being on top of one of the in-RAM
>>> database pages.  This could be an application issue or a kernel issue,
>>> I'm not sure yet, but given the precise timing requirements I'm less
>>> and less surprised this is only showing on ppc64 right now.
>>>
>>> As always, investigation continues...
>>
>> I wonder if this has to do with copy_thread() on powerpc - so not
>> necessarily ppc memory ordering related, but just something in the arch
>> specific copy section.
>>
>> I took a look back, and the initial change actually forgot ppc. Since
>> then, there's been an attempt to make this generic:
>>
>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>> Author: Eric W. Biederman <ebiederm@xmission.com>
>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>
>>    fork: Generalize PF_IO_WORKER handling
>>
>> and later a powerpc change related to that too:
>>
>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>> Author: Nicholas Piggin <npiggin@gmail.com>
>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>
>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>
>> Just stabbing in the dark a bit here as I won't pretend to understand
>> the finer details of powerpc thread creation, but maybe try with this
>> and see if it makes any difference.
>>
>> As you note in your reply, we could very well be corrupting some bytes
>> somewhere every time. We just only notice quickly when it happens to be
>> in that specific buffer.
>>
>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>> index 392404688cec..d4dec2fd091c 100644
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>> kernel_clone_args *args)
>>
>> 	klp_init_thread_info(p);
>>
>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>> 		/* kernel thread */
>>
>> 		/* Create initial minimum stack frame. */
> 
> Good idea, but didn't work unfortunately.  Any other suggestions
> welcome while I continue to debug...

I ponder if it's still in there... I don't see what else could be poking
and causing user memory corruption.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:06                                                                                       ` Timothy Pearson
  2023-11-13 17:39                                                                                         ` Jens Axboe
@ 2023-11-13 20:47                                                                                         ` Jens Axboe
  2023-11-13 21:08                                                                                           ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 20:47 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 AM, Timothy Pearson wrote:
> When the bug is hit, and if all other conditions are exactly correct,
> *something* (still to be identified) writes 32 bytes of gibberish into
> one of the mariadb in-RAM database pages at a random offset.

Do you have one or more examples of what those 32 bytes look like?
Ideally more than one. Would be nice if we could deduce something from
the content there. A long shot, but...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 19:29                                                                                             ` Jens Axboe
@ 2023-11-13 20:58                                                                                               ` Timothy Pearson
  2023-11-13 21:22                                                                                                 ` Timothy Pearson
  2023-11-13 22:15                                                                                                 ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 20:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 1:29:38 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>> "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>> the outer page checksum was still valid.
>>>>>
>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>> with a write path issue, which makes more sense given the nature of the
>>>>> corruption observed on production systems.
>>>>
>>>> Quick status update -- after considerable effort applied I've managed
>>>> to narrow down what is going wrong, but still need to locate the root
>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>> difficult to instrument the code paths I need without causing it to
>>>> disappear.
>>>>
>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>> caught the wild write a few times now, it is not in the standard
>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>> the io worker punting process.
>>>>
>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>> wild write occurs right before the page is encrypted for write to disk
>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>> is written to disk without any additional corruption, and is then read
>>>> back out from disk into the page verification routine also without any
>>>> additional corruption.  The page verification routine decrypts the
>>>> data from disk, thus restoring the decrypted data that contains the
>>>> wild write data stamped somewhere on it, where we then hit the
>>>> corruption warning and halt the test run.
>>>>
>>>> Irritatingly, if I try to instrument the data flow in the application
>>>> right before the encryption routine, the bug disappears (or, more
>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>> context from somewhere, and scribbling some kind of status data into
>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>> database pages.  This could be an application issue or a kernel issue,
>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>> and less surprised this is only showing on ppc64 right now.
>>>>
>>>> As always, investigation continues...
>>>
>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>> necessarily ppc memory ordering related, but just something in the arch
>>> specific copy section.
>>>
>>> I took a look back, and the initial change actually forgot ppc. Since
>>> then, there's been an attempt to make this generic:
>>>
>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>
>>>    fork: Generalize PF_IO_WORKER handling
>>>
>>> and later a powerpc change related to that too:
>>>
>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>
>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>
>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>> the finer details of powerpc thread creation, but maybe try with this
>>> and see if it makes any difference.
>>>
>>> As you note in your reply, we could very well be corrupting some bytes
>>> somewhere every time. We just only notice quickly when it happens to be
>>> in that specific buffer.
>>>
>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>> index 392404688cec..d4dec2fd091c 100644
>>> --- a/arch/powerpc/kernel/process.c
>>> +++ b/arch/powerpc/kernel/process.c
>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>> kernel_clone_args *args)
>>>
>>> 	klp_init_thread_info(p);
>>>
>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>> 		/* kernel thread */
>>>
>>> 		/* Create initial minimum stack frame. */
>> 
>> Good idea, but didn't work unfortunately.  Any other suggestions
>> welcome while I continue to debug...
> 
> I ponder if it's still in there... I don't see what else could be poking
> and causing user memory corruption.

Indeed, I'm really scratching my head on this one.  The corruption is definitely present, and quite sensitive to exact timing around page encryption start -- presumably if the wild write happens after the encryption routine finishes, it no longer matters for this particular test suite.

Trying to find sensitive areas in the kernel, I hacked it up to always punt at least once per write -- no change in how often the corruption occurred.  I also hacked it up to try to keep I/O workers around vs. constantly tearing them down and respawning them, with no real change observed in corruption frequency, possibly because even with that we still end up creating a new I/O worker every so often.

What did have a major effect was hacking the kernel to both punt at least once per write *and* to aggressively exit I/O worker threads, indicating that something in thread setup or teardown is stomping on memory.  When I say "aggressively exit I/O worker threads", I basically did this:

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..84bfb8b9f068 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -677,6 +688,7 @@ static int io_wq_worker(void *data)
                        exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
                                                        wq->cpu_mask);
                }
+last_timeout = true;
        }

That single change made it fail on the first or second pass vs somewhere around the 10th pass.

I do note we have our own io_uring-specific thread clone function, create_io_thread(), and I wonder if that clone function does something functionally different on ppc64 than the regular clone function?  Need to dig further...
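
For reference, create_io_thread() boils down to roughly this (paraphrasing kernel/fork.c, so the exact fields may differ from tree to tree):

struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
{
        unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                              CLONE_THREAD | CLONE_IO;
        struct kernel_clone_args args = {
                .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                    CLONE_UNTRACED) & ~CSIGNAL),
                .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                .fn             = fn,           /* io_wq_worker() for io-wq workers */
                .fn_arg         = arg,
                .io_thread      = 1,            /* copy_process() sets PF_IO_WORKER from this */
        };

        return copy_process(NULL, 0, node, &args);
}

The key bits are CLONE_VM -- the worker shares the submitting task's address space, so a stray write through a user address lands straight in application memory -- and io_thread = 1, which is what sets PF_IO_WORKER and steers copy_thread() down the arch-specific path discussed above.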

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:47                                                                                         ` Jens Axboe
@ 2023-11-13 21:08                                                                                           ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 21:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 2:47:13 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>> When the bug is hit, and if all other conditions are exactly correct,
>> *something* (still to be identified) writes 32 bytes of gibberish into
>> one of the mariadb in-RAM database pages at a random offset.
> 
> Do you have one or more examples of what those 32 bytes look like?
> Ideally more than one. Would be nice if we could deduce something from
> the content there. A long shot, but...

Assuming we're on byte boundaries, here are a few examples.  The problem is that, since the offset is random, it's hard to know for certain how many zeroes start (or end) the block, so I've trimmed the zeroes off both ends of these three examples, with the exception of a leading zero needed to get to a byte boundary:

45713881df1783472b45a32d373f59b6
3063d0146087fdfd7d03cc2ff3523588
0fbfcd0b37bc3b861c5bd51c4f0f1365

And I just realized I misspoke: it's 16 bytes (32 nibbles).  They don't really look like pointers, and they don't correlate with anything on disk.  I had already tried to figure out what this data is without much success; it's not even program code...
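
One way to make "don't really look like pointers" concrete: interpret the 16 bytes as two 64-bit words and check whether either one falls in a plausible address range, i.e. a ppc64 kernel linear-map address starting with 0xc..., or a typical below-2^47 userspace address.  A purely illustrative userspace sketch:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: does a 16-byte blob decode into two plausible addresses? */
static bool looks_like_pointers(const unsigned char blob[16])
{
        uint64_t w[2];
        int i;

        memcpy(w, blob, sizeof(w));     /* host byte order */
        for (i = 0; i < 2; i++) {
                bool kernelish = (w[i] >> 60) == 0xc;          /* 0xc000... linear map */
                bool userish = w[i] && (w[i] >> 47) == 0;      /* below the typical 47-bit user VA span */
                if (kernelish || userish)
                        return true;
        }
        return false;
}

None of the three samples above pass a check like this, at least at this alignment.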

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:58                                                                                               ` Timothy Pearson
@ 2023-11-13 21:22                                                                                                 ` Timothy Pearson
  2023-11-13 22:15                                                                                                 ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 21:22 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 2:58:30 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 1:29:38 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>> "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>> the outer page checksum was still valid.
>>>>>>
>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>> corruption observed on production systems.
>>>>>
>>>>> Quick status update -- after considerable effort applied I've managed
>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>> difficult to instrument the code paths I need without causing it to
>>>>> disappear.
>>>>>
>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>> caught the wild write a few times now, it is not in the standard
>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>> the io worker punting process.
>>>>>
>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>> is written to disk without any additional corruption, and is then read
>>>>> back out from disk into the page verification routine also without any
>>>>> additional corruption.  The page verification routine decrypts the
>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>> corruption warning and halt the test run.
>>>>>
>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>> right before the encryption routine, the bug disappears (or, more
>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>> context from somewhere, and scribbling some kind of status data into
>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>
>>>>> As always, investigation continues...
>>>>
>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>> necessarily ppc memory ordering related, but just something in the arch
>>>> specific copy section.
>>>>
>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>> then, there's been an attempt to make this generic:
>>>>
>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>
>>>>    fork: Generalize PF_IO_WORKER handling
>>>>
>>>> and later a powerpc change related to that too:
>>>>
>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>
>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>
>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>> the finer details of powerpc thread creation, but maybe try with this
>>>> and see if it makes any difference.
>>>>
>>>> As you note in your reply, we could very well be corrupting some bytes
>>>> somewhere every time. We just only notice quickly when it happens to be
>>>> in that specific buffer.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>> index 392404688cec..d4dec2fd091c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>> kernel_clone_args *args)
>>>>
>>>> 	klp_init_thread_info(p);
>>>>
>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>> 		/* kernel thread */
>>>>
>>>> 		/* Create initial minimum stack frame. */
>>> 
>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>> welcome while I continue to debug...
>> 
>> I ponder if it's still in there... I don't see what else could be poking
>> and causing user memory corruption.
> 
> Indeed, I'm really scratching my head on this one.  The corruption is definitely
> present, and quite sensitive to exact timing around page encryption start --
> presumably if the wild write happens after the encryption routine finishes, it
> no longer matters for this particular test suite.
> 
> Trying to find sensitive areas in the kernel, I hacked it up to always punt at
> least once per write -- no change in how often the corruption occurred.  I also
> hacked it up to try to keep I/O workers around vs. constantly tearing them down
> and respawning them, with no real change observed in corruption frequency,
> possibly because even with that we still end up creating a new I/O worker every
> so often.
> 
> What did have a major effect was hacking the kernel to both punt at least once
> per write *and* to aggressively exit I/O worker threads, indicating that
> something in thread setup or teardown is stomping on memory.  When I say
> "aggressively exit I/O worker threads", I basically did this:
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..84bfb8b9f068 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -677,6 +688,7 @@ static int io_wq_worker(void *data)
>                        exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
>                                                        wq->cpu_mask);
>                }
> +last_timeout = true;
>        }
> 
> That single change made it fail on the first or second pass vs somewhere around
> the 10th pass.
> 
> I do note we have our own io_uring-specific thread clone function,
> create_io_thread(), and I wonder if that clone function does something
> functionally different on ppc64 than the regular clone function?  Need to dig
> further...

Correction, that was an older patch.  The actual rapid-timeout patch needs another "last_timeout = true;" line added right before the "if (io_run_task_work())" check in the same function, to ensure the worker exits ASAP even with work sitting on the queue.
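
In other words, the extra assignment goes roughly here, on top of the earlier hunk (context lines approximate):

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ static int io_wq_worker(void *data)
 		last_timeout = false;
 		__io_worker_idle(wq, worker);
 		raw_spin_unlock(&wq->lock);
+		/* hack: force the timeout/exit path even when task_work makes us loop */
+		last_timeout = true;
 		if (io_run_task_work())
 			continue;
 		ret = schedule_timeout(WORKER_IDLE_TIMEOUT);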

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:58                                                                                               ` Timothy Pearson
  2023-11-13 21:22                                                                                                 ` Timothy Pearson
@ 2023-11-13 22:15                                                                                                 ` Jens Axboe
  2023-11-13 23:19                                                                                                   ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 22:15 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 1:58 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 1:29:38 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>> "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>> the outer page checksum was still valid.
>>>>>>
>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>> corruption observed on production systems.
>>>>>
>>>>> Quick status update -- after considerable effort applied I've managed
>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>> difficult to instrument the code paths I need without causing it to
>>>>> disappear.
>>>>>
>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>> caught the wild write a few times now, it is not in the standard
>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>> the io worker punting process.
>>>>>
>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>> is written to disk without any additional corruption, and is then read
>>>>> back out from disk into the page verification routine also without any
>>>>> additional corruption.  The page verification routine decrypts the
>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>> corruption warning and halt the test run.
>>>>>
>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>> right before the encryption routine, the bug disappears (or, more
>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>> context from somewhere, and scribbling some kind of status data into
>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>
>>>>> As always, investigation continues...
>>>>
>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>> necessarily ppc memory ordering related, but just something in the arch
>>>> specific copy section.
>>>>
>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>> then, there's been an attempt to make this generic:
>>>>
>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>
>>>>    fork: Generalize PF_IO_WORKER handling
>>>>
>>>> and later a powerpc change related to that too:
>>>>
>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>
>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>
>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>> the finer details of powerpc thread creation, but maybe try with this
>>>> and see if it makes any difference.
>>>>
>>>> As you note in your reply, we could very well be corrupting some bytes
>>>> somewhere every time. We just only notice quickly when it happens to be
>>>> in that specific buffer.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>> index 392404688cec..d4dec2fd091c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>> kernel_clone_args *args)
>>>>
>>>> 	klp_init_thread_info(p);
>>>>
>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>> 		/* kernel thread */
>>>>
>>>> 		/* Create initial minimum stack frame. */
>>>
>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>> welcome while I continue to debug...
>>
>> I ponder if it's still in there... I don't see what else could be poking
>> and causing user memory corruption.
> 
> Indeed, I'm really scratching my head on this one.  The corruption is
> definitely present, and quite sensitive to exact timing around page
> encryption start -- presumably if the wild write happens after the
> encryption routine finishes, it no longer matters for this particular
> test suite.
> 
> Trying to find sensitive areas in the kernel, I hacked it up to always
> punt at least once per write -- no change in how often the corruption
> occurred.  I also hacked it up to try to keep I/O workers around vs.
> constantly tearing them down and respawning them, with no real change
> observed in corruption frequency, possibly because even with that we
> still end up creating a new I/O worker every so often.
> 
> What did have a major effect was hacking the kernel to both punt at
> least once per write *and* to aggressively exit I/O worker threads,
> indicating that something in thread setup or teardown is stomping on
> memory.  When I say "aggressively exit I/O worker threads", I
> basically did this:

It's been my suspicion, as per previous email from today, that this is
related to worker creation on ppc. You can try this patch, which just
pre-creates workers and doesn't let them time out. That means that the max
number of workers for bounded work is pre-created before the ring is
used, so we'll never see any worker creation. If this works, then it's
certainly something related to worker creation.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..e87cabe5bbb7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -673,7 +673,7 @@ static int io_wq_worker(void *data)
 			break;
 		}
 		if (!ret) {
-			last_timeout = true;
+			// last_timeout = true;
 			exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
 							wq->cpu_mask);
 		}
@@ -947,8 +947,8 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	do_create = !io_wq_activate_free_worker(wq, acct);
 	rcu_read_unlock();
 
-	if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
-	    !atomic_read(&acct->nr_running))) {
+	if (0 && (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+	    !atomic_read(&acct->nr_running)))) {
 		bool did_create;
 
 		did_create = io_wq_create_worker(wq, acct);
@@ -1138,6 +1138,33 @@ static int io_wq_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 	return 1;
 }
 
+static void pre_create_workers(struct io_wq *wq)
+{
+	struct io_wq_acct *acct = &wq->acct[IO_WQ_ACCT_BOUND];
+	int i, ret, to_create = acct->max_workers;
+
+	raw_spin_lock(&wq->lock);
+	acct->nr_workers = to_create;
+	atomic_add(to_create, &acct->nr_running);
+	atomic_add(to_create, &wq->worker_refs);
+	raw_spin_unlock(&wq->lock);
+
+	for (i = 0; i < acct->max_workers; i++) {
+		ret = create_io_worker(wq, IO_WQ_ACCT_BOUND);
+		if (WARN_ON_ONCE(!ret))
+			break;
+	}
+
+	if (i != to_create) {
+		to_create -= i;
+		raw_spin_lock(&wq->lock);
+		acct->nr_workers -= to_create;
+		atomic_sub(to_create, &acct->nr_running);
+		atomic_sub(to_create, &wq->worker_refs);
+		raw_spin_unlock(&wq->lock);
+	}
+}
+
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
 	int ret, i;
@@ -1187,6 +1214,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	if (ret)
 		goto err;
 
+	pre_create_workers(wq);
 	return wq;
 err:
 	io_wq_put_hash(data->hash);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 22:15                                                                                                 ` Jens Axboe
@ 2023-11-13 23:19                                                                                                   ` Timothy Pearson
  2023-11-13 23:48                                                                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 23:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 4:15:44 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>> "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>> the outer page checksum was still valid.
>>>>>>>
>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>> corruption observed on production systems.
>>>>>>
>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>> disappear.
>>>>>>
>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>> the io worker punting process.
>>>>>>
>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>> is written to disk without any additional corruption, and is then read
>>>>>> back out from disk into the page verification routine also without any
>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>> corruption warning and halt the test run.
>>>>>>
>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>
>>>>>> As always, investigation continues...
>>>>>
>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>> specific copy section.
>>>>>
>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>> then, there's been an attempt to make this generic:
>>>>>
>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>
>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>
>>>>> and later a powerpc change related to that too:
>>>>>
>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>
>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>
>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>> and see if it makes any difference.
>>>>>
>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>> in that specific buffer.
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>> --- a/arch/powerpc/kernel/process.c
>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>> kernel_clone_args *args)
>>>>>
>>>>> 	klp_init_thread_info(p);
>>>>>
>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>> 		/* kernel thread */
>>>>>
>>>>> 		/* Create initial minimum stack frame. */
>>>>
>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>> welcome while I continue to debug...
>>>
>>> I ponder if it's still in there... I don't see what else could be poking
>>> and causing user memory corruption.
>> 
>> Indeed, I'm really scratching my head on this one.  The corruption is
>> definitely present, and quite sensitive to exact timing around page
>> encryption start -- presumably if the wild write happens after the
>> encryption routine finishes, it no longer matters for this particular
>> test suite.
>> 
>> Trying to find sensitive areas in the kernel, I hacked it up to always
>> punt at least once per write -- no change in how often the corruption
>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>> constantly tearing them down and respawning them, with no real change
>> observed in corruption frequency, possibly because even with that we
>> still end up creating a new I/O worker every so often.
>> 
>> What did have a major effect was hacking the kernel to both punt at
>> least once per write *and* to aggressively exit I/O worker threads,
>> indicating that something in thread setup or teardown is stomping on
>> memory.  When I say "aggressively exit I/O worker threads", I
>> basically did this:
> 
> It's been my suspicion, as per previous email from today, that this is
> related to worker creation on ppc. You can try this patch, which just
> pre-creates workers and doesn't let them time out. That means that the max
> number of workers for bounded work is pre-created before the ring is
> used, so we'll never see any worker creation. If this works, then it's
> certainly something related to worker creation.

Yep, that makes the issue disappear.  I wish I knew if it was always stepping on memory somewhere and it just hits unimportant process memory most of the time, or if it's only stepping on memory iff the tight timing conditions are met.

Technically it could be either worker creation or worker destruction.  Any quick way to distinguish between the two?  E.g. create threads, allow them to stop processing by timing out, but never tear them down somehow?  Obviously we'd eventually exhaust the system thread resource limits, but for a quick test it might be enough?

I'm also a bit perplexed as to how we can be stomping on user memory like this without some kind of page fault occurring.  If we can isolate things to thread creation vs. thread teardown, I'll go function by function and see what is going wrong.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 23:19                                                                                                   ` Timothy Pearson
@ 2023-11-13 23:48                                                                                                     ` Jens Axboe
  2023-11-14  0:04                                                                                                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 23:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 4:19 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 4:15:44 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>> "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>> the outer page checksum was still valid.
>>>>>>>>
>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>> corruption observed on production systems.
>>>>>>>
>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>> disappear.
>>>>>>>
>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>> the io worker punting process.
>>>>>>>
>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>> back out from disk into the page verification routine also without any
>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>> corruption warning and halt the test run.
>>>>>>>
>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>
>>>>>>> As always, investigation continues...
>>>>>>
>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>> specific copy section.
>>>>>>
>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>> then, there's been an attempt to make this generic:
>>>>>>
>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>
>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>
>>>>>> and later a powerpc change related to that too:
>>>>>>
>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>
>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>
>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>> and see if it makes any difference.
>>>>>>
>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>> in that specific buffer.
>>>>>>
>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>> kernel_clone_args *args)
>>>>>>
>>>>>> 	klp_init_thread_info(p);
>>>>>>
>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>> 		/* kernel thread */
>>>>>>
>>>>>> 		/* Create initial minimum stack frame. */
>>>>>
>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>> welcome while I continue to debug...
>>>>
>>>> I ponder if it's still in there... I don't see what else could be poking
>>>> and causing user memory corruption.
>>>
>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>> definitely present, and quite sensitive to exact timing around page
>>> encryption start -- presumably if the wild write happens after the
>>> encryption routine finishes, it no longer matters for this particular
>>> test suite.
>>>
>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>> punt at least once per write -- no change in how often the corruption
>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>> constantly tearing them down and respawning them, with no real change
>>> observed in corruption frequency, possibly because even with that we
>>> still end up creating a new I/O worker every so often.
>>>
>>> What did have a major effect was hacking the kernel to both punt at
>>> least once per write *and* to aggressively exit I/O worker threads,
>>> indicating that something in thread setup or teardown is stomping on
>>> memory.  When I say "aggressively exit I/O worker threads", I
>>> basically did this:
>>
>> It's been my suspicion, as per previous email from today, that this is
>> related to worker creation on ppc. You can try this patch, which just
>> pre-creates workers and doesn't let them time out. That means that the max
>> number of workers for bounded work is pre-created before the ring is
>> used, so we'll never see any worker creation. If this works, then it's
>> certainly something related to worker creation.
> 
> Yep, that makes the issue disappear.  I wish I knew if it was
> always stepping on memory somewhere and it just hits unimportant
> process memory most of the time, or if it's only stepping on memory
> iff the tight timing conditions are met.
> 
> Technically it could be either worker creation or worker destruction.
> Any quick way to distinguish between the two?  E.g. create threads,
> allow them to stop processing by timing out, but never tear them down
> somehow?  Obviously we'd eventually exhaust the system thread resource
> limits, but for a quick test it might be enough?

Sure, we could certainly do that. Something like the below should do
that: it goes through the normal teardown on timeout, but doesn't
actually call do_exit() until the wq is being torn down anyway. That
should ensure that we create workers as we need them, but when they time
out, they won't actually exit until we are tearing down anyway on ring
exit.

> I'm also a bit perplexed as to how we can be stomping on user memory
> like this without some kind of page fault occurring.  If we can
> isolate things to thread creation vs. thread teardown, I'll go
> function by function and see what is going wrong.

Agree, it's very odd.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..51a82daaac36 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <linux/mmu_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/io_uring.h>
 
 #include "io-wq.h"
@@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
 
 	kfree_rcu(worker, rcu);
 	io_worker_ref_put(wq);
+
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
+		msleep(500);
 	do_exit(0);
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 23:48                                                                                                     ` Jens Axboe
@ 2023-11-14  0:04                                                                                                       ` Timothy Pearson
  2023-11-14  0:13                                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  0:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 5:48:12 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>> "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>
>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>
>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>> corruption observed on production systems.
>>>>>>>>
>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>> disappear.
>>>>>>>>
>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>> the io worker punting process.
>>>>>>>>
>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>> corruption warning and halt the test run.
>>>>>>>>
>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>
>>>>>>>> As always, investigation continues...
>>>>>>>
>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>> specific copy section.
>>>>>>>
>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>
>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>
>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>
>>>>>>> and later a powerpc change related to that too:
>>>>>>>
>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>
>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>
>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>> and see if it makes any difference.
>>>>>>>
>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>> in that specific buffer.
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>> kernel_clone_args *args)
>>>>>>>
>>>>>>> 	klp_init_thread_info(p);
>>>>>>>
>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>> 		/* kernel thread */
>>>>>>>
>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>
>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>> welcome while I continue to debug...
>>>>>
>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>> and causing user memory corruption.
>>>>
>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>> definitely present, and quite sensitive to exact timing around page
>>>> encryption start -- presumably if the wild write happens after the
>>>> encryption routine finishes, it no longer matters for this particular
>>>> test suite.
>>>>
>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>> punt at least once per write -- no change in how often the corruption
>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>> constantly tearing them down and respawning them, with no real change
>>>> observed in corruption frequency, possibly because even with that we
>>>> still end up creating a new I/O worker every so often.
>>>>
>>>> What did have a major effect was hacking the kernel to both punt at
>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>> indicating that something in thread setup or teardown is stomping on
>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>> basically did this:
>>>
>>> It's been my suspicion, as per previous email from today, that this is
>>> related to worker creation on ppc. You can try this patch, which just
>>> pre-creates workers and doesn't let them time out. That means that the max
>>> number of workers for bounded work is pre-created before the ring is
>>> used, so we'll never see any worker creation. If this works, then it's
>>> certainly something related to worker creation.
>> 
>> Yep, that makes the issue disappear.  I wish I knew if it was
>> always stepping on memory somewhere and it just hits unimportant
>> process memory most of the time, or if it's only stepping on memory
>> iff the tight timing conditions are met.
>> 
>> Technically it could be either worker creation or worker destruction.
>> Any quick way to distinguish between the two?  E.g. create threads,
>> allow them to stop processing by timing out, but never tear them down
>> somehow?  Obviously we'd eventually exhaust the system thread resource
>> limits, but for a quick test it might be enough?
> 
> Sure, we could certainly do that. Something like the below should do
> that, it goes through the normal teardown on timeout, but doesn't
> actually call do_exit() until the wq is being torn down anyway. That
> should ensure that we create workers as we need them, but when they time
> out, they won't actually exit until we are tearing down anyway on ring
> exit.
> 
>> I'm also a bit perplexed as to how we can be stomping on user memory
>> like this without some kind of page fault occurring.  If we can
>> isolate things to thread creation vs. thread teardown, I'll go
>> function by function and see what is going wrong.
> 
> Agree, it's very odd.
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..51a82daaac36 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -16,6 +16,7 @@
> #include <linux/task_work.h>
> #include <linux/audit.h>
> #include <linux/mmu_context.h>
> +#include <linux/delay.h>
> #include <uapi/linux/io_uring.h>
> 
> #include "io-wq.h"
> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
> 
> 	kfree_rcu(worker, rcu);
> 	io_worker_ref_put(wq);
> +
> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
> +		msleep(500);
> 	do_exit(0);
> }

Thanks, was working up something similar but neglected the workqueue exit so got a hang.  With this patch, I still see the corruption, but all that's really telling me is that the core code inside do_exit() is OK (including, hopefully, the arch-specific stuff).  I'd really like to rule out the rest of the code in io_worker_exit(), is there a way to (easily) tell the workqueue to ignore a worker entirely without going through all the teardown in io_worker_exit() (and specifically the cancellation / release code)?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:04                                                                                                       ` Timothy Pearson
@ 2023-11-14  0:13                                                                                                         ` Jens Axboe
  2023-11-14  0:52                                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14  0:13 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 5:04 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 5:48:12 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>
>>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>
>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>
>>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>>> disappear.
>>>>>>>>>
>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>> the io worker punting process.
>>>>>>>>>
>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>
>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>
>>>>>>>>> As always, investigation continues...
>>>>>>>>
>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>> specific copy section.
>>>>>>>>
>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>
>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>
>>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>>
>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>
>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>
>>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>
>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>> and see if it makes any difference.
>>>>>>>>
>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>> in that specific buffer.
>>>>>>>>
>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>>> kernel_clone_args *args)
>>>>>>>>
>>>>>>>> 	klp_init_thread_info(p);
>>>>>>>>
>>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>> 		/* kernel thread */
>>>>>>>>
>>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>>
>>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>>> welcome while I continue to debug...
>>>>>>
>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>> and causing user memory corruption.
>>>>>
>>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>>> definitely present, and quite sensitive to exact timing around page
>>>>> encryption start -- presumably if the wild write happens after the
>>>>> encryption routine finishes, it no longer matters for this particular
>>>>> test suite.
>>>>>
>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>> punt at least once per write -- no change in how often the corruption
>>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>>> constantly tearing them down and respawning them, with no real change
>>>>> observed in corruption frequency, possibly because even with that we
>>>>> still end up creating a new I/O worker every so often.
>>>>>
>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>> indicating that something in thread setup or teardown is stomping on
>>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>>> basically did this:
>>>>
>>>> It's been my suspicion, as per previous email from today, that this is
>>>> related to worker creation on ppc. You can try this patch, which just
>>>> pre-creates workers and doesn't let them time out. That means that the max
>>>> number of workers for bounded work is pre-created before the ring is
>>>> used, so we'll never see any worker creation. If this works, then it's
>>>> certainly something related to worker creation.
>>>
>>> Yep, that makes the issue disappear.  I wish I knew if it was
>>> always stepping on memory somewhere and it just hits unimportant
>>> process memory most of the time, or if it's only stepping on memory
>>> iff the tight timing conditions are met.
>>>
>>> Technically it could be either worker creation or worker destruction.
>>> Any quick way to distinguish between the two?  E.g. create threads,
>>> allow them to stop processing by timing out, but never tear them down
>>> somehow?  Obviously we'd eventually exhaust the system thread resource
>>> limits, but for a quick test it might be enough?
>>
>> Sure, we could certainly do that. Something like the below should do
>> that, it goes through the normal teardown on timeout, but doesn't
>> actually call do_exit() until the wq is being torn down anyway. That
>> should ensure that we create workers as we need them, but when they time
>> out, they won't actually exit until we are tearing down anyway on ring
>> exit.
>>
>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>> like this without some kind of page fault occurring.  If we can
>>> isolate things to thread creation vs. thread teardown, I'll go
>>> function by function and see what is going wrong.
>>
>> Agree, it's very odd.
>>
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..51a82daaac36 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -16,6 +16,7 @@
>> #include <linux/task_work.h>
>> #include <linux/audit.h>
>> #include <linux/mmu_context.h>
>> +#include <linux/delay.h>
>> #include <uapi/linux/io_uring.h>
>>
>> #include "io-wq.h"
>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>
>> 	kfree_rcu(worker, rcu);
>> 	io_worker_ref_put(wq);
>> +
>> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>> +		msleep(500);
>> 	do_exit(0);
>> }
> 
> Thanks, was working up something similar but neglected the workqueue
> exit so got a hang.  With this patch, I still see the corruption, but
> all that's really telling me is that the core code inside do_exit() is
> OK (including, hopefully, the arch-specific stuff).  I'd really like
> to rule out the rest of the code in io_worker_exit(), is there a way
> to (easily) tell the workqueue to ignore a worker entirely without
> going through all the teardown in io_worker_exit() (and specifically
> the cancellation / release code)?

You could try this one - totally untested...


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..a72e5b6eb980 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <linux/mmu_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/io_uring.h>
 
 #include "io-wq.h"
@@ -193,6 +194,10 @@ static void io_worker_cancel_cb(struct io_worker *worker)
 	raw_spin_lock(&wq->lock);
 	acct->nr_workers--;
 	raw_spin_unlock(&wq->lock);
+
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
+		msleep(500);
+
 	io_worker_ref_put(wq);
 	clear_bit_unlock(0, &worker->create_state);
 	io_worker_release(worker);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:13                                                                                                         ` Jens Axboe
@ 2023-11-14  0:52                                                                                                           ` Timothy Pearson
  2023-11-14  5:06                                                                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  0:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:13:10 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 5:04 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 5:48:12 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>>
>>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>>
>>>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>>>> disappear.
>>>>>>>>>>
>>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>>> the io worker punting process.
>>>>>>>>>>
>>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>>
>>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>>
>>>>>>>>>> As always, investigation continues...
>>>>>>>>>
>>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>>> specific copy section.
>>>>>>>>>
>>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>>
>>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>>
>>>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>>>
>>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>>
>>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>>
>>>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>>
>>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>>> and see if it makes any difference.
>>>>>>>>>
>>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>>> in that specific buffer.
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>>>> kernel_clone_args *args)
>>>>>>>>>
>>>>>>>>> 	klp_init_thread_info(p);
>>>>>>>>>
>>>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>>> 		/* kernel thread */
>>>>>>>>>
>>>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>>>
>>>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>>>> welcome while I continue to debug...
>>>>>>>
>>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>>> and causing user memory corruption.
>>>>>>
>>>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>>>> definitely present, and quite sensitive to exact timing around page
>>>>>> encryption start -- presumably if the wild write happens after the
>>>>>> encryption routine finishes, it no longer matters for this particular
>>>>>> test suite.
>>>>>>
>>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>>> punt at least once per write -- no change in how often the corruption
>>>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>>>> constantly tearing them down and respawning them, with no real change
>>>>>> observed in corruption frequency, possibly because even with that we
>>>>>> still end up creating a new I/O worker every so often.
>>>>>>
>>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>>> indicating that something in thread setup or teardown is stomping on
>>>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>>>> basically did this:
>>>>>
>>>>> It's been my suspicion, as per previous email from today, that this is
>>>>> related to worker creation on ppc. You can try this patch, which just
>>>>> pre-creates workers and doesn't let them time out. That means that the max
>>>>> number of workers for bounded work is pre-created before the ring is
>>>>> used, so we'll never see any worker creation. If this works, then it's
>>>>> certainly something related to worker creation.
>>>>
>>>> Yep, that makes the issue disappear.  I wish I knew if it was
>>>> always stepping on memory somewhere and it just hits unimportant
>>>> process memory most of the time, or if it's only stepping on memory
>>>> iff the tight timing conditions are met.
>>>>
>>>> Technically it could be either worker creation or worker destruction.
>>>> Any quick way to distinguish between the two?  E.g. create threads,
>>>> allow them to stop processing by timing out, but never tear them down
>>>> somehow?  Obviously we'd eventually exhaust the system thread resource
>>>> limits, but for a quick test it might be enough?
>>>
>>> Sure, we could certainly do that. Something like the below should do
>>> that, it goes through the normal teardown on timeout, but doesn't
>>> actually call do_exit() until the wq is being torn down anyway. That
>>> should ensure that we create workers as we need them, but when they time
>>> out, they won't actually exit until we are tearing down anyway on ring
>>> exit.
>>>
>>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>>> like this without some kind of page fault occurring.  If we can
>>>> isolate things to thread creation vs. thread teardown, I'll go
>>>> function by function and see what is going wrong.
>>>
>>> Agree, it's very odd.
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..51a82daaac36 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -16,6 +16,7 @@
>>> #include <linux/task_work.h>
>>> #include <linux/audit.h>
>>> #include <linux/mmu_context.h>
>>> +#include <linux/delay.h>
>>> #include <uapi/linux/io_uring.h>
>>>
>>> #include "io-wq.h"
>>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>>
>>> 	kfree_rcu(worker, rcu);
>>> 	io_worker_ref_put(wq);
>>> +
>>> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>>> +		msleep(500);
>>> 	do_exit(0);
>>> }
>> 
>> Thanks, was working up something similar but neglected the workqueue
>> exit so got a hang.  With this patch, I still see the corruption, but
>> all that's really telling me is that the core code inside do_exit() is
>> OK (including, hopefully, the arch-specific stuff).  I'd really like
>> to rule out the rest of the code in io_worker_exit(), is there a way
>> to (easily) tell the workqueue to ignore a worker entirely without
>> going through all the teardown in io_worker_exit() (and specifically
>> the cancellation / release code)?
> 
> You could try this one - totally untested...

Seemed to do what I wanted, however corruption remains.  Was worth a try...

I guess I'll proceed under the assumption that somehow the thread setup is stomping on userspace for now.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:52                                                                                                           ` Timothy Pearson
@ 2023-11-14  5:06                                                                                                             ` Timothy Pearson
  2023-11-14 13:17                                                                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  5:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:52:51 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> Seemed to do what I wanted, however corruption remains.  Was worth a try...
> 
> I guess I'll proceed under the assumption that somehow the thread setup is
> stomping on userspace for now.

Finally found it!  Patch here:

https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u

We were missing the appropriate barrier instruction inside the IPI issued by the task_work_add() call, which was in turn issued during queued io worker creation.  The explanation for how userspace was getting corrupted boils down to an inconsistent view of main memory as seen by the two cores involved in the worker handoff and new worker creation.

Who doesn't love a one line fix after a week and a half of pulling one's hair out? ;)
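
For anyone who has not chased this class of bug before, here is a minimal, self-contained userspace sketch of the message-passing pattern described above.  It is purely illustrative -- the file name, compile line, and variable names are invented for the example, and it is not the kernel fix itself -- but it shows why missing ordering between "publish the data" and "signal the other CPU" can leave the observer with a stale view of memory on a weakly ordered machine like POWER9:

/* mp.c -- illustrative sketch only, not the io_uring patch.
 * Build with: gcc -O2 -pthread mp.c -o mp
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data;		/* the "work item" being published */
static atomic_int flag;		/* the "IPI" / wakeup signal */

static void *producer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&data, 42, memory_order_relaxed);
	/* The bug class under discussion: no barrier here.  Storing the
	 * flag with memory_order_release (or a full barrier) would force
	 * the data store to be visible before the "signal". */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
	return NULL;
}

static void *consumer(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&flag, memory_order_relaxed))
		;	/* spin until "signalled" */
	/* With relaxed ordering this may legally print 0 on POWER; with a
	 * release store paired with an acquire load it must print 42. */
	printf("data = %d\n",
	       atomic_load_explicit(&data, memory_order_relaxed));
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, consumer, NULL);
	pthread_create(&b, NULL, producer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}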

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  5:06                                                                                                             ` Timothy Pearson
@ 2023-11-14 13:17                                                                                                               ` Jens Axboe
  2023-11-14 16:59                                                                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 13:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 6:52:51 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>
>> I guess I'll proceed under the assumption that somehow the thread setup is
>> stomping on userspace for now.
> 
> Finally found it!  Patch here:
> 
> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
> 
> We were missing the appropriate barrier instruction inside the IPI
> issued by the task_work_add() call, which was in turn issued during
> queued io worker creation.  The explanation for how userspace was
> getting corrupted boils down to an inconsistent view of main memory as
> seen by the two cores involved in the worker handoff and new worker
> creation.
> 
> Who doesn't love a one line fix after a week and a half of pulling
> one's hair out? ;)

Hate to be a debbie downer, but it still fails for me with that patch:

debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1 --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0 --repeat=500
[...]
encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
        Test ended at 2023-11-14 13:02:52

CURRENT_TEST: encryption.innodb_encryption
mysqltest: At line 11: query 'SET @start_global_value = @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193): Unknown system variable 'innodb_encryption_threads'

The result from queries just before the failure was:
SET @start_global_value = @@global.innodb_encryption_threads;

 - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
***Warnings generated in error logs during shutdown after running tests: encryption.innodb_encryption

2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity level experimental while the server is stable
2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './ibdata1' page [page id: space=0, page number=220]. You may have to recover from a backup.
2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error Page read from tablespace is corrupted.
2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.

after about 40 loops. How consistent is it on your end? I know I've
suspected the signaling before and tried the NO_IPI variant, but that
still hit it.

If it makes a difference for you, it must be close. But don't think it's
quite there yet, unfortunately :-(

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 13:17                                                                                                               ` Jens Axboe
@ 2023-11-14 16:59                                                                                                                 ` Timothy Pearson
  2023-11-14 17:04                                                                                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 16:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 7:17:22 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Jens Axboe" <axboe@kernel.dk>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>
>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>> stomping on userspace for now.
>> 
>> Finally found it!  Patch here:
>> 
>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>> 
>> We were missing the appropriate barrier instruction inside the IPI
>> issued by the task_work_add() call, which was in turn issued during
>> queued io worker creation.  The explanation for how userspace was
>> getting corrupted boils down to an inconsistent view of main memory as
>> seen by the two cores involved in the worker handoff and new worker
>> creation.
>> 
>> Who doesn't love a one line fix after a week and a half of pulling
>> one's hair out? ;)
> 
> Hate to be a debbie downer, but it still fails for me with that patch:
> 
> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
> --repeat=500
> [...]
> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>        Test ended at 2023-11-14 13:02:52
> 
> CURRENT_TEST: encryption.innodb_encryption
> mysqltest: At line 11: query 'SET @start_global_value =
> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
> Unknown system variable 'innodb_encryption_threads'
> 
> The result from queries just before the failure was:
> SET @start_global_value = @@global.innodb_encryption_threads;
> 
> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
> ***Warnings generated in error logs during shutdown after running tests:
> encryption.innodb_encryption
> 
> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
> level experimental while the server is stable
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
> may have to recover from a backup.
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
> Page read from tablespace is corrupted.
> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
> failed.
> 
> after about 40 loops. How consistent is it on your end? I know I've
> suspected the signaling before and tried the NO_IPI variant, but that
> still hit it.
> 
> If it makes a difference for you, it must be close. But don't think it's
> quite there yet, unfortunately :-(

Interesting!  I've run it for thousands of loops on my test VM, no failures.  Let me try on the shared test box.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 16:59                                                                                                                 ` Timothy Pearson
@ 2023-11-14 17:04                                                                                                                   ` Jens Axboe
  2023-11-14 17:14                                                                                                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 17:04 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 9:59 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 14, 2023 7:17:22 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>>
>>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>>> stomping on userspace for now.
>>>
>>> Finally found it!  Patch here:
>>>
>>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>>>
>>> We were missing the appropriate barrier instruction inside the IPI
>>> issued by the task_work_add() call, which was in turn issued during
>>> queued io worker creation.  The explanation for how userspace was
>>> getting corrupted boils down to an inconsistent view of main memory as
>>> seen by the two cores involved in the worker handoff and new worker
>>> creation.
>>>
>>> Who doesn't love a one line fix after a week and a half of pulling
>>> one's hair out? ;)
>>
>> Hate to be a debbie downer, but it still fails for me with that patch:
>>
>> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
>> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
>> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
>> --repeat=500
>> [...]
>> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
>> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>>        Test ended at 2023-11-14 13:02:52
>>
>> CURRENT_TEST: encryption.innodb_encryption
>> mysqltest: At line 11: query 'SET @start_global_value =
>> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
>> Unknown system variable 'innodb_encryption_threads'
>>
>> The result from queries just before the failure was:
>> SET @start_global_value = @@global.innodb_encryption_threads;
>>
>> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
>> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
>> ***Warnings generated in error logs during shutdown after running tests:
>> encryption.innodb_encryption
>>
>> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
>> level experimental while the server is stable
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
>> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
>> may have to recover from a backup.
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
>> Page read from tablespace is corrupted.
>> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
>> failed.
>>
>> after about 40 loops. How consistent is it on your end? I know I've
>> suspected the signaling before and tried the NO_IPI variant, but that
>> still hit it.
>>
>> If it makes a difference for you, it must be close. But don't think it's
>> quite there yet, unfortunately :-(
> 
> Interesting!  I've run it for thousands of loops on my test VM, no
> failures.  Let me try on the shared test box.

So odd. I guess it must be close if it's enough to make yours sane. The
other vm I've been using is just a small vm somewhere else, it has 8
threads of:

debian@linux-kernel--io-uring:~$ cat /proc/cpuinfo 
processor	: 0
cpu		: POWER9 (architected), altivec supported
clock		: 2200.000000MHz
revision	: 2.2 (pvr 004e 1202)
[...]
timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)
MMU		: Radix

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:04                                                                                                                   ` Jens Axboe
@ 2023-11-14 17:14                                                                                                                     ` Timothy Pearson
  2023-11-14 17:17                                                                                                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:04:03 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 9:59 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 14, 2023 7:17:22 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>>>
>>>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>>>> stomping on userspace for now.
>>>>
>>>> Finally found it!  Patch here:
>>>>
>>>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>>>>
>>>> We were missing the appropriate barrier instruction inside the IPI
>>>> issued by the task_work_add() call, which was in turn issued during
>>>> queued io worker creation.  The explanation for how userspace was
>>>> getting corrupted boils down to an inconsistent view of main memory as
>>>> seen by the two cores involved in the worker handoff and new worker
>>>> creation.
>>>>
>>>> Who doesn't love a one line fix after a week and a half of pulling
>>>> one's hair out? ;)
>>>
>>> Hate to be a debbie downer, but it still fails for me with that patch:
>>>
>>> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
>>> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
>>> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
>>> --repeat=500
>>> [...]
>>> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
>>> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>>>        Test ended at 2023-11-14 13:02:52
>>>
>>> CURRENT_TEST: encryption.innodb_encryption
>>> mysqltest: At line 11: query 'SET @start_global_value =
>>> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
>>> Unknown system variable 'innodb_encryption_threads'
>>>
>>> The result from queries just before the failure was:
>>> SET @start_global_value = @@global.innodb_encryption_threads;
>>>
>>> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
>>> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
>>> ***Warnings generated in error logs during shutdown after running tests:
>>> encryption.innodb_encryption
>>>
>>> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
>>> level experimental while the server is stable
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
>>> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
>>> may have to recover from a backup.
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
>>> Page read from tablespace is corrupted.
>>> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
>>> failed.
>>>
>>> after about 40 loops. How consistent is it on your end? I know I've
>>> suspected the signaling before and tried the NO_IPI variant, but that
>>> still hit it.
>>>
>>> If it makes a difference for you, it must be close. But don't think it's
>>> quite there yet, unfortunately :-(
>> 
>> Interesting!  I've run it for thousands of loops on my test VM, no
>> failures.  Let me try on the shared test box.
> 
> So odd. I guess it must be close if it's enough to make yours sane. The
> other vm I've been using is just a small vm somewhere else, it has 8
> threads of:
> 
> debian@linux-kernel--io-uring:~$ cat /proc/cpuinfo
> processor	: 0
> cpu		: POWER9 (architected), altivec supported
> clock		: 2200.000000MHz
> revision	: 2.2 (pvr 004e 1202)
> [...]
> timebase	: 512000000
> platform	: pSeries
> model		: IBM pSeries (emulated by qemu)
> machine		: CHRP IBM pSeries (emulated by qemu)
> MMU		: Radix

At this point my concern is there might be two bugs, one related to the IPI and one related to the direct creation path.  Since I can't reproduce easily on my end, can you see if the box you have access to is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?
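
A one-off debug print would be enough to tell, e.g. something like this (illustrative only; do_create/wq/acct are placeholders for whatever the surrounding code in your tree actually uses):

	/* sketch only; condition and variable names are placeholders */
	if (do_create) {
		trace_printk("io_wq_enqueue: direct io_wq_create_worker()\n");
		io_wq_create_worker(wq, acct);
	}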

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:14                                                                                                                     ` Timothy Pearson
@ 2023-11-14 17:17                                                                                                                       ` Jens Axboe
  2023-11-14 17:21                                                                                                                         ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 17:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 10:14 AM, Timothy Pearson wrote:
> [...]
> At this point my concern is there might be two bugs, one related to
> the IPI and one related to the direct creation path.  Since I can't
> reproduce easily on my end, can you see if the box you have access to
> is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?

It's hitting it, and doing the worker pre-create like in the patch I
sent you makes the bug go away here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:17                                                                                                                       ` Jens Axboe
@ 2023-11-14 17:21                                                                                                                         ` Timothy Pearson
  2023-11-14 17:57                                                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:17:14 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 10:14 AM, Timothy Pearson wrote:
>> [...]
>> At this point my concern is there might be two bugs, one related to
>> the IPI and one related to the direct creation path.  Since I can't
>> reproduce easily on my end, can you see if the box you have access to
>> is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?
> 
> It's hitting it, and doing the worker pre-create like in the patch I
> sent you makes the bug go away here.

That's what I thought.  OK, let me chew on this for a bit.  I'm not hitting it at all; my worker creation only goes through the deferred path, which is why adding the barrier to the IPI fully resolved the problem on my end.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:21                                                                                                                         ` Timothy Pearson
@ 2023-11-14 17:57                                                                                                                           ` Timothy Pearson
  2023-11-14 18:02                                                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:21:32 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> [...]
>> It's hitting it, and doing the worker pre-create like in the patch I
>> sent you makes the bug go away here.
> 
> That's what I thought.  OK, let me chew on this for a bit.  I'm not hitting it
> at all, my worker creation goes through the deferred path only hence why adding
> the barrier to the IPI fully resolved the problem on my end.

It's a stab in the dark, but does adding smp_mb() right before the io_wq_create_worker() call fix anything?
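
Concretely I mean something like this, right at the direct-create call site (sketch only; the arguments are placeholders, not an exact hunk against any particular tree):

	/* sketch only; argument names are placeholders */
	smp_mb();	/* full barrier before the new worker can observe prior stores */
	io_wq_create_worker(wq, acct);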

What mariadb version are you testing with?  I can't seem to get into the io_wq_enqueue() path at all.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:57                                                                                                                           ` Timothy Pearson
@ 2023-11-14 18:02                                                                                                                             ` Jens Axboe
  2023-11-14 18:12                                                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 18:02 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 10:57 AM, Timothy Pearson wrote:
> [...]
> 
> It's a stab in the dark, but does adding smp_mb() right before the
> io_wq_create_worker() call fix anything?

Sure, let's try it.

> What mariadb version are you testing with?  I can't seem to get into
> the io_wq_enqueue() path at all.

11.0.4-MariaDB here. You most certainly should be hitting that, unless
you're still using some of my previous patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:02                                                                                                                             ` Jens Axboe
@ 2023-11-14 18:12                                                                                                                               ` Timothy Pearson
  2023-11-14 18:26                                                                                                                                 ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 18:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 12:02:56 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 10:57 AM, Timothy Pearson wrote:
>> [...]
>> What mariadb version are you testing with?  I can't seem to get into
>> the io_wq_enqueue() path at all.
> 
> 11.0.4-MariaDB here. You most certainly should be hitting that, unless
> you're still using some of my previous patches.

Let me check.  I'm pretty sure I had an old patch in-tree on the branch I was using to create the diff against.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:12                                                                                                                               ` Timothy Pearson
@ 2023-11-14 18:26                                                                                                                                 ` Jens Axboe
  2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 18:26 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 11:12 AM, Timothy Pearson wrote:
> [...]
>>> It's a stab in the dark, but does adding smp_mb() right before the
>>> io_wq_create_worker() call fix anything?
>>
>> Sure, let's try it.

I tested this, still happens with that change added and the IPI lwsync
retained.

>>> What mariadb version are you testing with?  I can't seem to get into
>>> the io_wq_enqueue() path at all.
>>
>> 11.0.4-MariaDB here. You most certainly should be hitting that, unless
>> you're still using some of my previous patches.
> 
> Let me check.  I'm pretty sure I had an old patch in-tree on the
> branch I was using to create the diff against.

OK

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:26                                                                                                                                 ` Jens Axboe
@ 2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
  2023-11-15 16:46                                                                                                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 11:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

I haven't had much success in getting the IPI path to work properly, but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at least able to narrow down one of the areas that's going wrong.  Bear in mind that as soon as I re-enable IPI the corruption returns with a vengeance, so this is not the correct fix yet by any means -- I am currently soliciting feedback on what else might be going wrong at this point since I've already spent a couple of weeks on this and am not sure how much more time I can spend before we just have to shut io_uring down on ppc64 for the foreseeable future.
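
For reference, the experiment above is just forcing the no-IPI notify mode in the io-wq worker-create path, roughly like the following (field names approximate, not a real patch):

	/* was TWA_SIGNAL; approximate, for illustration only */
	task_work_add(wq->task, &worker->create_work, TWA_SIGNAL_NO_IPI);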

Whatever the root cause actually is, something is *very* sensitive to timing in both the worker thread creation path and the io_queue_sqe() / io_queue_async() paths.  I can make the corruption disappear by adding a udelay(1000) before io_queue_async() in the io_queue_sqe() function; however, no amount of memory barriers in the io_queue_async() path (including in the kbuf recycling code) will fully resolve the problem.
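
The hack in question is literally just this, at the tail of io_queue_sqe() (debug only, not a proposed fix; needs <linux/delay.h>, and the surrounding context is approximate):

	if (unlikely(ret)) {
		udelay(1000);	/* 1ms busy wait; corruption no longer reproduces */
		io_queue_async(req, ret);
	}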

Jens, would a small delay like that in io_queue_sqe() reduce the number of workers being created overall?  I know with some of the other delay locations worker allocation was changing; from what I see this one wouldn't seem to have much effect, but I'm still looking for a sanity check.  If we're needing to wait for a millisecond for some other thread to complete before moving on, that might be valuable information -- it would also potentially tie in to the IPI path still malfunctioning, as the worker would immediately start executing.

On a related note, how is inter-thread safety of the io_kiocb buffer list guaranteed, especially on weak memory model systems?  As I understand it, different workers running on different cores could potentially be interacting with the same kiocb request and the same buffer list, and that does dovetail with the fact that punting to a different I/O worker (usually on another core) seems to provoke the problem.  I tried adding memory barriers to some of the basic recycle functions without too much success -- it seemed to help somewhat, but nowhere near complete resolution, and the buffers are used in a number of other places I didn't even try to poke at.  I wanted to get some feedback on this concept before going down yet another rabbit hole...

Thoughts very much welcome, I don't have many hairs left to pull out! ;)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
@ 2023-11-15 16:46                                                                                                                                     ` Jens Axboe
  2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 16:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 4:03 AM, Timothy Pearson wrote:
> I haven't had much success in getting the IPI path to work properly,
> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
> least able to narrow down one of the areas that's going wrong.  Bear
> in mind that as soon as I reenable IPI the corruption returns with a
> vengeance, so this is not the correct fix yet by any means -- I am
> currently soliciting feedback on what else might be going wrong at
> this point since I've already spent a couple of weeks on this and am
> not sure how much more time I can spend before we just have to shut
> io_uring down on ppc64 for the forseeable future.
> 
> Whatever the root cause actually is, something is *very* sensitive to
> timing in both the worker thread creation path and the io_queue_sqe()
> / io_queue_async() paths.  I can make the corruption disappear by
> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
> function, however no amount of memory barriers in the io_queue_async()
> path (including in the kbuf recycling code) will fully resolve the
> problem.
> 
> Jens, would a small delay like that in io_queue_sqe() reduce the
> amount of workers being created overall?  I know with some of the
> other delay locations worker allocation was changing, from what I see
> this one wouldn't seem to have much effect, but I'm still looking for
> a sanity check.  If we're needing to wait for a millisecond for some
> other thread to complete before moving on that might be valuable
> information -- would also potentially tie in to the IPI path still
> malfunctioning as the worker would immediately start executing.

If io_queue_sqe() ultimately ends up punting to io-wq for this request,
then yes, doing a 1ms delay in there would ultimately mean a 1ms
delay before we either pass to an existing worker or create a new one.

> On a related note, how is inter-thread safety of the io_kiocb buffer
> list guaranteed, especially on weak memory model systems?  As I
> understand it, different workers running on different cores could
> potentially be interacting with the same kiocb request and the same
> buffer list, and that does dovetail with the fact that punting to a
> different I/O worker (usually on another core) seems to provoke the
> problem.  I tried adding memory barriers to some of the basic recycle
> functions without too much success -- it seemed to help somewhat, but
> nowhere near complete resolution, and the buffers are used in a number
> of other places I didn't even try to poke at.  I wanted to get some
> feedback on this concept before going down yet another rabbit hole...

This relies on the fact that we grab the wq lock before inserting this
work, and the unlocking will be a barrier. It's important to note that
this isn't any different from before io-wq was using native
workers; the only difference is that it used to be kthreads before, and
now it's native threads to the application. The kthreads did a bunch of
work to assume the necessary identity to do the read or write operation
(which is ultimately why that approach went away, as it was just
inherently unsafe), whereas the native threads do not as they already
have what they need.
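
In other words, the ordering comes from the queueing itself; schematically (lock and field names approximate):

	raw_spin_lock(&acct->lock);
	wq_list_add_tail(&work->list, &acct->work_list);
	raw_spin_unlock(&acct->lock);	/* release; pairs with the worker taking the lock */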

I had a patch that just punted to a kthread and did the necessary
kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
that point. Within the existing code...
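
The rough shape of that experiment was basically (not the actual patch, just to illustrate):

	kthread_use_mm(mm);		/* mm of the submitting task */
	ret = do_the_op(req);		/* placeholder for the actual read/write */
	kthread_unuse_mm(mm);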

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 16:46                                                                                                                                     ` Jens Axboe
@ 2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Timothy Pearson, regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 10:46:58 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> [...]
> I had a patch that just punted to a kthread and did the necessary
> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
> that point. Within the existing code...

Would you happen to have that patch still?  It would provide a possible starting point for figuring out the exact difference.  If not I guess I could hack something similar up.

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
@ 2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 18:30 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 10:03 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>> I haven't had much success in getting the IPI path to work properly,
>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>> least able to narrow down one of the areas that's going wrong.  Bear
>>> in mind that as soon as I reenable IPI the corruption returns with a
>>> vengeance, so this is not the correct fix yet by any means -- I am
>>> currently soliciting feedback on what else might be going wrong at
>>> this point since I've already spent a couple of weeks on this and am
>>> not sure how much more time I can spend before we just have to shut
>>> io_uring down on ppc64 for the forseeable future.
>>>
>>> Whatever the root cause actually is, something is *very* sensitive to
>>> timing in both the worker thread creation path and the io_queue_sqe()
>>> / io_queue_async() paths.  I can make the corruption disappear by
>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>> function, however no amount of memory barriers in the io_queue_async()
>>> path (including in the kbuf recycling code) will fully resolve the
>>> problem.
>>>
>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>> amount of workers being created overall?  I know with some of the
>>> other delay locations worker allocation was changing, from what I see
>>> this one wouldn't seem to have much effect, but I'm still looking for
>>> a sanity check.  If we're needing to wait for a millisecond for some
>>> other thread to complete before moving on that might be valuable
>>> information -- would also potentially tie in to the IPI path still
>>> malfunctioning as the worker would immediately start executing.
>>
>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>> delay before we either pass to an existing worker or create a new one.
>>
>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>> list guaranteed, especially on weak memory model systems?  As I
>>> understand it, different workers running on different cores could
>>> potentially be interacting with the same kiocb request and the same
>>> buffer list, and that does dovetail with the fact that punting to a
>>> different I/O worker (usually on another core) seems to provoke the
>>> problem.  I tried adding memory barriers to some of the basic recycle
>>> functions without too much success -- it seemed to help somewhat, but
>>> nowhere near complete resolution, and the buffers are used in a number
>>> of other places I didn't even try to poke at.  I wanted to get some
>>> feedback on this concept before going down yet another rabbit hole...
>>
>> This relies on the fact that we grab the wq lock before inserting this
>> work, and the unlocking will be a barrier. It's important to note that
>> this isn't any different than from before io-wq was using native
>> workers, the only difference is that it used to be kthreads before, and
>> now it's native threads to the application. The kthreads did a bunch of
>> work to assume the necessary identity to do the read or write operation
>> (which is ultimately why that approach went away, as it was just
>> inherently unsafe), whereas the native threads do not as they already
>> have what they need.
>>
>> I had a patch that just punted to a kthread and did the necessary
>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>> that point. Within the existing code...
> 
> Would you happen to have that patch still?  It would provide a
> possible starting point for figuring out the exact difference.  If not
> I guess I could hack something similar up.

Let me see if I can find it, and make sure it applies on the current
tree. I'll send you one in a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
@ 2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
  2023-11-15 18:37                                                                                                                                             ` Jens Axboe
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 18:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 12:30:15 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>> I haven't had much success in getting the IPI path to work properly,
>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>> currently soliciting feedback on what else might be going wrong at
>>>> this point since I've already spent a couple of weeks on this and am
>>>> not sure how much more time I can spend before we just have to shut
>>>> io_uring down on ppc64 for the forseeable future.
>>>>
>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>> function, however no amount of memory barriers in the io_queue_async()
>>>> path (including in the kbuf recycling code) will fully resolve the
>>>> problem.
>>>>
>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>> amount of workers being created overall?  I know with some of the
>>>> other delay locations worker allocation was changing, from what I see
>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>> other thread to complete before moving on that might be valuable
>>>> information -- would also potentially tie in to the IPI path still
>>>> malfunctioning as the worker would immediately start executing.
>>>
>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>> delay before we either pass to an existing worker or create a new one.
>>>
>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>> list guaranteed, especially on weak memory model systems?  As I
>>>> understand it, different workers running on different cores could
>>>> potentially be interacting with the same kiocb request and the same
>>>> buffer list, and that does dovetail with the fact that punting to a
>>>> different I/O worker (usually on another core) seems to provoke the
>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>> functions without too much success -- it seemed to help somewhat, but
>>>> nowhere near complete resolution, and the buffers are used in a number
>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>> feedback on this concept before going down yet another rabbit hole...
>>>
>>> This relies on the fact that we grab the wq lock before inserting this
>>> work, and the unlocking will be a barrier. It's important to note that
>>> this isn't any different than from before io-wq was using native
>>> workers, the only difference is that it used to be kthreads before, and
>>> now it's native threads to the application. The kthreads did a bunch of
>>> work to assume the necessary identity to do the read or write operation
>>> (which is ultimately why that approach went away, as it was just
>>> inherently unsafe), whereas the native threads do not as they already
>>> have what they need.
>>>
>>> I had a patch that just punted to a kthread and did the necessary
>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>> that point. Within the existing code...
>> 
>> Would you happen to have that patch still?  It would provide a
>> possible starting point for figuring out the exact difference.  If not
>> I guess I could hack something similar up.
> 
> Let me see if I can find it, and make sure it applies on the current
> tree. I'll send you one in a bit.

Much appreciated.

New question -- should a user of liburing be able to oops the kernel under any circumstances (userspace access / NULL pointer dereference)?  I've started putting together a torture test application and hit ... something ... almost right away.  If this isn't supposed to happen, I'll send more details off-list for a double check.
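
The test itself isn't included here; as a rough sketch of the general pattern only (the file name, queue depth, buffer size and iteration count below are arbitrary placeholders, not the actual application), a tight liburing write submit/complete loop looks something like this:

/* Minimal illustrative liburing write loop, NOT the actual torture test.
 * File name, queue depth, buffer size and iteration count are placeholders
 * chosen purely for the sketch. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	char buf[4096];
	int fd, i;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;
	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	memset(buf, 0xaa, sizeof(buf));

	for (i = 0; i < 10000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		if (!sqe)
			break;
		io_uring_prep_write(sqe, fd, buf, sizeof(buf),
				    (unsigned long long)i * sizeof(buf));
		io_uring_submit(&ring);
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->res < 0)
			fprintf(stderr, "write failed: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	close(fd);
	io_uring_queue_exit(&ring);
	return 0;
}

(Builds with gcc -luring.)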

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
@ 2023-11-15 18:37                                                                                                                                             ` Jens Axboe
  2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 18:37 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 11:35 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 12:30:15 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>> currently soliciting feedback on what else might be going wrong at
>>>>> this point since I've already spent a couple of weeks on this and am
>>>>> not sure how much more time I can spend before we just have to shut
>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>
>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>> problem.
>>>>>
>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>> amount of workers being created overall?  I know with some of the
>>>>> other delay locations worker allocation was changing, from what I see
>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>> other thread to complete before moving on that might be valuable
>>>>> information -- would also potentially tie in to the IPI path still
>>>>> malfunctioning as the worker would immediately start executing.
>>>>
>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>> delay before we either pass to an existing worker or create a new one.
>>>>
>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>> understand it, different workers running on different cores could
>>>>> potentially be interacting with the same kiocb request and the same
>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>
>>>> This relies on the fact that we grab the wq lock before inserting this
>>>> work, and the unlocking will be a barrier. It's important to note that
>>>> this isn't any different than from before io-wq was using native
>>>> workers, the only difference is that it used to be kthreads before, and
>>>> now it's native threads to the application. The kthreads did a bunch of
>>>> work to assume the necessary identity to do the read or write operation
>>>> (which is ultimately why that approach went away, as it was just
>>>> inherently unsafe), whereas the native threads do not as they already
>>>> have what they need.
>>>>
>>>> I had a patch that just punted to a kthread and did the necessary
>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>> that point. Within the existing code...
>>>
>>> Would you happen to have that patch still?  It would provide a
>>> possible starting point for figuring out the exact difference.  If not
>>> I guess I could hack something similar up.
>>
>> Let me see if I can find it, and make sure it applies on the current
>> tree. I'll send you one in a bit.
> 
> Much appreciated.
> 
> New question -- should a user of liburing be able to oops the kernel
> under any circumstances (userspace access / NULL pointer dereference)?
> I've started putting together a torture test application and hit ...
> something ... almost right away .  If this isn't supposed to happen
> I'll send more details off-list for a double check.

No, certainly not. Please do send me the details. I'm assuming this is
on an unmodified kernel; some of the patches that have been flung around
in this thread are definitely not generally sane, and it's only knowing
the context in which they're being used that makes them fine for test
purposes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:37                                                                                                                                             ` Jens Axboe
@ 2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 18:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 12:37:31 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 11:35 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 12:30:15 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>
>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>> problem.
>>>>>>
>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>> amount of workers being created overall?  I know with some of the
>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>> other thread to complete before moving on that might be valuable
>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>
>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>
>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>> understand it, different workers running on different cores could
>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>
>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>> this isn't any different than from before io-wq was using native
>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>> work to assume the necessary identity to do the read or write operation
>>>>> (which is ultimately why that approach went away, as it was just
>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>> have what they need.
>>>>>
>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>> that point. Within the existing code...
>>>>
>>>> Would you happen to have that patch still?  It would provide a
>>>> possible starting point for figuring out the exact difference.  If not
>>>> I guess I could hack something similar up.
>>>
>>> Let me see if I can find it, and make sure it applies on the current
>>> tree. I'll send you one in a bit.
>> 
>> Much appreciated.
>> 
>> New question -- should a user of liburing be able to oops the kernel
>> under any circumstances (userspace access / NULL pointer dereference)?
>> I've started putting together a torture test application and hit ...
>> something ... almost right away .  If this isn't supposed to happen
>> I'll send more details off-list for a double check.
> 
> No, certainly not. Please do send me the details. I'm assuming this is
> on an unmodified kernel, some of the patches that have been flung around
> in this thread are definitely not generally sane, it's just that knowing
> the context of what is being used makes that fine for test purposes.

I need to verify the kernel is in fact functionally unmodified; I've been caught out by my multiple test trees before.  I just wanted to confirm that an oops shouldn't be possible before I check the kernel and try to put together a more reliable reproducer.

The location is exactly where I'd expect a problem with thread setup, and it's timing-dependent... sound familiar? ;)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
@ 2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 19:00 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 11:30 AM, Jens Axboe wrote:
> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>> I haven't had much success in getting the IPI path to work properly,
>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>> currently soliciting feedback on what else might be going wrong at
>>>> this point since I've already spent a couple of weeks on this and am
>>>> not sure how much more time I can spend before we just have to shut
>>>> io_uring down on ppc64 for the forseeable future.
>>>>
>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>> function, however no amount of memory barriers in the io_queue_async()
>>>> path (including in the kbuf recycling code) will fully resolve the
>>>> problem.
>>>>
>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>> amount of workers being created overall?  I know with some of the
>>>> other delay locations worker allocation was changing, from what I see
>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>> other thread to complete before moving on that might be valuable
>>>> information -- would also potentially tie in to the IPI path still
>>>> malfunctioning as the worker would immediately start executing.
>>>
>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>> delay before we either pass to an existing worker or create a new one.
>>>
>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>> list guaranteed, especially on weak memory model systems?  As I
>>>> understand it, different workers running on different cores could
>>>> potentially be interacting with the same kiocb request and the same
>>>> buffer list, and that does dovetail with the fact that punting to a
>>>> different I/O worker (usually on another core) seems to provoke the
>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>> functions without too much success -- it seemed to help somewhat, but
>>>> nowhere near complete resolution, and the buffers are used in a number
>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>> feedback on this concept before going down yet another rabbit hole...
>>>
>>> This relies on the fact that we grab the wq lock before inserting this
>>> work, and the unlocking will be a barrier. It's important to note that
>>> this isn't any different than from before io-wq was using native
>>> workers, the only difference is that it used to be kthreads before, and
>>> now it's native threads to the application. The kthreads did a bunch of
>>> work to assume the necessary identity to do the read or write operation
>>> (which is ultimately why that approach went away, as it was just
>>> inherently unsafe), whereas the native threads do not as they already
>>> have what they need.
>>>
>>> I had a patch that just punted to a kthread and did the necessary
>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>> that point. Within the existing code...
>>
>> Would you happen to have that patch still?  It would provide a
>> possible starting point for figuring out the exact difference.  If not
>> I guess I could hack something similar up.
> 
> Let me see if I can find it, and make sure it applies on the current
> tree. I'll send you one in a bit.

Wrote a new one. This one has two different ways it can work:

1) By default, it uses the native io workers still, but rather than add
it to a list of pending items, it creates a new worker for each work
item. This means all writes that would've gone to io-wq will now just
fork a native worker and perform the write in a blocking fashion.

2) The fallback path for the above is that we punt it to a kthread,
which does the mm dance. This is similar to what we did before the
native workers. The fallback path is only hit if the worker creation
fails, but you can make it happen every time by just uncommenting that
return 1 in io_rewrite_io_thread().

The interesting thing about approach 1 is that while it still uses the
native workers, it will not need to be processing task_work and hence
signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
forks off a new worker every time, which does the work, then exits.

First try it as-is and see if that reproduces the issue. If it does,
then try uncommenting that return 1 mentioned in #2 above.


diff --git a/io_uring/rw.c b/io_uring/rw.c
index 64390d4e20c1..77e408bdb169 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -968,6 +968,8 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static int io_rewrite_queue(struct io_kiocb *req);
+
 int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -1071,7 +1073,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return ret ? ret : -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 done:
 		ret = kiocb_done(req, ret2, issue_flags);
@@ -1082,7 +1086,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (!ret) {
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 		return ret;
 	}
@@ -1092,6 +1098,79 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+struct koffload {
+	struct work_struct work;
+	struct io_kiocb *req;
+	struct mm_struct *mm;
+};
+
+static void io_rewrite(struct work_struct *work)
+{
+	struct koffload *k = container_of(work, struct koffload, work);
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	kthread_use_mm(k->mm);
+	ret = io_write(k->req, issue_flags);
+	kthread_unuse_mm(k->mm);
+	mmput(k->mm);
+
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(k->req, issue_flags);
+	kfree(k);
+}
+
+static int io_write_io_thread(void *data)
+{
+	struct io_kiocb *req = data;
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	ret = io_write(req, issue_flags);
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(req, issue_flags);
+
+	do_exit(0);
+}
+
+static int io_rewrite_io_thread(struct io_kiocb *req)
+{
+	struct task_struct *tsk;
+
+	/*
+	 * Uncomment this one to ALWAYS punt to a kthread
+	 */
+	// return 1;
+
+	tsk = create_io_thread(io_write_io_thread, req, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wake_up_new_task(tsk);
+		return 0;
+	}
+
+	printk("%s: err=%ld\n", __FUNCTION__, PTR_ERR(tsk));
+	return 1;
+}
+
+static int io_rewrite_queue(struct io_kiocb *req)
+{
+	struct koffload *k;
+
+	if (!io_rewrite_io_thread(req))
+		return 0;
+
+	k = kmalloc(sizeof(*k), GFP_NOIO);
+	if (!k)
+		return 1;
+
+	INIT_WORK(&k->work, io_rewrite);
+	k->req = req;
+	mmget(current->mm);
+	k->mm = current->mm;
+	queue_work(system_wq, &k->work);
+	return 0;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
@ 2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
  2023-11-16  3:46                                                                                                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-16  3:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 1:00:53 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 11:30 AM, Jens Axboe wrote:
>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>> currently soliciting feedback on what else might be going wrong at
>>>>> this point since I've already spent a couple of weeks on this and am
>>>>> not sure how much more time I can spend before we just have to shut
>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>
>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>> problem.
>>>>>
>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>> amount of workers being created overall?  I know with some of the
>>>>> other delay locations worker allocation was changing, from what I see
>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>> other thread to complete before moving on that might be valuable
>>>>> information -- would also potentially tie in to the IPI path still
>>>>> malfunctioning as the worker would immediately start executing.
>>>>
>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>> delay before we either pass to an existing worker or create a new one.
>>>>
>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>> understand it, different workers running on different cores could
>>>>> potentially be interacting with the same kiocb request and the same
>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>
>>>> This relies on the fact that we grab the wq lock before inserting this
>>>> work, and the unlocking will be a barrier. It's important to note that
>>>> this isn't any different than from before io-wq was using native
>>>> workers, the only difference is that it used to be kthreads before, and
>>>> now it's native threads to the application. The kthreads did a bunch of
>>>> work to assume the necessary identity to do the read or write operation
>>>> (which is ultimately why that approach went away, as it was just
>>>> inherently unsafe), whereas the native threads do not as they already
>>>> have what they need.
>>>>
>>>> I had a patch that just punted to a kthread and did the necessary
>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>> that point. Within the existing code...
>>>
>>> Would you happen to have that patch still?  It would provide a
>>> possible starting point for figuring out the exact difference.  If not
>>> I guess I could hack something similar up.
>> 
>> Let me see if I can find it, and make sure it applies on the current
>> tree. I'll send you one in a bit.
> 
> Wrote a new one. This one has two different ways it can work:
> 
> 1) By default, it uses the native io workers still, but rather than add
> it to a list of pending items, it creates a new worker for each work
> item. This means all writes that would've gone to io-wq will now just
> fork a native worker and perform the write in a blocking fashion.
> 
> 2) The fallback path for the above is that we punt it to a kthread,
> which does the mm dance. This is similar to what we did before the
> native workers. The fallback path is only hit if the worker creation
> fails, but you can make it happen every time by just uncommenting that
> return 1 in io_rewrite_io_thread().
> 
> The interesting thing about approach 1 is that while it still uses the
> native workers, it will not need to be processing task_work and hence
> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
> forks off a new worker every time, which does the work, then exits.
> 
> First try it as-is and see if that reproduces the issue. If it does,
> then try uncommenting that return 1 mentioned in #2 above.

OK, so those two test cases worked, *but* I would have expected that, since when we go through the system workqueue (with its associated delays), it is functionally the same as inserting the udelay(1000) at the start of io_write().  I suspect the root issue predates the move to worker-managed I/O threads and that it was exposed simply by (inadvertently) shortening the delay between the initial io_write() call and the subsequent thread start.  Then, when the code started issuing IPIs to kick thread start even faster, we ended up racing with this other mysterious process almost every time instead of somewhat rarely, basically compounding the problem and making analysis next to impossible until the IPI was skipped via the TWA_SIGNAL_NO_IPI flag.

I'm unsure if this is related to the oops I sent off-list, but I suspect it very well might be, given that a delay in the same location works around the corruption.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
@ 2023-11-16  3:46                                                                                                                                               ` Jens Axboe
  2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-16  3:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 8:28 PM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 1:00:53 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 11:30 AM, Jens Axboe wrote:
>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>
>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>> problem.
>>>>>>
>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>> amount of workers being created overall?  I know with some of the
>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>> other thread to complete before moving on that might be valuable
>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>
>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>
>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>> understand it, different workers running on different cores could
>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>
>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>> this isn't any different than from before io-wq was using native
>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>> work to assume the necessary identity to do the read or write operation
>>>>> (which is ultimately why that approach went away, as it was just
>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>> have what they need.
>>>>>
>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>> that point. Within the existing code...
>>>>
>>>> Would you happen to have that patch still?  It would provide a
>>>> possible starting point for figuring out the exact difference.  If not
>>>> I guess I could hack something similar up.
>>>
>>> Let me see if I can find it, and make sure it applies on the current
>>> tree. I'll send you one in a bit.
>>
>> Wrote a new one. This one has two different ways it can work:
>>
>> 1) By default, it uses the native io workers still, but rather than add
>> it to a list of pending items, it creates a new worker for each work
>> item. This means all writes that would've gone to io-wq will now just
>> fork a native worker and perform the write in a blocking fashion.
>>
>> 2) The fallback path for the above is that we punt it to a kthread,
>> which does the mm dance. This is similar to what we did before the
>> native workers. The fallback path is only hit if the worker creation
>> fails, but you can make it happen every time by just uncommenting that
>> return 1 in io_rewrite_io_thread().
>>
>> The interesting thing about approach 1 is that while it still uses the
>> native workers, it will not need to be processing task_work and hence
>> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
>> forks off a new worker every time, which does the work, then exits.
>>
>> First try it as-is and see if that reproduces the issue. If it does,
>> then try uncommenting that return 1 mentioned in #2 above.
> 
> OK, so those two test cases worked, *but* I would have expected that,
> since when we go through the system workqueue (with its associated
> delays) it is functionally the same as inserting the udelay(1000) at

No, we don't! For case 1, we use the exact same mechanism as the stock
kernel; the only difference is that we'll always hand it to a new
worker. Will that slow down some writes? Certainly, because before that
patch we'd potentially hand it over to an existing worker immediately.
E.g. as soon as we did the spin_unlock() in io_wq_enqueue(), a new
worker could grab that same lock and start processing the work. We're
not talking a 1ms delay here, it's way shorter than that. This again
suggests it might be an ordering or barrier issue, but at the same time,
I've run with smp_mb() before and after insertion AND on retrieval of
the work item on the other side, and it triggers the issue even then.

On top of that, if we pre-create the workers, then we've already
established that the issue does not occur. With pre-created workers,
there's no extra delay between handing off the write and issuing it,
like we have with the test patch. In fact, it works the _exact_ same way
that the stock kernel does, except you don't have workers exiting or
being created. To me, this tells me that it cannot be a memory ordering
issue. If it was, we'd 100% see it for that case too, as we have all the
same handoff and execution as we did before. The only difference is that
we don't have an IPI for worker creation, and we don't have workers
exiting when they time out.
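
As a userspace analogy of the locking argument above (this is not the
io-wq code; the queue, names and structure here are invented purely for
illustration), the point is that a lock/unlock pair around insertion
already orders the payload stores for whichever thread later takes the
lock and removes the item, so sprinkling extra barriers on top would not
be expected to change the outcome:

/* Userspace analogy only, not io-wq. Shows that a mutex-protected insert
 * already publishes the work item's contents to the thread that later
 * takes the lock and removes it. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_ITEMS 1000

struct work_item {
	struct work_item *next;
	int payload;			/* written before the locked insert */
};

static struct work_item *head;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NR_ITEMS; i++) {
		struct work_item *w = malloc(sizeof(*w));

		w->payload = i;			/* plain store, no barrier */
		pthread_mutex_lock(&lock);	/* acquire */
		w->next = head;
		head = w;
		pthread_mutex_unlock(&lock);	/* release: publishes payload too */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int consumed = 0;

	pthread_create(&tid, NULL, producer, NULL);

	while (consumed < NR_ITEMS) {
		struct work_item *w;

		pthread_mutex_lock(&lock);
		w = head;
		if (w)
			head = w->next;
		pthread_mutex_unlock(&lock);
		if (!w)
			continue;
		/* The lock pairing guarantees the payload store is visible here. */
		if (w->payload < 0 || w->payload >= NR_ITEMS)
			fprintf(stderr, "bogus payload %d\n", w->payload);
		free(w);
		consumed++;
	}
	pthread_join(tid, NULL);
	printf("consumed %d items\n", consumed);
	return 0;
}

(Compile with gcc -pthread. The mutex release on the producer side pairs
with the acquire on the consumer side, so the consumer always observes
the payload stores that preceded the insert, which is the same property
the wq lock provides for io-wq work items.)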

> the start of io_write().  I suspect the root issue predates the move
> to worker-managed I/O threads and that it was exposed simply by
> (inadvertently) shortening the delay between the initial io_write()
> call and the subsequent thread start.  Then, when the code started

Maybe? So far very little is known for certain about this issue,
unfortunately, other than that it seems to be some very weird arch
interaction.

> issuing IPIs to kick thread start even faster we ended up racing with
> this other mysterious process almost every time instead of somewhat
> rarely, basically compounding the problem and making analysis next to
> impossible until the IPI was skipped via the TWA_SIGNAL_NO_IPI flag.
> 
> I'm unsure if this is related to the oops I sent off-list, but I
> suspect it very well might be given a delay in the same location works
> around the corruption.

I don't think it's related to that at all; that's a worker creation vs
shutdown issue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:46                                                                                                                                               ` Jens Axboe
@ 2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
  2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-16  3:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 9:46:01 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 8:28 PM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 1:00:53 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 11:30 AM, Jens Axboe wrote:
>>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>>
>>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>>> problem.
>>>>>>>
>>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>>> amount of workers being created overall?  I know with some of the
>>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>>> other thread to complete before moving on that might be valuable
>>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>>
>>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>>
>>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>>> understand it, different workers running on different cores could
>>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>>
>>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>>> this isn't any different than from before io-wq was using native
>>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>>> work to assume the necessary identity to do the read or write operation
>>>>>> (which is ultimately why that approach went away, as it was just
>>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>>> have what they need.
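
As a rough sketch of the ordering argument above, the hand-off can be modelled with a plain spinlock-protected list.  The demo_* names are placeholders rather than the real io-wq structures, but the point is the same: the unlock/lock pair already provides RELEASE/ACQUIRE ordering for the work item, even on weakly ordered hardware:

#include <linux/spinlock.h>
#include <linux/list.h>

struct demo_work {
	struct list_head list;
	/* payload would live here */
};

struct demo_wq {
	spinlock_t lock;
	struct list_head work_list;
};

/* Producer: everything written to *work before the unlock is visible
 * to whichever worker later takes the lock and dequeues it. */
static void demo_enqueue(struct demo_wq *wq, struct demo_work *work)
{
	spin_lock(&wq->lock);
	list_add_tail(&work->list, &wq->work_list);
	spin_unlock(&wq->lock);		/* RELEASE: publishes *work */
}

/* Consumer: the lock acquisition pairs with the producer's unlock. */
static struct demo_work *demo_dequeue(struct demo_wq *wq)
{
	struct demo_work *work = NULL;

	spin_lock(&wq->lock);		/* ACQUIRE */
	if (!list_empty(&wq->work_list)) {
		work = list_first_entry(&wq->work_list, struct demo_work, list);
		list_del(&work->list);
	}
	spin_unlock(&wq->lock);
	return work;
}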
>>>>>>
>>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>>> that point. Within the existing code...
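
The "mm dance" mentioned here looks roughly like the following.  This is a hypothetical helper, not the actual test patch: demo_punt, demo_req and demo_do_rw() are made-up names, while kthread_use_mm()/kthread_unuse_mm() are the real kernel interfaces for temporarily adopting another task's address space:

#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/completion.h>

struct demo_punt {
	struct mm_struct	*mm;	/* submitter's address space */
	struct demo_req		*req;	/* the deferred read/write */
	int			ret;
	struct completion	done;
};

/*
 * Hypothetical kthread body: the kthread has no user address space of
 * its own, so it adopts the submitter's mm before touching user buffers
 * and drops it again afterwards.
 */
static int demo_kthread_issue(void *data)
{
	struct demo_punt *p = data;

	kthread_use_mm(p->mm);
	p->ret = demo_do_rw(p->req);	/* perform the blocking operation */
	kthread_unuse_mm(p->mm);

	complete(&p->done);		/* wake the submitting task */
	return 0;
}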
>>>>>
>>>>> Would you happen to have that patch still?  It would provide a
>>>>> possible starting point for figuring out the exact difference.  If not
>>>>> I guess I could hack something similar up.
>>>>
>>>> Let me see if I can find it, and make sure it applies on the current
>>>> tree. I'll send you one in a bit.
>>>
>>> Wrote a new one. This one has two different ways it can work:
>>>
>>> 1) By default, it uses the native io workers still, but rather than add
>>> it to a list of pending items, it creates a new worker for each work
>>> item. This means all writes that would've gone to io-wq will now just
>>> fork a native worker and perform the write in a blocking fashion.
>>>
>>> 2) The fallback path for the above is that we punt it to a kthread,
>>> which does the mm dance. This is similar to what we did before the
>>> native workers. The fallback path is only hit if the worker creation
>>> fails, but you can make it happen every time by just uncommenting that
>>> return 1 in io_rewrite_io_thread().
>>>
>>> The interesting thing about approach 1 is that while it still uses the
>>> native workers, it will not need to be processing task_work and hence
>>> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
>>> forks off a new worker every time, which does the work, then exits.
>>>
>>> First try it as-is and see if that reproduces the issue. If it does,
>>> then try uncommenting that return 1 mentioned in #2 above.
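
For reference, the difference between the two notification modes mentioned above comes down to the mode argument passed to task_work_add(); the callback and helper below are toy placeholders, not io_uring's actual usage:

#include <linux/sched.h>
#include <linux/task_work.h>

static void demo_tw_cb(struct callback_head *cb)
{
	/* runs later, in the context of the target task */
}

/*
 * With TWA_SIGNAL the target task is kicked right away (which can mean
 * an IPI if it is running on another CPU); with TWA_SIGNAL_NO_IPI the
 * work is queued and the notification flag set, but the task only
 * notices it at its next natural transition point.
 */
static int demo_notify(struct task_struct *task, struct callback_head *cb,
		       bool kick_now)
{
	init_task_work(cb, demo_tw_cb);
	return task_work_add(task, cb,
			     kick_now ? TWA_SIGNAL : TWA_SIGNAL_NO_IPI);
}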
>> 
>> OK, so those two test cases worked, *but* I would have expected that,
>> since when we go through the system workqueue (with its associated
>> delays) it is functionally the same as inserting the udelay(1000) at
> 
> No we don't! For case 1, we use the exact same mechanism as the stock
> kernel, the only difference is that we'll always hand it to a new
> worker. Will that slow down some writes? Certainly. Because before that
> patch we'd potentially hand it over to an existing worker immediately.
> E.g. as soon as we did the spin_unlock() in io_wq_enqueue(), a new worker
> could grab that same lock and start processing the work. We're not
> talking a 1ms delay here; it's way shorter than that. This again suggests
> to me that it might be an ordering or barrier issue, but at the same time,
> I've run with smp_mb() before and after insertion AND on retrieval of
> the work item on the other side, and it triggers the issue even then.
> 
> On top of that, if we pre-create the workers, then we've already
> established that the issue does not occur. With pre-created workers,
> there's no extra delay between handing off the write and issuing it,
> like we have with the test patch. In fact, it works the _exact_ same way
> that the stock kernel does, except you don't have workers exiting or
> being created. To me, this tells me that it cannot be a memory ordering
> issue. If it was, we'd 100% see it for that case too, as we have all the
> same handoff and execution as we did before. The only difference is that
> we don't have an IPI for worker creation, and we don't have workers
> exiting when they time out.
<snip>
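
The barrier experiment described a few lines up amounts to roughly the following instrumentation, shown against the simplified demo queue sketched earlier in this thread rather than the real io_wq_enqueue() and worker loop.  That full barriers on both sides made no difference is part of the argument against a plain memory-ordering bug:

/* Builds on the demo_wq/demo_work sketch above; smp_mb() comes in via
 * the usual kernel barrier headers. */
static void demo_enqueue_paranoid(struct demo_wq *wq, struct demo_work *work)
{
	smp_mb();			/* before insertion */
	demo_enqueue(wq, work);		/* lock, list_add_tail, unlock */
	smp_mb();			/* after insertion */
}

static struct demo_work *demo_dequeue_paranoid(struct demo_wq *wq)
{
	struct demo_work *work;

	smp_mb();			/* before retrieval */
	work = demo_dequeue(wq);
	smp_mb();			/* after retrieval */
	return work;
}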

OK, fair enough, I've been at this long enough that I forgot about the precreated workers "fixing" things.  Let me step back a bit and meta-analyze everything we've learned; even though a lot of it doesn't make sense yet, maybe there's a pattern somewhere in all the noise.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
@ 2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-19  0:16 UTC (permalink / raw)
  To: Timothy Pearson
  Cc: Jens Axboe, regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 9:54:50 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> OK, fair enough, I've been at this long enough that I forgot about the
> precreated workers "fixing" things.  Let me step back a bit and meta-analyze
> everything we've learned; even though a lot of it doesn't make sense yet,
> maybe there's a pattern somewhere in all the noise.

To close the loop on this long thread, I finally found the root cause and submitted a correct patch here:

https://lore.kernel.org/linuxppc-dev/1105090647.48374193.1700351103830.JavaMail.zimbra@raptorengineeringinc.com/T/#u

500+ loops and counting on this boot alone.  I think we're good now.  It was a ppc64-specific bug in the end: corruption of a specific FPU register.  It was definitely provoked by specific signals being sent, but it also had a somewhat wider scope than just that, which made debugging quite difficult.

^ permalink raw reply	[flat|nested] 95+ messages in thread
