* Regression in io_uring, leading to data corruption
@ 2023-11-07 16:34 Timothy Pearson
  2023-11-07 16:49 ` Jens Axboe
  2023-11-07 21:22 ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 16:34 UTC (permalink / raw)
  To: regressions, Jens Axboe, Pavel Begunkov

I have spent considerable effort tracking down a bug that appears to be present in the io_uring workqueue.  As I have not yet been able to isolate the exact cause, I would like to solicit ideas from the developers / maintainers of the io_uring system.  This regression persists into the latest kernel GIT head, and is only reliably reproducible under fairly exacting conditions.

In commit 685fe7fe the workqueue manager thread was removed and replaced with code that allows the workqueues to manage their own workers.  This has the unfortunate side effect of exposing what I believe to be an existing timing-dependent race condition somewhere else within the kernel.  On a ppc64el host, I can reliably trigger data corruption on what I believe to be writes by running the following mysql mtr sequence:

./mtr encryption.innodb-discard-import --repeat=100 --force

This results in corruption of the data being written to disk -- reverting 685fe7fe resolves the issue by (I believe) masking it through changes in workqueue inter-thread timing.

I can make the corruption disappear by adding a 1ms busy wait delay into io_wqe_dec_running().  This appears to alter the timing of something in the io_uring system just enough to make the (presumed) data race disappear.  KASAN and KCSAN do not show any issues, nor does the lock debugger, yet a corruption problem that disappears with a delay is indicative of a race somewhere.  The delay primarily impacts how long the IRQ lock is held; if the delay is moved outside of the IRQ-locked section, the corruption returns.
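
For reference, the instrumentation is essentially the following (just a sketch to show the shape of the hack -- the real change sits inside io_wqe_dec_running(), and the lock/field names there vary by kernel version):

#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/atomic.h>

static DEFINE_RAW_SPINLOCK(demo_lock);

/* Illustration only, not the actual io_wqe_dec_running() */
static void demo_dec_running(atomic_t *nr_running)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&demo_lock, flags);
	mdelay(1);		/* ~1ms busy wait: corruption disappears */
	atomic_dec(nr_running);	/* real accounting / worker creation here */
	raw_spin_unlock_irqrestore(&demo_lock, flags);

	/* moving the mdelay() after the unlock brings the corruption back */
}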

I have already tried adding memory barriers etc. to the code paths in question, with no effect.  The exact same issue persists on the latest kernel versions.

Thoughts welcome -- this is a serious issue causing data corruption on production systems.

Thank you!


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
@ 2023-11-07 16:49 ` Jens Axboe
  2023-11-07 16:57   ` Timothy Pearson
  2023-11-07 21:22 ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 16:49 UTC (permalink / raw)
  To: Timothy Pearson, regressions, Pavel Begunkov

On 11/7/23 9:34 AM, Timothy Pearson wrote:
> I have spent some considerable effort tracking down a bug that appears
> to be present in the io_uring workqueue.  As I have not yet been able
> to isolate the exact cause, I would like to solicit ideas from the
> developers / maintainers of the io_uring system.  This regression
> persists into the latest kernel GIT head, and is only reliably
> reproduceable under fairly exacting conditions.
> 
> In GIT hash 685fe7fe the workqueue manager thread was removed and
> replaced with code that allows the workqueues to manage their own
> workers.  This has the unfortunate side effect of exposing what I
> believe to be an existing timing-dependent race condition somewhere
> else within the kernel.  On a ppc64el host, I can reliably trigger
> data corruption on what I believe to be writes by running the
> following mysql mtr sequence:
> 
> ./mtr encryption.innodb-discard-import --repeat=100 --force
> 
> This results in corruption of the data being written to disk --
> reverting 685fe7fe resolves the issue by (I believe) masking it
> through changes in workqueue inter-thread timing.
> 
> I can make the corruption disappear by adding a 1ms busy wait delay
> into io_wqe_dec_running().  This appears to alter the timing of
> something in the io_uring system just enough to make the (presumed)
> data race disappear.  KASAN and KCSAN do not show any issues, nor does
> the lock debugger, yet a corruption problem that disappears with a
> delay is indicative of a race somewhere.  The delay primary impacts
> how long the IRQ lock is held, if the delay is moved outside of the
> IRQ locked section the corruption returns.
> 
> I have already tried adding memory barriers etc. to the code paths in
> question, with no effect.  The exact same issue persists on the latest
> kernel versions.
> 
> Thoughts welcome -- this is a serious issue causing data corruption on
> production systems.

I looked into this for quite a while back in March; see my initial
postings on it here:

https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/

It unfortunately never got anywhere, and as far as I can tell, this is
most likely a page cache or ordering issue on the ppc side. I no longer
have hardware to test with, and don't really have a huge inclination to
dive into this again, as it's hugely time consuming and doesn't seem to
be an io_uring issue to begin with, but I'd be happy to help out with
this.

Back then I looked into getting some ppc hardware to test with for
this very reason, and even reached out to various manufacturers to see
if they would be able to lend/give me some. Didn't pan out, and ended
up using a university vm for it.

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:49 ` Jens Axboe
@ 2023-11-07 16:57   ` Timothy Pearson
  2023-11-07 17:14     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 16:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 10:49:34 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>> I have spent some considerable effort tracking down a bug that appears
>> to be present in the io_uring workqueue.  As I have not yet been able
>> to isolate the exact cause, I would like to solicit ideas from the
>> developers / maintainers of the io_uring system.  This regression
>> persists into the latest kernel GIT head, and is only reliably
>> reproduceable under fairly exacting conditions.
>> 
>> In GIT hash 685fe7fe the workqueue manager thread was removed and
>> replaced with code that allows the workqueues to manage their own
>> workers.  This has the unfortunate side effect of exposing what I
>> believe to be an existing timing-dependent race condition somewhere
>> else within the kernel.  On a ppc64el host, I can reliably trigger
>> data corruption on what I believe to be writes by running the
>> following mysql mtr sequence:
>> 
>> ./mtr encryption.innodb-discard-import --repeat=100 --force
>> 
>> This results in corruption of the data being written to disk --
>> reverting 685fe7fe resolves the issue by (I believe) masking it
>> through changes in workqueue inter-thread timing.
>> 
>> I can make the corruption disappear by adding a 1ms busy wait delay
>> into io_wqe_dec_running().  This appears to alter the timing of
>> something in the io_uring system just enough to make the (presumed)
>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>> the lock debugger, yet a corruption problem that disappears with a
>> delay is indicative of a race somewhere.  The delay primary impacts
>> how long the IRQ lock is held, if the delay is moved outside of the
>> IRQ locked section the corruption returns.
>> 
>> I have already tried adding memory barriers etc. to the code paths in
>> question, with no effect.  The exact same issue persists on the latest
>> kernel versions.
>> 
>> Thoughts welcome -- this is a serious issue causing data corruption on
>> production systems.
> 
> I looked into this for quite a while back in March, see my initial
> postings on it here:
> 
> https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/
> 
> it unfortunately never got anywhere, and as far as I can tell, this is
> most likely a page cache or ordering issue on the ppc side. I no longer
> have hardware to test with, and not really a huge inclination to dive
> into this again as it's hugely time consuming and doesn't seem to be an
> io_uring issue to begin with, but I'd be happy to help out with this.
> 
> Back then I looked into getting some ppc hardware to test with for
> this very reason, and even reached out to various manufacturers to see
> if they would be able to lend/give me some. Didn't pan out, and ended
> up using a university vm for it.
> 
> --
> Jens Axboe

Understood.  I think between the pinning and the findings above, plus the fact that (IIRC) this seemed to disappear in SMT1 mode, I may have a better idea of where to look.  The pinning "fixing" things is something I wasn't aware of, and it will significantly reduce debug effort on this end -- thanks for the pointer!

In the future, Raptor is more than willing to offer bare metal access to ppc64el test machines at no cost.  I was unaware of the need, so I couldn't respond earlier.

Thanks again!


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:57   ` Timothy Pearson
@ 2023-11-07 17:14     ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 17:14 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 9:57 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 10:49:34 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>> I have spent some considerable effort tracking down a bug that appears
>>> to be present in the io_uring workqueue.  As I have not yet been able
>>> to isolate the exact cause, I would like to solicit ideas from the
>>> developers / maintainers of the io_uring system.  This regression
>>> persists into the latest kernel GIT head, and is only reliably
>>> reproduceable under fairly exacting conditions.
>>>
>>> In GIT hash 685fe7fe the workqueue manager thread was removed and
>>> replaced with code that allows the workqueues to manage their own
>>> workers.  This has the unfortunate side effect of exposing what I
>>> believe to be an existing timing-dependent race condition somewhere
>>> else within the kernel.  On a ppc64el host, I can reliably trigger
>>> data corruption on what I believe to be writes by running the
>>> following mysql mtr sequence:
>>>
>>> ./mtr encryption.innodb-discard-import --repeat=100 --force
>>>
>>> This results in corruption of the data being written to disk --
>>> reverting 685fe7fe resolves the issue by (I believe) masking it
>>> through changes in workqueue inter-thread timing.
>>>
>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>> into io_wqe_dec_running().  This appears to alter the timing of
>>> something in the io_uring system just enough to make the (presumed)
>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>> the lock debugger, yet a corruption problem that disappears with a
>>> delay is indicative of a race somewhere.  The delay primary impacts
>>> how long the IRQ lock is held, if the delay is moved outside of the
>>> IRQ locked section the corruption returns.
>>>
>>> I have already tried adding memory barriers etc. to the code paths in
>>> question, with no effect.  The exact same issue persists on the latest
>>> kernel versions.
>>>
>>> Thoughts welcome -- this is a serious issue causing data corruption on
>>> production systems.
>>
>> I looked into this for quite a while back in March, see my initial
>> postings on it here:
>>
>> https://lore.kernel.org/all/2b015a34-220e-674e-7301-2cf17ef45ed9@kernel.dk/
>>
>> it unfortunately never got anywhere, and as far as I can tell, this is
>> most likely a page cache or ordering issue on the ppc side. I no longer
>> have hardware to test with, and not really a huge inclination to dive
>> into this again as it's hugely time consuming and doesn't seem to be an
>> io_uring issue to begin with, but I'd be happy to help out with this.
>>
>> Back then I looked into getting some ppc hardware to test with for
>> this very reason, and even reached out to various manufacturers to see
>> if they would be able to lend/give me some. Didn't pan out, and ended
>> up using a university vm for it.
>>
>> --
>> Jens Axboe
> 
> Understood.  I think between the pinning and the findings above, plus
> the fact that (IIRC) this seemed to disappear in SMT1 mode, I may have
> some better idea of where to look.  The pinning "fixing" things is
> something I wasn't aware of and will significantly reduce debug effort
> on this end, thanks for the pointer!

It's been some months since then so I don't recall all the details, but
at least there are some emails that cover some of it. I too tried a
bunch of things similar to what you looked at, but even a full hard
barrier before inserting the work item, and one before retrieving it on
the other end, didn't do anything. Hope you'll have better luck. And
like I said, I'm happy to help out, if I can.

> In the future, Raptor is more than willing to offer bare metal access
> to test machines for ppc64el at no cost.  I was unaware of the need so
> couldn't respond.

Good to know!

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
  2023-11-07 16:49 ` Jens Axboe
@ 2023-11-07 21:22 ` Jens Axboe
  2023-11-07 21:39   ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 21:22 UTC (permalink / raw)
  To: Timothy Pearson, regressions, Pavel Begunkov

On 11/7/23 9:34 AM, Timothy Pearson wrote:
> I can make the corruption disappear by adding a 1ms busy wait delay
> into io_wqe_dec_running().  This appears to alter the timing of
> something in the io_uring system just enough to make the (presumed)
> data race disappear.  KASAN and KCSAN do not show any issues, nor does
> the lock debugger, yet a corruption problem that disappears with a
> delay is indicative of a race somewhere.  The delay primary impacts
> how long the IRQ lock is held, if the delay is moved outside of the
> IRQ locked section the corruption returns.

This is interesting... This forces the current task to schedule out
first, rather than potentially create a new worker.

If you put that 1 ms sleep at the top of io_wq_worker(), before the
loop, does it fix it as well? With the previous one removed, of course.

Does it do anything if you just put isync() in there instead of the
sleep in io_wqe_dec_running()?
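
Roughly what I mean, as untested sketches (exact placement depends on
your tree):

	/* 1) at the top of io_wq_worker(), before the main loop */
	msleep(1);

	/* 2) in io_wqe_dec_running(), in place of the busy wait; on ppc
	 *    isync() should boil down to something like this
	 */
	__asm__ __volatile__("isync" : : : "memory");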

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:22 ` Jens Axboe
@ 2023-11-07 21:39   ` Timothy Pearson
  2023-11-07 21:46     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 21:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 3:22:16 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>> I can make the corruption disappear by adding a 1ms busy wait delay
>> into io_wqe_dec_running().  This appears to alter the timing of
>> something in the io_uring system just enough to make the (presumed)
>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>> the lock debugger, yet a corruption problem that disappears with a
>> delay is indicative of a race somewhere.  The delay primary impacts
>> how long the IRQ lock is held, if the delay is moved outside of the
>> IRQ locked section the corruption returns.
> 
> This is interesting... This forces the current task to schedule out
> first, rather than potentially create a new worker.

That makes sense -- when debugging, everything seems to center around worker creation and the exact timing / context of that operation.

> If you put that 1 ms sleep at the top of io_wq_worker(), before the
> loop, does it fix it as well? With the previous one removed, of course.

No, it does not.

> Does it do anything if you just put isync() in there instead of the
> sleep in io_wqe_dec_running()?

Unfortunately no, that doesn't fix things either.

My gut says it's something racing with either the I/O worker creation or wake (trying to schedule work onto a half-created or still-sleeping worker?), but so far I haven't been able to isolate the root cause.


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:39   ` Timothy Pearson
@ 2023-11-07 21:46     ` Jens Axboe
  2023-11-07 22:07       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 21:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 2:39 PM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>> into io_wqe_dec_running().  This appears to alter the timing of
>>> something in the io_uring system just enough to make the (presumed)
>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>> the lock debugger, yet a corruption problem that disappears with a
>>> delay is indicative of a race somewhere.  The delay primary impacts
>>> how long the IRQ lock is held, if the delay is moved outside of the
>>> IRQ locked section the corruption returns.
>>
>> This is interesting... This forces the current task to schedule out
>> first, rather than potentially create a new worker.
> 
> That makes sense -- when debugging, everything seems to center around
> worker creation and the exact timing / context of that operation.
> 
>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>> loop, does it fix it as well? With the previous one removed, of course.
> 
> No, it does not.

OK, so it's not related to delaying work handling for a new worker. What
if you add a 1ms sleep before calling wq->do_work() in
io_worker_handle_work()? For both this and your previous test, might be
worth shrinking this 1ms to something way smaller, as it's hard to know
if it's the fact that we're scheduling first that's important, or if
it's just a timing delay and hence now it's much harder to hit this
race.

>> Does it do anything if you just put isync() in there instead of the
>> sleep in io_wqe_dec_running()?
> 
> Unfortunately no, that doesn't fix things either.
> 
> My gut says it's something racing with either the I/O worker creation
> or wake (trying to schedule work onto a half-created or still sleeping
> worker?) but so far I haven't been able to isolate the root cause.

One of my initial suspicions was improper handling of signals, since
creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
But I'm not sure how likely that is, as that could/would happen normally
too if signals are being used, and would trigger on more than just
powerpc obviously.
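
For reference, the creation path is roughly the following (simplified
and from memory -- details in io_uring/io-wq.c differ a bit between
kernel versions):

	/* io_wqe_dec_running(): the last running worker is blocking and
	 * there is still queued work, so ask for a new worker
	 */
	atomic_inc(&wq->worker_refs);
	io_queue_worker_create(worker, acct, create_worker_cb);

	/* io_queue_worker_create() then queues task_work on the original
	 * task, delivered via the signal path
	 */
	init_task_work(&worker->create_work, create_worker_cb);
	task_work_add(wq->task, &worker->create_work, TWA_SIGNAL);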

The workers are good to go as soon as they are created; no setup really
needs to happen. They do a bit at the top of io_wq_worker(), but that's
before they can go and grab a work item. The worker itself has to grab
it; nothing is being handed to it.

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 21:46     ` Jens Axboe
@ 2023-11-07 22:07       ` Timothy Pearson
  2023-11-07 22:16         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 22:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 3:46:40 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 2:39 PM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions"
>>> <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>>> into io_wqe_dec_running().  This appears to alter the timing of
>>>> something in the io_uring system just enough to make the (presumed)
>>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>>> the lock debugger, yet a corruption problem that disappears with a
>>>> delay is indicative of a race somewhere.  The delay primary impacts
>>>> how long the IRQ lock is held, if the delay is moved outside of the
>>>> IRQ locked section the corruption returns.
>>>
>>> This is interesting... This forces the current task to schedule out
>>> first, rather than potentially create a new worker.
>> 
>> That makes sense -- when debugging, everything seems to center around
>> worker creation and the exact timing / context of that operation.
>> 
>>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>>> loop, does it fix it as well? With the previous one removed, of course.
>> 
>> No, it does not.
> 
> OK, so it's not related to delaying work handling for a new worker. What
> if you add a 1ms sleep before calling wq->do_work() in
> io_worker_handle_work()? For both this and your previous test, might be
> worth shrinking this 1ms to something way smaller, as it's hard to know
> if it's the fact that we're scheduling first that's important, or if
> it's just a timing delay and hence now it's much harder to hit this
> race.

Nope, no joy.  I had already tried shrinking the delay and the issues came back.  Ditto for switching to usleep; allowing the process to leave the core (or something else to schedule on the core) seems to bring the corruption back despite the time delay remaining in place.

>>> Does it do anything if you just put isync() in there instead of the
>>> sleep in io_wqe_dec_running()?
>> 
>> Unfortunately no, that doesn't fix things either.
>> 
>> My gut says it's something racing with either the I/O worker creation
>> or wake (trying to schedule work onto a half-created or still sleeping
>> worker?) but so far I haven't been able to isolate the root cause.
> 
> One of my initial suspicions was improper handling of signals, since
> creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
> But I'm not sure how likely that is, as that could/would happen normally
> too if signals are being used, and would trigger on more than just
> powerpc obviously.

Interestingly enough, that's where my current investigation is leading as well.  After instrumenting and re-instrumenting the codebase far more times than I'd like to admit, I've noticed that the workers don't time out on kernel builds that show the corruption; instead, they are directly terminated via signal (SIGKILL).  On kernel builds with the delay, the workers time out and self-terminate.  I'm still trying to parse out what the exact difference between these two mechanisms would be and how it plays into the corruption, but at least it's a start...


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:07       ` Timothy Pearson
@ 2023-11-07 22:16         ` Jens Axboe
  2023-11-07 22:29           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 22:16 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 3:07 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 3:46:40 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 2:39 PM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "regressions"
>>>> <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 3:22:16 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 9:34 AM, Timothy Pearson wrote:
>>>>> I can make the corruption disappear by adding a 1ms busy wait delay
>>>>> into io_wqe_dec_running().  This appears to alter the timing of
>>>>> something in the io_uring system just enough to make the (presumed)
>>>>> data race disappear.  KASAN and KCSAN do not show any issues, nor does
>>>>> the lock debugger, yet a corruption problem that disappears with a
>>>>> delay is indicative of a race somewhere.  The delay primary impacts
>>>>> how long the IRQ lock is held, if the delay is moved outside of the
>>>>> IRQ locked section the corruption returns.
>>>>
>>>> This is interesting... This forces the current task to schedule out
>>>> first, rather than potentially create a new worker.
>>>
>>> That makes sense -- when debugging, everything seems to center around
>>> worker creation and the exact timing / context of that operation.
>>>
>>>> If you put that 1 ms sleep at the top of io_wq_worker(), before the
>>>> loop, does it fix it as well? With the previous one removed, of course.
>>>
>>> No, it does not.
>>
>> OK, so it's not related to delaying work handling for a new worker. What
>> if you add a 1ms sleep before calling wq->do_work() in
>> io_worker_handle_work()? For both this and your previous test, might be
>> worth shrinking this 1ms to something way smaller, as it's hard to know
>> if it's the fact that we're scheduling first that's important, or if
>> it's just a timing delay and hence now it's much harder to hit this
>> race.
> 
> Nope, no joy.  I had already tried shrinking the delay and the issues
> came back.  Ditto for switching to usleep; allowing the process to
> leave the core (or something else to schedule on the core) seems to
> bring the corruption back despite the time delay remaining in place.

Gotcha

>>>> Does it do anything if you just put isync() in there instead of the
>>>> sleep in io_wqe_dec_running()?
>>>
>>> Unfortunately no, that doesn't fix things either.
>>>
>>> My gut says it's something racing with either the I/O worker creation
>>> or wake (trying to schedule work onto a half-created or still sleeping
>>> worker?) but so far I haven't been able to isolate the root cause.
>>
>> One of my initial suspicions was improper handling of signals, since
>> creating a new worker will happen via task_work_add(..., TWA_SIGNAL).
>> But I'm not sure how likely that is, as that could/would happen normally
>> too if signals are being used, and would trigger on more than just
>> powerpc obviously.
> 
> Interestingly enough that's where my current investigation is leading
> as well.  After instrumenting and re-instrumenting the codebase far
> more times than I'd like to admit, I've noticed that the workers don't
> time out on kernel builds that show the corruption, they are directly
> terminated via signal (SIGKILL).  On kernel builds with the delay, the
> workers time out and self-terminate.  I'm still trying to parse out
> what the exact difference between these two mechanisms would be and
> how it plays into the corruption, but at least it's a start...

Maybe poke at how they are exiting - you say timeout, so they've been
idle for a while and then go away? This would then cause worker creation
again later on if, say, we have 1 worker left and it goes to sleep. So
the timeout itself may not tell you much, outside of then causing that
other condition to happen. You could even try and shrink the timeout to
HZ / 10 or something like that to make it more likely to happen.
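
Something like this, assuming the idle timeout is still the
WORKER_IDLE_TIMEOUT constant in io_uring/io-wq.c:

	-#define WORKER_IDLE_TIMEOUT	(5 * HZ)
	+#define WORKER_IDLE_TIMEOUT	(HZ / 10)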

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:16         ` Jens Axboe
@ 2023-11-07 22:29           ` Timothy Pearson
  2023-11-07 22:44             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 22:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 4:16:51 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>> Interestingly enough that's where my current investigation is leading
>> as well.  After instrumenting and re-instrumenting the codebase far
>> more times than I'd like to admit, I've noticed that the workers don't
>> time out on kernel builds that show the corruption, they are directly
>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>> workers time out and self-terminate.  I'm still trying to parse out
>> what the exact difference between these two mechanisms would be and
>> how it plays into the corruption, but at least it's a start...
> 
> Maybe poke at how they are exiting - you say timeout, so they've been
> idle for a while and then go away? This would then cause worker creation
> again later on if, say, we have 1 worker left and it goes to sleep. So
> the timeout itself may not tell you much, outside of then causing that
> other condition to happen. You could even try and shrink the timeout to
> HZ / 10 or something like that to make it more likely to happen.

Agreed.  As of right now I can confirm that with the delay in place (no corruption) the workers are exiting on their own, with no signals and no IO_EXIT bit being set.  When I remove the delay (reintroducing the corruption) I see signal 9 being sent to the workers, and a mix of IO_EXIT being set and not being set.

Ignoring the signal 9 does not fix the corruption, which makes me wonder more about IO_EXIT and whether things are not fully committed / properly torn down when the worker thread terminates.  This also dovetails nicely with the fact that the observed write corruption always seems to be in the latter portion of the page, never at the beginning of the page, which again points to rapid / unclean termination of the writer process.

Will keep digging...


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:29           ` Timothy Pearson
@ 2023-11-07 22:44             ` Jens Axboe
  2023-11-07 23:12               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 22:44 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 3:29 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>> Interestingly enough that's where my current investigation is leading
>>> as well.  After instrumenting and re-instrumenting the codebase far
>>> more times than I'd like to admit, I've noticed that the workers don't
>>> time out on kernel builds that show the corruption, they are directly
>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>> workers time out and self-terminate.  I'm still trying to parse out
>>> what the exact difference between these two mechanisms would be and
>>> how it plays into the corruption, but at least it's a start...
>>
>> Maybe poke at how they are exiting - you say timeout, so they've been
>> idle for a while and then go away? This would then cause worker creation
>> again later on if, say, we have 1 worker left and it goes to sleep. So
>> the timeout itself may not tell you much, outside of then causing that
>> other condition to happen. You could even try and shrink the timeout to
>> HZ / 10 or something like that to make it more likely to happen.
> 
> Agreed.  As of right now I can confirm that with the delay in place
> (no corruption) the workers are exiting on their own, no signals and
> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
> corruption) I see signal 9 being sent to the workers, and a mix of
> IO_EXIT being set and not being set.
> 
> Ignoring the signal 9 does not fix the corruption, which makes me
> wonder more about IO_EXIT and whether things are not fully committed /
> properly torn down when the worker thread terminates.  This also
> dovetails nicely with the fact that the observed write corruption
> always seems to be in the latter portions of the page, never at the
> beginning of the page, also indicating rapid / unclean termination of
> the writer process.
> 
> Will keep digging...

This is useful. If the workers are exiting, they will try and process
work that is still pending. And they obviously do, or the process would
hang on exit or ring exit. But they'll also cancel said work, which
obviously did not happen with the old kthread scheme, as there was no
way to do that -- you'd just wait for it. Hence maybe what's happening
here is that mtr/mysql/mariadb isn't properly waiting for pending
writes to finish? It's just assuming that previously submitted writes
will finish if the task is killed?

What page size are you using?

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 22:44             ` Jens Axboe
@ 2023-11-07 23:12               ` Timothy Pearson
  2023-11-07 23:16                 ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 23:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 4:44:56 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>> Interestingly enough that's where my current investigation is leading
>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>> time out on kernel builds that show the corruption, they are directly
>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>> what the exact difference between these two mechanisms would be and
>>>> how it plays into the corruption, but at least it's a start...
>>>
>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>> idle for a while and then go away? This would then cause worker creation
>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>> the timeout itself may not tell you much, outside of then causing that
>>> other condition to happen. You could even try and shrink the timeout to
>>> HZ / 10 or something like that to make it more likely to happen.
>> 
>> Agreed.  As of right now I can confirm that with the delay in place
>> (no corruption) the workers are exiting on their own, no signals and
>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>> corruption) I see signal 9 being sent to the workers, and a mix of
>> IO_EXIT being set and not being set.
>> 
>> Ignoring the signal 9 does not fix the corruption, which makes me
>> wonder more about IO_EXIT and whether things are not fully committed /
>> properly torn down when the worker thread terminates.  This also
>> dovetails nicely with the fact that the observed write corruption
>> always seems to be in the latter portions of the page, never at the
>> beginning of the page, also indicating rapid / unclean termination of
>> the writer process.
>> 
>> Will keep digging...
> 
> This is useful. If the workers are exiting, they will try and process
> work that is still pending. And it obviously does, or the process would
> hang on exit or ring exit. But it'll also cancel said work, which
> obviously did not happen for the old kthread scheme, as there was no way
> to do that. So you'd just wait for it. Hence maybe what's happening here
> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
> finish? It's just assuming that previously submitted writes will finish
> if the task is killed?

It's entirely possible.  What is the correct way to wait for pending writes via the liburing API?  MariaDB uses liburing under the hood, and if I know the call(s) to look for I can make sure it's properly handling task exit.

> What page size are you using?

I've tested on both 4k and 64k page kernels with no difference.  MariaDB is using a 16k page size on disk, and when the corruption happens it's apparently only writing part of the 16k page.


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:12               ` Timothy Pearson
@ 2023-11-07 23:16                 ` Jens Axboe
  2023-11-07 23:34                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 23:16 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 4:12 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>> Interestingly enough that's where my current investigation is leading
>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>> time out on kernel builds that show the corruption, they are directly
>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>> what the exact difference between these two mechanisms would be and
>>>>> how it plays into the corruption, but at least it's a start...
>>>>
>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>> idle for a while and then go away? This would then cause worker creation
>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>> the timeout itself may not tell you much, outside of then causing that
>>>> other condition to happen. You could even try and shrink the timeout to
>>>> HZ / 10 or something like that to make it more likely to happen.
>>>
>>> Agreed.  As of right now I can confirm that with the delay in place
>>> (no corruption) the workers are exiting on their own, no signals and
>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>> IO_EXIT being set and not being set.
>>>
>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>> wonder more about IO_EXIT and whether things are not fully committed /
>>> properly torn down when the worker thread terminates.  This also
>>> dovetails nicely with the fact that the observed write corruption
>>> always seems to be in the latter portions of the page, never at the
>>> beginning of the page, also indicating rapid / unclean termination of
>>> the writer process.
>>>
>>> Will keep digging...
>>
>> This is useful. If the workers are exiting, they will try and process
>> work that is still pending. And it obviously does, or the process would
>> hang on exit or ring exit. But it'll also cancel said work, which
>> obviously did not happen for the old kthread scheme, as there was no way
>> to do that. So you'd just wait for it. Hence maybe what's happening here
>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>> finish? It's just assuming that previously submitted writes will finish
>> if the task is killed?
> 
> It's entirely possible.  What is the correct way to wait for pending
> writes via the liburing API?  MariaDB uses liburing under the hood,
> and if I know the call(s) to look for I can make sure it's properly
> handling task exit.

I'd expect the task to wait for and verify the results of pending
requests before doing io_uring_queue_exit(). But I'm not familiar with
the code base; maybe the task just exits? Closing the io_uring fd from
an exiting task would do the same.
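
In liburing terms I'd expect roughly something like this on the
application side before teardown (a sketch only, not mariadb's actual
code; 'inflight' is assumed to count SQEs submitted but not yet reaped):

#include <liburing.h>

static void drain_and_exit(struct io_uring *ring, unsigned inflight)
{
	while (inflight) {
		struct io_uring_cqe *cqe;

		if (io_uring_wait_cqe(ring, &cqe) < 0)
			break;
		if (cqe->res < 0) {
			/* short or failed write -- must be handled, not
			 * silently dropped
			 */
		}
		io_uring_cqe_seen(ring, cqe);
		inflight--;
	}
	io_uring_queue_exit(ring);
}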

I tried the below patch when running mtr here, but I don't see any of
them trigger. You can try it; that'll tell you if we ever run
cancelations on pending io-wq work. If it triggers, I can try and cook
up something that would figure out where that is coming from.

But since you said you're seeing exits on signal 9, that would seem to
indicate that someone ran SIGKILL with potentially pending IO.

>> What page size are you using?
> 
> I've tested on both 4k and 64k page kernels with no difference.
> MariaDB is using a 16k page size on disk, and when the corruption
> happens it's apparently only writing part of the 16k page.

OK.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..2ee18905d57e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 
+	WARN_ON_ONCE(do_kill);
+
 	do {
 		struct io_wq_work *work;
 
@@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
+		WARN_ON_ONCE(1);
 		work->flags |= IO_WQ_WORK_CANCEL;
 		wq->do_work(work);
 		work = wq->free_work(work);
@@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	 */
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 	    (work->flags & IO_WQ_WORK_CANCEL)) {
+		WARN_ON_ONCE(1);
 		io_run_cancel(work, wq);
 		return;
 	}
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ed254076c723..c0bd35e5429a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
 
 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
 	if (work->flags & IO_WQ_WORK_CANCEL) {
+		WARN_ON_ONCE(1);
 fail:
+		WARN_ON_ONCE(1);
 		io_req_task_queue_fail(req, err);
 		return;
 	}

-- 
Jens Axboe



* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:16                 ` Jens Axboe
@ 2023-11-07 23:34                   ` Timothy Pearson
  2023-11-07 23:52                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-07 23:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 5:16:24 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 4:12 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>>> Interestingly enough that's where my current investigation is leading
>>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>>> time out on kernel builds that show the corruption, they are directly
>>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>>> what the exact difference between these two mechanisms would be and
>>>>>> how it plays into the corruption, but at least it's a start...
>>>>>
>>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>>> idle for a while and then go away? This would then cause worker creation
>>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>>> the timeout itself may not tell you much, outside of then causing that
>>>>> other condition to happen. You could even try and shrink the timeout to
>>>>> HZ / 10 or something like that to make it more likely to happen.
>>>>
>>>> Agreed.  As of right now I can confirm that with the delay in place
>>>> (no corruption) the workers are exiting on their own, no signals and
>>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>>> IO_EXIT being set and not being set.
>>>>
>>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>>> wonder more about IO_EXIT and whether things are not fully committed /
>>>> properly torn down when the worker thread terminates.  This also
>>>> dovetails nicely with the fact that the observed write corruption
>>>> always seems to be in the latter portions of the page, never at the
>>>> beginning of the page, also indicating rapid / unclean termination of
>>>> the writer process.
>>>>
>>>> Will keep digging...
>>>
>>> This is useful. If the workers are exiting, they will try and process
>>> work that is still pending. And it obviously does, or the process would
>>> hang on exit or ring exit. But it'll also cancel said work, which
>>> obviously did not happen for the old kthread scheme, as there was no way
>>> to do that. So you'd just wait for it. Hence maybe what's happening here
>>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>>> finish? It's just assuming that previously submitted writes will finish
>>> if the task is killed?
>> 
>> It's entirely possible.  What is the correct way to wait for pending
>> writes via the liburing API?  MariaDB uses liburing under the hood,
>> and if I know the call(s) to look for I can make sure it's properly
>> handling task exit.
> 
> I'd expect the task to wait and verify the results of pending requests
> before doing io_uring_queue_exit(). But I'm not familiar with the code
> base, maybe the task just exits? Closing the io_uring fd from an exiting
> task would do the same.
> 
> I tried the below patch when running mtr here, but don't see any of them
> trigger. You can try that, that'll tell you if we ever run cancelations
> on pending io-wq work. If that triggers, I can try and cook up something
> that would figure out where that is coming from.

Doesn't trigger on my end either.  At least we know that's not the problem now.

> But since you said you're seeing exits on signal 9, that would seem to
> indicate that someone ran SIGKILL with potentially pending IO.

Just to make sure we're talking about the same thing: the signal 9 I mentioned comes from instrumentation I added around the get_signal(&ksig) call -- I am seeing 9 in the ksig.sig field.  I assume the SIGKILL is not expected, i.e. it is not coming from another location in the io_uring kernel code?
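
The instrumentation is nothing fancy -- roughly this kind of thing in
the ppc do_signal() path (debug only, the message format here is just
illustrative):

	get_signal(&ksig);
	if (ksig.sig)
		pr_info("io-wq dbg: pid %d comm %s sig %d\n",
			current->pid, current->comm, ksig.sig);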


* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:34                   ` Timothy Pearson
@ 2023-11-07 23:52                     ` Jens Axboe
  2023-11-08  0:02                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-07 23:52 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 4:34 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 4:12 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 4:44:56 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 3:29 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 4:16:51 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/7/23 3:07 PM, Timothy Pearson wrote:
>>>>>>> Interestingly enough that's where my current investigation is leading
>>>>>>> as well.  After instrumenting and re-instrumenting the codebase far
>>>>>>> more times than I'd like to admit, I've noticed that the workers don't
>>>>>>> time out on kernel builds that show the corruption, they are directly
>>>>>>> terminated via signal (SIGKILL).  On kernel builds with the delay, the
>>>>>>> workers time out and self-terminate.  I'm still trying to parse out
>>>>>>> what the exact difference between these two mechanisms would be and
>>>>>>> how it plays into the corruption, but at least it's a start...
>>>>>>
>>>>>> Maybe poke at how they are exiting - you say timeout, so they've been
>>>>>> idle for a while and then go away? This would then cause worker creation
>>>>>> again later on if, say, we have 1 worker left and it goes to sleep. So
>>>>>> the timeout itself may not tell you much, outside of then causing that
>>>>>> other condition to happen. You could even try and shrink the timeout to
>>>>>> HZ / 10 or something like that to make it more likely to happen.
>>>>>
>>>>> Agreed.  As of right now I can confirm that with the delay in place
>>>>> (no corruption) the workers are exiting on their own, no signals and
>>>>> no IO_EXIT bit being set.  When I remove the delay (reintroducing the
>>>>> corruption) I see signal 9 being sent to the workers, and a mix of
>>>>> IO_EXIT being set and not being set.
>>>>>
>>>>> Ignoring the signal 9 does not fix the corruption, which makes me
>>>>> wonder more about IO_EXIT and whether things are not fully committed /
>>>>> properly torn down when the worker thread terminates.  This also
>>>>> dovetails nicely with the fact that the observed write corruption
>>>>> always seems to be in the latter portions of the page, never at the
>>>>> beginning of the page, also indicating rapid / unclean termination of
>>>>> the writer process.
>>>>>
>>>>> Will keep digging...
>>>>
>>>> This is useful. If the workers are exiting, they will try and process
>>>> work that is still pending. And it obviously does, or the process would
>>>> hang on exit or ring exit. But it'll also cancel said work, which
>>>> obviously did not happen for the old kthread scheme, as there was no way
>>>> to do that. So you'd just wait for it. Hence maybe what's happening here
>>>> is that mtr/mysql/mariadb isn't properly waiting for pending writes to
>>>> finish? It's just assuming that previously submitted writes will finish
>>>> if the task is killed?
>>>
>>> It's entirely possible.  What is the correct way to wait for pending
>>> writes via the liburing API?  MariaDB uses liburing under the hood,
>>> and if I know the call(s) to look for I can make sure it's properly
>>> handling task exit.
>>
>> I'd expect the task to wait and verify the results of pending requests
>> before doing io_uring_queue_exit(). But I'm not familiar with the code
>> base, maybe the task just exits? Closing the io_uring fd from an exiting
>> task would do the same.
>>
>> I tried the below patch when running mtr here, but don't see any of them
>> trigger. You can try that, that'll tell you if we ever run cancelations
>> on pending io-wq work. If that triggers, I can try and cook up something
>> that would figure out where that is coming from.
> 
> Doesn't trigger on my end either.  At least we know that's not the
> problem now.

Indeed

>> But since you said you're seeing exits on signal 9, that would seem to
>> indicate that someone ran SIGKILL with potentially pending IO.
> 
> Just to make sure we're talking about the same thing, when I refer to
> signal 9 I am referring to instrumentation I added to the
> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
> assume the SIGKILL is not expected, i.e. not coming from another
> location in the io_uring kernel code?

ksig.sig is set to signr, which is the number of the signal. So that
would seem to indicate that _someone_ is sending SIGKILL. But at the
same time, you don't see any of the cancel-on-exit paths triggering.
Puzzling! Neither io_uring nor io-wq sends any signals.

In your instrumentation, are you checking where the signal is coming
from? Is it being dequeued as an actual signal, or is it some other
condition in get_signal() that ends up setting it to SIGKILL?

I don't expect the below to do anything as it _seems_ it's correct
as-is, but...

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index aa17e62f3754..bfec1e95b362 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -244,16 +244,17 @@ static void do_signal(struct task_struct *tsk)
 {
 	sigset_t *oldset = sigmask_to_save();
 	struct ksignal ksig = { .sig = 0 };
+	bool got_signal;
 	int ret;
 
 	BUG_ON(tsk != current);
 
-	get_signal(&ksig);
+	got_signal = get_signal(&ksig);
 
 	/* Is there any syscall restart business here ? */
 	check_syscall_restart(tsk->thread.regs, &ksig.ka, ksig.sig > 0);
 
-	if (ksig.sig <= 0) {
+	if (!got_signal) {
 		/* No signal to deliver -- put the saved sigmask back */
 		restore_saved_sigmask();
 		set_trap_norestart(tsk->thread.regs);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-07 23:52                     ` Jens Axboe
@ 2023-11-08  0:02                       ` Timothy Pearson
  2023-11-08  0:09                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  0:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 5:52:54 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> But since you said you're seeing exits on signal 9, that would seem to
>>> indicate that someone ran SIGKILL with potentially pending IO.
>> 
>> Just to make sure we're talking about the same thing, when I refer to
>> signal 9 I am referring to instrumentation I added to the
>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>> assume the SIGKILL is not expected, i.e. not coming from another
>> location in the io_uring kernel code?
> 
> ksig.sig is set to signr, which is the number of the signal. So that
> would seem to indicate that _someone_ is sending SIGKILL. But at the
> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
> io_uring or io-wq doesn't send any signals.
> 
> In your instrumentation, are you checking where the signal is coming
> from? Is it being dequeued as an actual signal, or is it some other
> condition in get_signal() that ends up setting it to SIGKILL?

I still need to check this in more detail.  What I do know at the moment is that the kill seems to be related to the entire workqueue going down, i.e. it is sent to the workers of a given workqueue after io_wq_put_and_exit() and io_wq_exit_workers() are called for that workqueue.  Not sure if that helps any, will keep digging...

> I don't expect the below to do anything as it _seems_ it's correct
> as-is, but...

<snip>

No change.  As you said, it's correct as-is. :)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:02                       ` Timothy Pearson
@ 2023-11-08  0:09                         ` Jens Axboe
  2023-11-08  3:27                           ` Timothy Pearson
  2023-11-08  4:00                           ` Timothy Pearson
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08  0:09 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 5:02 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>
>>> Just to make sure we're talking about the same thing, when I refer to
>>> signal 9 I am referring to instrumentation I added to the
>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>> assume the SIGKILL is not expected, i.e. not coming from another
>>> location in the io_uring kernel code?
>>
>> ksig.sig is set to signr, which is the number of the signal. So that
>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>> io_uring or io-wq doesn't send any signals.
>>
>> In your instrumentation, are you checking where the signal is coming
>> from? Is it being dequeued as an actual signal, or is it some other
>> condition in get_signal() that ends up setting it to SIGKILL?
> 
> I still need to check this in more detail.  What I do know at the
> moment is that the kill seems to be related to the entire workqueue
> going down, i.e. it is sent to the workers of a given workqueue after
> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
> workqueue.  Not sure if that helps any, will keep digging...

Another option is that it's doing exec() with pending IO, which will
cancel it. But I'm also assuming that mtr/friends will check result
values of writes and would've caught that there.
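
For completeness, "checking result values" on the liburing side would look
roughly like the sketch below: reap every outstanding CQE and verify res
before tearing the ring down.  This is a generic illustration of the expected
pattern, not MariaDB's actual code; the function name and the inflight counter
are assumptions:

	#include <liburing.h>
	#include <stdio.h>

	/* Reap and verify 'inflight' completions before io_uring_queue_exit(). */
	static int drain_and_check(struct io_uring *ring, unsigned inflight)
	{
		struct io_uring_cqe *cqe;
		int ret, errors = 0;

		while (inflight--) {
			ret = io_uring_wait_cqe(ring, &cqe);
			if (ret < 0)
				return ret;
			/* res < 0 is an error (e.g. -ECANCELED); a short write
			 * shows up as res smaller than the requested length. */
			if (cqe->res < 0) {
				fprintf(stderr, "write failed: %d\n", cqe->res);
				errors++;
			}
			io_uring_cqe_seen(ring, cqe);
		}
		return errors;
	}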

>> I don't expect the below to do anything as it _seems_ it's correct
>> as-is, but...
> 
> <snip>
> 
> No change.  As you said, it's correct as-is. :)

I thought so, but try and not take anything for granted at this point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:09                         ` Jens Axboe
@ 2023-11-08  3:27                           ` Timothy Pearson
  2023-11-08  3:30                             ` Timothy Pearson
  2023-11-08  4:00                           ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  3:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 6:09:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>
>>>> Just to make sure we're talking about the same thing, when I refer to
>>>> signal 9 I am referring to instrumentation I added to the
>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>> location in the io_uring kernel code?
>>>
>>> ksig.sig is set to signr, which is the number of the signal. So that
>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>> io_uring or io-wq doesn't send any signals.
>>>
>>> In your instrumentation, are you checking where the signal is coming
>>> from? Is it being dequeued as an actual signal, or is it some other
>>> condition in get_signal() that ends up setting it to SIGKILL?
>> 
>> I still need to check this in more detail.  What I do know at the
>> moment is that the kill seems to be related to the entire workqueue
>> going down, i.e. it is sent to the workers of a given workqueue after
>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>> workqueue.  Not sure if that helps any, will keep digging...
> 
> Another option is that it's doing exec() with pending IO, which will
> cancel it. But I'm also assuming that mtr/friends will check result
> values of writes and would've caught that there.

So I ran it a few more times and we do actually hit the WARN_ON_ONCE(do_kill) in io_worker_handle_work().  It's possible we hit it much earlier in the boot process and I simply missed it, since it's only printed once per boot.  If you have a potential patch or avenue of testing based on that information, I'm happy to try it.

The PID and code of the sending process are both 0 (i.e. SI_USER), assuming get_signal() is actually populating those fields.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  3:27                           ` Timothy Pearson
@ 2023-11-08  3:30                             ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  3:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineeringinc.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 9:27:13 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 6:09:39 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>>
>>>>> Just to make sure we're talking about the same thing, when I refer to
>>>>> signal 9 I am referring to instrumentation I added to the
>>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>>> location in the io_uring kernel code?
>>>>
>>>> ksig.sig is set to signr, which is the number of the signal. So that
>>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>>> io_uring or io-wq doesn't send any signals.
>>>>
>>>> In your instrumentation, are you checking where the signal is coming
>>>> from? Is it being dequeued as an actual signal, or is it some other
>>>> condition in get_signal() that ends up setting it to SIGKILL?
>>> 
>>> I still need to check this in more detail.  What I do know at the
>>> moment is that the kill seems to be related to the entire workqueue
>>> going down, i.e. it is sent to the workers of a given workqueue after
>>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>>> workqueue.  Not sure if that helps any, will keep digging...
>> 
>> Another option is that it's doing exec() with pending IO, which will
>> cancel it. But I'm also assuming that mtr/friends will check result
>> values of writes and would've caught that there.
> 
> So I ran it a few more times and we do actually hit the WARN_ON_ONCE(do_kill) in
> io_worker_handle_work().  It's possible we hit it much earlier in the boot
> process before and I simply missed it, since it's only printed once per boot.
> If you have a potential patch or avenue of testing based on that information
> I'm happy to try it.

Actually, I take that back.  I had other debugging in place that tripped that condition.  Investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  0:09                         ` Jens Axboe
  2023-11-08  3:27                           ` Timothy Pearson
@ 2023-11-08  4:00                           ` Timothy Pearson
  2023-11-08 15:10                             ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08  4:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 7, 2023 6:09:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>
>>>> Just to make sure we're talking about the same thing, when I refer to
>>>> signal 9 I am referring to instrumentation I added to the
>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>> location in the io_uring kernel code?
>>>
>>> ksig.sig is set to signr, which is the number of the signal. So that
>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>> io_uring or io-wq doesn't send any signals.
>>>
>>> In your instrumentation, are you checking where the signal is coming
>>> from? Is it being dequeued as an actual signal, or is it some other
>>> condition in get_signal() that ends up setting it to SIGKILL?
>> 
>> I still need to check this in more detail.  What I do know at the
>> moment is that the kill seems to be related to the entire workqueue
>> going down, i.e. it is sent to the workers of a given workqueue after
>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>> workqueue.  Not sure if that helps any, will keep digging...
> 
> Another option is that it's doing exec() with pending IO, which will
> cancel it. But I'm also assuming that mtr/friends will check result
> values of writes and would've caught that there.

Here's something potentially more useful, the stack trace of what's calling the workqueue exit function:

[   35.074923] Call Trace:
[   35.074939] [c000000007183890] [c0000000005c0804] io_wq_exit_workers+0x48/0x214 (unreliable)
[   35.074994] [c000000007183920] [c0000000005c0410] io_wq_put+0xa0/0x27c
[   35.075034] [c0000000071839f0] [c0000000005a92a8] io_uring_clean_tctx+0x98/0xe0
[   35.075082] [c000000007183a30] [c0000000005bdc78] __io_uring_files_cancel+0x4f8/0x580
[   35.075129] [c000000007183b20] [c000000000153fc8] do_exit+0x1c8/0xce0
[   35.075169] [c000000007183bf0] [c000000000154c58] do_group_exit+0x108/0x110
[   35.075209] [c000000007183c30] [c00000000016ac6c] get_signal+0xbfc/0xce0
[   35.075251] [c000000007183d10] [c00000000001ed50] do_notify_resume+0xd0/0x420
[   35.075298] [c000000007183dc0] [c00000000002dbbc] syscall_exit_prepare+0x15c/0x360
[   35.075348] [c000000007183e10] [c00000000000cf74] system_call_vectored_common+0xf4/0x260

This call is associated with the mariadb userspace PID:

[   35.074242] CPU: 12 PID: 1260 Comm: mariadbd

To me, it almost looks like the mariadb I/O worker thread is getting terminated via signal and that this termination is taking out the write before it can complete.  Is that even possible?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08  4:00                           ` Timothy Pearson
@ 2023-11-08 15:10                             ` Jens Axboe
  2023-11-08 15:14                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 15:10 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/7/23 9:00 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 7, 2023 6:09:39 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/7/23 5:02 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Tuesday, November 7, 2023 5:52:54 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/7/23 4:34 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Tuesday, November 7, 2023 5:16:24 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> But since you said you're seeing exits on signal 9, that would seem to
>>>>>> indicate that someone ran SIGKILL with potentially pending IO.
>>>>>
>>>>> Just to make sure we're talking about the same thing, when I refer to
>>>>> signal 9 I am referring to instrumentation I added to the
>>>>> get_signal(&ksig) call.  I am seeing 9 in the ksig.sig field.  I
>>>>> assume the SIGKILL is not expected, i.e. not coming from another
>>>>> location in the io_uring kernel code?
>>>>
>>>> ksig.sig is set to signr, which is the number of the signal. So that
>>>> would seem to indicate that _someone_ is sending SIGKILL. But at the
>>>> same time, you don't see any of the cancel-on-exit triggering. Puzzling!
>>>> io_uring or io-wq doesn't send any signals.
>>>>
>>>> In your instrumentation, are you checking where the signal is coming
>>>> from? Is it being dequeued as an actual signal, or is it some other
>>>> condition in get_signal() that ends up setting it to SIGKILL?
>>>
>>> I still need to check this in more detail.  What I do know at the
>>> moment is that the kill seems to be related to the entire workqueue
>>> going down, i.e. it is sent to the workers of a given workqueue after
>>> io_wq_put_and_exit() and io_wq_exit_workers() are called for that
>>> workqueue.  Not sure if that helps any, will keep digging...
>>
>> Another option is that it's doing exec() with pending IO, which will
>> cancel it. But I'm also assuming that mtr/friends will check result
>> values of writes and would've caught that there.
> 
> Here's something potentially more useful, the stack trace of what's
> calling the workqueue exit function:
> 
> [   35.074923] Call Trace:
> [   35.074939] [c000000007183890] [c0000000005c0804] io_wq_exit_workers+0x48/0x214 (unreliable)
> [   35.074994] [c000000007183920] [c0000000005c0410] io_wq_put+0xa0/0x27c
> [   35.075034] [c0000000071839f0] [c0000000005a92a8] io_uring_clean_tctx+0x98/0xe0
> [   35.075082] [c000000007183a30] [c0000000005bdc78] __io_uring_files_cancel+0x4f8/0x580
> [   35.075129] [c000000007183b20] [c000000000153fc8] do_exit+0x1c8/0xce0
> [   35.075169] [c000000007183bf0] [c000000000154c58] do_group_exit+0x108/0x110
> [   35.075209] [c000000007183c30] [c00000000016ac6c] get_signal+0xbfc/0xce0
> [   35.075251] [c000000007183d10] [c00000000001ed50] do_notify_resume+0xd0/0x420
> [   35.075298] [c000000007183dc0] [c00000000002dbbc] syscall_exit_prepare+0x15c/0x360
> [   35.075348] [c000000007183e10] [c00000000000cf74] system_call_vectored_common+0xf4/0x260

The task owning (or using) the ring got sent a signal, and it's now
exiting, which entails canceling pending IO operations too.

> This call is associated with the mariadb userspace PID:
> 
> [   35.074242] CPU: 12 PID: 1260 Comm: mariadbd
> 
> To me, it almost looks like the mariadb I/O worker thread is getting
> terminated via signal and that this termination is taking out the
> write before it can complete.  Is that even possible?

If a write has been started, then it should finish. For storage IO,
there's no way to cancel a write that's already in progress. But async
writes queue up, and if an io-wq worker hasn't retrieved it yet, it will
be found and canceled, and hence never make it to stable media. But this
should also have triggered the cancelation WARN_ON() stuff I added in
that debug patch, but not sure if you are running with that as well.

Might be useful to add some debugging around where signals are sent,
rather than retrieved. But seems like we'd already know it's SIGKILL per
previous debugging.
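
As a sketch of what that sender-side debugging could look like (illustrative
only -- the hook point in do_send_sig_info() in kernel/signal.c and the comm
match are assumptions, not something tested here):

	/* debug only: log SIGKILLs aimed at the mariadb processes */
	if (sig == SIGKILL && !strncmp(p->comm, "mariadb", 7))
		pr_info("SIGKILL: %s[%d] -> %s[%d]\n",
			current->comm, task_pid_nr(current),
			p->comm, task_pid_nr(p));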

It could also be a task that has pending IO and is doing exec() (and
friends), this would also cancel inflight IO.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 15:10                             ` Jens Axboe
@ 2023-11-08 15:14                               ` Jens Axboe
  2023-11-08 17:10                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 15:14 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 8:10 AM, Jens Axboe wrote:
> It could also be a task that has pending IO and is doing exec() (and
> friends), this would also cancel inflight IO.

If this is the case, then you could try with this one to just disable
that and see if the corruption goes away:

diff --git a/fs/exec.c b/fs/exec.c
index 4aa19b24f281..7359a85b96ee 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,7 +1266,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
+#if 0
 	io_uring_task_cancel();
+#endif
 
 	/* Ensure the files table is not shared. */
 	retval = unshare_files();

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 15:14                               ` Jens Axboe
@ 2023-11-08 17:10                                 ` Timothy Pearson
  2023-11-08 17:26                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 17:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 9:14:55 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 8:10 AM, Jens Axboe wrote:
>> It could also be a task that has pending IO and is doing exec() (and
>> friends), this would also cancel inflight IO.
> 
> If this is the case, then you could try with this one to just disable
> that and see if the corruption goes away:

Unfortunately that had no effect on the corruption.  I've also traced the signal generation into the get_signal() call, which is apparently sending SIGKILL when the thread group is marked for termination -- this is in turn why the PID fields etc. are all zero.

Investigation continues.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:10                                 ` Timothy Pearson
@ 2023-11-08 17:26                                   ` Jens Axboe
  2023-11-08 17:40                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:26 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:10 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>> It could also be a task that has pending IO and is doing exec() (and
>>> friends), this would also cancel inflight IO.
>>
>> If this is the case, then you could try with this one to just disable
>> that and see if the corruption goes away:
> 
> Unfortunately that had no effect on the corruption.  I've also traced
> the signal generation into the get_signal() call, which is apparently
> sending SIGKILL when the thread group is marked for termination --
> this is in turn why the PID fields etc. are all zero.

That's good news though, because I'm continually pondering why powerpc
is different here.

> Investigation continues.

If it's not exec, then it has to be a signal. I'm assuming you're
hitting this in get_signal():

		/* Has this task already been marked for death? */
		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
		     signal->group_exec_task) {
			clear_siginfo(&ksig->info);
			ksig->info.si_signo = signr = SIGKILL;
			sigdelset(&current->pending.signal, SIGKILL);
			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
				&sighand->action[SIGKILL - 1]);
			recalc_sigpending();
			goto fatal;
		}

which is triggered either by exec (which we've verified it is not) or by
a fatal signal, so I don't see anything other than a signal sent to
mtr/mariadb for exit.

Does this trigger? Doesn't necessarily indicate a bug as it would be
valid, but if it does trigger, perhaps io-wq has unstarted requests at
this point and they get canceled and hence never written. If this does
trigger, maybe try and do your sleep trick there too and see if that
gets rid of it.


diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..53e4c3324672 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1011,6 +1013,8 @@ do_group_exit(int exit_code)
 		else if (sig->group_exec_task)
 			exit_code = 0;
 		else {
+			if (current->io_uring)
+				WARN_ON(!strncmp(current->comm, "mariadbd", 8));
 			sig->group_exit_code = exit_code;
 			sig->flags = SIGNAL_GROUP_EXIT;
 			zap_other_threads(current);
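
For concreteness, the sleep trick in that spot could be as simple as a
debug-only delay next to the WARN_ON above (a sketch only -- mdelay() and the
exact placement are assumptions, and <linux/delay.h> would be needed):

		else {
			if (current->io_uring)
				WARN_ON(!strncmp(current->comm, "mariadbd", 8));
			mdelay(1);	/* debug: perturb timing before the group exit */
			sig->group_exit_code = exit_code;
			sig->flags = SIGNAL_GROUP_EXIT;
			zap_other_threads(current);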

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:26                                   ` Jens Axboe
@ 2023-11-08 17:40                                     ` Timothy Pearson
  2023-11-08 17:49                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 17:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 11:26:53 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>> It could also be a task that has pending IO and is doing exec() (and
>>>> friends), this would also cancel inflight IO.
>>>
>>> If this is the case, then you could try with this one to just disable
>>> that and see if the corruption goes away:
>> 
>> Unfortunately that had no effect on the corruption.  I've also traced
>> the signal generation into the get_signal() call, which is apparently
>> sending SIGKILL when the thread group is marked for termination --
>> this is in turn why the PID fields etc. are all zero.
> 
> That's good news though, because I'm continually pondering why powerpc
> is different here.
> 
>> Investigation continues.
> 
> If it's not exec, then it has to be a signal. I'm assuming you're
> hitting this in get_signal():
> 
>		/* Has this task already been marked for death? */
>		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>		     signal->group_exec_task) {
>			clear_siginfo(&ksig->info);
>			ksig->info.si_signo = signr = SIGKILL;
>			sigdelset(&current->pending.signal, SIGKILL);
>			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>				&sighand->action[SIGKILL - 1]);
>			recalc_sigpending();
>			goto fatal;
>		}
> 
> which is either exec (which we verified it is not), so I don't see
> anything other than this being a signal sent to mtr/mariadb for exit.
> 
> Does this trigger? Doesn't necessarily indicate a bug as it would be
> valid, but if it does trigger, perhaps io-wq has unstarted requests at
> this point and they get canceled and hence never written. If this does
> trigger, maybe try and do your sleep trick there too and see if that
> gets rid of it.

Yes, it does indeed trigger.  Is there a way to directly check for the unstarted requests?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:40                                     ` Timothy Pearson
@ 2023-11-08 17:49                                       ` Jens Axboe
  2023-11-08 17:57                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:49 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:40 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>> friends), this would also cancel inflight IO.
>>>>
>>>> If this is the case, then you could try with this one to just disable
>>>> that and see if the corruption goes away:
>>>
>>> Unfortunately that had no effect on the corruption.  I've also traced
>>> the signal generation into the get_signal() call, which is apparently
>>> sending SIGKILL when the thread group is marked for termination --
>>> this is in turn why the PID fields etc. are all zero.
>>
>> That's good news though, because I'm continually pondering why powerpc
>> is different here.
>>
>>> Investigation continues.
>>
>> If it's not exec, then it has to be a signal. I'm assuming you're
>> hitting this in get_signal():
>>
>> 		/* Has this task already been marked for death? */
>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>> 		     signal->group_exec_task) {
>> 			clear_siginfo(&ksig->info);
>> 			ksig->info.si_signo = signr = SIGKILL;
>> 			sigdelset(&current->pending.signal, SIGKILL);
>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>> 				&sighand->action[SIGKILL - 1]);
>> 			recalc_sigpending();
>> 			goto fatal;
>> 		}
>>
>> which is either exec (which we verified it is not), so I don't see
>> anything other than this being a signal sent to mtr/mariadb for exit.
>>
>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>> this point and they get canceled and hence never written. If this does
>> trigger, maybe try and do your sleep trick there too and see if that
>> gets rid of it.
> 
> Yes, it does indeed trigger.  Is there a way to directly check for the
> unstarted requests?

Let me hack up a debug patch for this, give me a minute.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:49                                       ` Jens Axboe
@ 2023-11-08 17:57                                         ` Jens Axboe
  2023-11-08 18:36                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 17:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 10:49 AM, Jens Axboe wrote:
> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>> friends), this would also cancel inflight IO.
>>>>>
>>>>> If this is the case, then you could try with this one to just disable
>>>>> that and see if the corruption goes away:
>>>>
>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>> the signal generation into the get_signal() call, which is apparently
>>>> sending SIGKILL when the thread group is marked for termination --
>>>> this is in turn why the PID fields etc. are all zero.
>>>
>>> That's good news though, because I'm continually pondering why powerpc
>>> is different here.
>>>
>>>> Investigation continues.
>>>
>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>> hitting this in get_signal():
>>>
>>> 		/* Has this task already been marked for death? */
>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>> 		     signal->group_exec_task) {
>>> 			clear_siginfo(&ksig->info);
>>> 			ksig->info.si_signo = signr = SIGKILL;
>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>> 				&sighand->action[SIGKILL - 1]);
>>> 			recalc_sigpending();
>>> 			goto fatal;
>>> 		}
>>>
>>> which is either exec (which we verified it is not), so I don't see
>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>
>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>> this point and they get canceled and hence never written. If this does
>>> trigger, maybe try and do your sleep trick there too and see if that
>>> gets rid of it.
>>
>> Yes, it does indeed trigger.  Is there a way to directly check for the
>> unstarted requests?
> 
> Let me hack up a debug patch for this, give me a minute.

This should do it - whenever this condition hits, you should see
something ala:

[   97.960877] io_wq_dump: work_items=0, cur=0, next=0

in dmesg. work_items is the number of work items we found that haven't
been scheduled yet. cur is what a worker is currently processing, and
next is basically a way for cancel to find a work item before it gets
assigned. work_items and next may get canceled; cur should always finish
for storage IO, since signals don't interrupt it.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..643b8e9de518 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
 
+	WARN_ON_ONCE(do_kill);
+
 	do {
 		struct io_wq_work *work;
 
@@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
+		WARN_ON_ONCE(1);
 		work->flags |= IO_WQ_WORK_CANCEL;
 		wq->do_work(work);
 		work = wq->free_work(work);
@@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	 */
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 	    (work->flags & IO_WQ_WORK_CANCEL)) {
+		WARN_ON_ONCE(1);
 		io_run_cancel(work, wq);
 		return;
 	}
@@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
 	return 0;
 }
 
+struct worker_lookup {
+	int cur_work;
+	int next_work;
+};
+
+static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
+{
+	struct worker_lookup *l = data;
+
+	raw_spin_lock(&worker->lock);
+	if (worker->cur_work)
+		l->cur_work++;
+	if (worker->next_work)
+		l->next_work++;
+	raw_spin_unlock(&worker->lock);
+	return false;
+}
+
+void io_wq_dump(struct io_uring_task *tctx)
+{
+	struct io_wq_work_node *node, *prev;
+	struct io_wq *wq = tctx->io_wq;
+	struct worker_lookup l = { };
+	int i, work_items;
+
+	if (!wq) {
+		printk("%s: no wq\n", __FUNCTION__);
+		return;
+	}
+
+	work_items = 0;
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
+
+		raw_spin_lock(&acct->lock);
+		wq_list_for_each(node, prev, &acct->work_list)
+			work_items++;
+		raw_spin_unlock(&acct->lock);
+	}
+
+	rcu_read_lock();
+	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
+	rcu_read_unlock();
+
+	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
+							l.cur_work, l.next_work);
+}
+
 static __init int io_wq_init(void)
 {
 	int ret;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ed254076c723..c0bd35e5429a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
 
 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
 	if (work->flags & IO_WQ_WORK_CANCEL) {
+		WARN_ON_ONCE(1);
 fail:
+		WARN_ON_ONCE(1);
 		io_req_task_queue_fail(req, err);
 		return;
 	}
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..250ae820340c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
 	do_exit((error_code&0xff)<<8);
 }
 
+void io_wq_dump(struct io_uring_task *);
+
 /*
  * Take down every thread in the group.  This is called by fatal signals
  * as well as by sys_exit_group (below).
@@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
 		else if (sig->group_exec_task)
 			exit_code = 0;
 		else {
+			if (!strncmp(current->comm, "mariadbd", 8) &&
+			    current->io_uring)
+				io_wq_dump(current->io_uring);
 			sig->group_exit_code = exit_code;
 			sig->flags = SIGNAL_GROUP_EXIT;
 			zap_other_threads(current);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 17:57                                         ` Jens Axboe
@ 2023-11-08 18:36                                           ` Timothy Pearson
  2023-11-08 18:51                                             ` Timothy Pearson
  2023-11-08 19:06                                             ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 11:57:59 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 10:49 AM, Jens Axboe wrote:
>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>> friends), this would also cancel inflight IO.
>>>>>>
>>>>>> If this is the case, then you could try with this one to just disable
>>>>>> that and see if the corruption goes away:
>>>>>
>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>> the signal generation into the get_signal() call, which is apparently
>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>> this is in turn why the PID fields etc. are all zero.
>>>>
>>>> That's good news though, because I'm continually pondering why powerpc
>>>> is different here.
>>>>
>>>>> Investigation continues.
>>>>
>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>> hitting this in get_signal():
>>>>
>>>> 		/* Has this task already been marked for death? */
>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>> 		     signal->group_exec_task) {
>>>> 			clear_siginfo(&ksig->info);
>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>> 				&sighand->action[SIGKILL - 1]);
>>>> 			recalc_sigpending();
>>>> 			goto fatal;
>>>> 		}
>>>>
>>>> which is either exec (which we verified it is not), so I don't see
>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>
>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>> this point and they get canceled and hence never written. If this does
>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>> gets rid of it.
>>>
>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>> unstarted requests?
>> 
>> Let me hack up a debug patch for this, give me a minute.
> 
> This should do it - whenever this condition hits, you should see
> something ala:
> 
> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
> 
> in dmesg. work_items is the number of work items we found that haven't
> been scheduled yet. cur is what a worker is currently processing, and
> next is basically a way for cancel to find a work item before it gets
> assigned. work_items and next may get canceled, work_items should always
> finish for storage IO, since signals don't interrupt them.
> 
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..643b8e9de518 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
> 	struct io_wq *wq = worker->wq;
> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
> 
> +	WARN_ON_ONCE(do_kill);
> +
> 	do {
> 		struct io_wq_work *work;
> 
> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void
> *data)
> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
> {
> 	do {
> +		WARN_ON_ONCE(1);
> 		work->flags |= IO_WQ_WORK_CANCEL;
> 		wq->do_work(work);
> 		work = wq->free_work(work);
> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work
> *work)
> 	 */
> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
> +		WARN_ON_ONCE(1);
> 		io_run_cancel(work, wq);
> 		return;
> 	}
> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
> 	return 0;
> }
> 
> +struct worker_lookup {
> +	int cur_work;
> +	int next_work;
> +};
> +
> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
> +{
> +	struct worker_lookup *l = data;
> +
> +	raw_spin_lock(&worker->lock);
> +	if (worker->cur_work)
> +		l->cur_work++;
> +	if (worker->next_work)
> +		l->next_work++;
> +	raw_spin_unlock(&worker->lock);
> +	return false;
> +}
> +
> +void io_wq_dump(struct io_uring_task *tctx)
> +{
> +	struct io_wq_work_node *node, *prev;
> +	struct io_wq *wq = tctx->io_wq;
> +	struct worker_lookup l = { };
> +	int i, work_items;
> +
> +	if (!wq) {
> +		printk("%s: no wq\n", __FUNCTION__);
> +		return;
> +	}
> +
> +	work_items = 0;
> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
> +
> +		raw_spin_lock(&acct->lock);
> +		wq_list_for_each(node, prev, &acct->work_list)
> +			work_items++;
> +		raw_spin_unlock(&acct->lock);
> +	}
> +
> +	rcu_read_lock();
> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
> +	rcu_read_unlock();
> +
> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
> +							l.cur_work, l.next_work);
> +}
> +
> static __init int io_wq_init(void)
> {
> 	int ret;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index ed254076c723..c0bd35e5429a 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
> 
> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
> 	if (work->flags & IO_WQ_WORK_CANCEL) {
> +		WARN_ON_ONCE(1);
> fail:
> +		WARN_ON_ONCE(1);
> 		io_req_task_queue_fail(req, err);
> 		return;
> 	}
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ee9f43bed49a..250ae820340c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
> 	do_exit((error_code&0xff)<<8);
> }
> 
> +void io_wq_dump(struct io_uring_task *);
> +
> /*
>  * Take down every thread in the group.  This is called by fatal signals
>  * as well as by sys_exit_group (below).
> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
> 		else if (sig->group_exec_task)
> 			exit_code = 0;
> 		else {
> +			if (!strncmp(current->comm, "mariadbd", 8) &&
> +			    current->io_uring)
> +				io_wq_dump(current->io_uring);
> 			sig->group_exit_code = exit_code;
> 			sig->flags = SIGNAL_GROUP_EXIT;
> 			zap_other_threads(current);

Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit of a red herring.

I have been giving some thought to the CPU pinning of the workers; one thing that may have been overlooked is that pinning could force-serialize worker operations.  Did you just have to pin the io workers, or did the workqueue also need to be pinned for the corruption to disappear?
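
For reference, the userspace-visible way to restrict a ring's io-wq workers is
IORING_REGISTER_IOWQ_AFF, exposed by liburing as io_uring_register_iowq_aff().
A minimal sketch (illustrative only, not what was used in the earlier pinning
experiment) confining one ring's workers to CPU 0:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <liburing.h>

	/* Restrict this ring's io-wq workers to CPU 0. */
	static int pin_iowq_to_cpu0(struct io_uring *ring)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(0, &mask);
		return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
	}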

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:36                                           ` Timothy Pearson
@ 2023-11-08 18:51                                             ` Timothy Pearson
  2023-11-08 19:08                                               ` Jens Axboe
  2023-11-08 19:06                                             ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 18:51 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 12:36:01 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>
>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>> that and see if the corruption goes away:
>>>>>>
>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>
>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>> is different here.
>>>>>
>>>>>> Investigation continues.
>>>>>
>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>> hitting this in get_signal():
>>>>>
>>>>> 		/* Has this task already been marked for death? */
>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>> 		     signal->group_exec_task) {
>>>>> 			clear_siginfo(&ksig->info);
>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>> 			recalc_sigpending();
>>>>> 			goto fatal;
>>>>> 		}
>>>>>
>>>>> which is either exec (which we verified it is not), so I don't see
>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>
>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>> this point and they get canceled and hence never written. If this does
>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>> gets rid of it.
>>>>
>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>> unstarted requests?
>>> 
>>> Let me hack up a debug patch for this, give me a minute.
>> 
>> This should do it - whenever this condition hits, you should see
>> something ala:
>> 
>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>> 
>> in dmesg. work_items is the number of work items we found that haven't
>> been scheduled yet. cur is what a worker is currently processing, and
>> next is basically a way for cancel to find a work item before it gets
>> assigned. work_items and next may get canceled, work_items should always
>> finish for storage IO, since signals don't interrupt them.
>> 
>> 
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..643b8e9de518 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>> 	struct io_wq *wq = worker->wq;
>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>> 
>> +	WARN_ON_ONCE(do_kill);
>> +
>> 	do {
>> 		struct io_wq_work *work;
>> 
>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void
>> *data)
>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>> {
>> 	do {
>> +		WARN_ON_ONCE(1);
>> 		work->flags |= IO_WQ_WORK_CANCEL;
>> 		wq->do_work(work);
>> 		work = wq->free_work(work);
>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work
>> *work)
>> 	 */
>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>> +		WARN_ON_ONCE(1);
>> 		io_run_cancel(work, wq);
>> 		return;
>> 	}
>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>> 	return 0;
>> }
>> 
>> +struct worker_lookup {
>> +	int cur_work;
>> +	int next_work;
>> +};
>> +
>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>> +{
>> +	struct worker_lookup *l = data;
>> +
>> +	raw_spin_lock(&worker->lock);
>> +	if (worker->cur_work)
>> +		l->cur_work++;
>> +	if (worker->next_work)
>> +		l->next_work++;
>> +	raw_spin_unlock(&worker->lock);
>> +	return false;
>> +}
>> +
>> +void io_wq_dump(struct io_uring_task *tctx)
>> +{
>> +	struct io_wq_work_node *node, *prev;
>> +	struct io_wq *wq = tctx->io_wq;
>> +	struct worker_lookup l = { };
>> +	int i, work_items;
>> +
>> +	if (!wq) {
>> +		printk("%s: no wq\n", __FUNCTION__);
>> +		return;
>> +	}
>> +
>> +	work_items = 0;
>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>> +
>> +		raw_spin_lock(&acct->lock);
>> +		wq_list_for_each(node, prev, &acct->work_list)
>> +			work_items++;
>> +		raw_spin_unlock(&acct->lock);
>> +	}
>> +
>> +	rcu_read_lock();
>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>> +	rcu_read_unlock();
>> +
>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>> +							l.cur_work, l.next_work);
>> +}
>> +
>> static __init int io_wq_init(void)
>> {
>> 	int ret;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index ed254076c723..c0bd35e5429a 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>> 
>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>> +		WARN_ON_ONCE(1);
>> fail:
>> +		WARN_ON_ONCE(1);
>> 		io_req_task_queue_fail(req, err);
>> 		return;
>> 	}
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index ee9f43bed49a..250ae820340c 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> 	do_exit((error_code&0xff)<<8);
>> }
>> 
>> +void io_wq_dump(struct io_uring_task *);
>> +
>> /*
>>  * Take down every thread in the group.  This is called by fatal signals
>>  * as well as by sys_exit_group (below).
>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>> 		else if (sig->group_exec_task)
>> 			exit_code = 0;
>> 		else {
>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>> +			    current->io_uring)
>> +				io_wq_dump(current->io_uring);
>> 			sig->group_exit_code = exit_code;
>> 			sig->flags = SIGNAL_GROUP_EXIT;
>> 			zap_other_threads(current);
> 
> Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit
> of a red herring.

Another data point on this...the tests run in a loop, approximately 2 seconds per test run.  For runs where things are *not* corrupted, I do not see the io_wq_dump message printed.  For runs that do show corruption, I see a batch of io_wq_dump: work_items=0, cur=0, next=0 messages.

I wonder if what we're actually seeing here is the corruption being detected by mariadb, and it self-terminating in userspace, hence the io_uring system seeing the sudden termination of the application and associated workers.
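
One way I can think of to check that theory is to log where the SIGKILL is actually generated.  A rough debug hack, purely illustrative and written against a recent kernel/signal.c (where the send path is __send_signal_locked(); older trees call it __send_signal(), and the sig/t parameter names below are assumed from that function), would be to drop something like this near the top of that function:

	/* debug only: who is generating SIGKILL for mariadbd? */
	if (sig == SIGKILL && !strncmp(t->comm, "mariadbd", 8))
		pr_info("SIGKILL for %s/%d from %s/%d\n",
			t->comm, task_pid_nr(t),
			current->comm, task_pid_nr(current));

An external kill (from mtr, or from mariadbd signalling itself via kill()) should show up there, while the internal group-exit path that just marks the other threads with SIGKILL pending would not, so it should at least separate the two cases.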

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:36                                           ` Timothy Pearson
  2023-11-08 18:51                                             ` Timothy Pearson
@ 2023-11-08 19:06                                             ` Jens Axboe
  2023-11-08 22:05                                               ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 19:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 11:36 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>
>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>> that and see if the corruption goes away:
>>>>>>
>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>
>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>> is different here.
>>>>>
>>>>>> Investigation continues.
>>>>>
>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>> hitting this in get_signal():
>>>>>
>>>>> 		/* Has this task already been marked for death? */
>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>> 		     signal->group_exec_task) {
>>>>> 			clear_siginfo(&ksig->info);
>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>> 			recalc_sigpending();
>>>>> 			goto fatal;
>>>>> 		}
>>>>>
>>>>> which is either exec (which we verified it is not), so I don't see
>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>
>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>> this point and they get canceled and hence never written. If this does
>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>> gets rid of it.
>>>>
>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>> unstarted requests?
>>>
>>> Let me hack up a debug patch for this, give me a minute.
>>
>> This should do it - whenever this condition hits, you should see
>> something ala:
>>
>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>>
>> in dmesg. work_items is the number of work items we found that haven't
>> been scheduled yet. cur is what a worker is currently processing, and
>> next is basically a way for cancel to find a work item before it gets
>> assigned. work_items and next may get canceled, work_items should always
>> finish for storage IO, since signals don't interrupt them.
>>
>>
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..643b8e9de518 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>> 	struct io_wq *wq = worker->wq;
>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>>
>> +	WARN_ON_ONCE(do_kill);
>> +
>> 	do {
>> 		struct io_wq_work *work;
>>
>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>> {
>> 	do {
>> +		WARN_ON_ONCE(1);
>> 		work->flags |= IO_WQ_WORK_CANCEL;
>> 		wq->do_work(work);
>> 		work = wq->free_work(work);
>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
>> 	 */
>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>> +		WARN_ON_ONCE(1);
>> 		io_run_cancel(work, wq);
>> 		return;
>> 	}
>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>> 	return 0;
>> }
>>
>> +struct worker_lookup {
>> +	int cur_work;
>> +	int next_work;
>> +};
>> +
>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>> +{
>> +	struct worker_lookup *l = data;
>> +
>> +	raw_spin_lock(&worker->lock);
>> +	if (worker->cur_work)
>> +		l->cur_work++;
>> +	if (worker->next_work)
>> +		l->next_work++;
>> +	raw_spin_unlock(&worker->lock);
>> +	return false;
>> +}
>> +
>> +void io_wq_dump(struct io_uring_task *tctx)
>> +{
>> +	struct io_wq_work_node *node, *prev;
>> +	struct io_wq *wq = tctx->io_wq;
>> +	struct worker_lookup l = { };
>> +	int i, work_items;
>> +
>> +	if (!wq) {
>> +		printk("%s: no wq\n", __FUNCTION__);
>> +		return;
>> +	}
>> +
>> +	work_items = 0;
>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>> +
>> +		raw_spin_lock(&acct->lock);
>> +		wq_list_for_each(node, prev, &acct->work_list)
>> +			work_items++;
>> +		raw_spin_unlock(&acct->lock);
>> +	}
>> +
>> +	rcu_read_lock();
>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>> +	rcu_read_unlock();
>> +
>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>> +							l.cur_work, l.next_work);
>> +}
>> +
>> static __init int io_wq_init(void)
>> {
>> 	int ret;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index ed254076c723..c0bd35e5429a 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>>
>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>> +		WARN_ON_ONCE(1);
>> fail:
>> +		WARN_ON_ONCE(1);
>> 		io_req_task_queue_fail(req, err);
>> 		return;
>> 	}
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index ee9f43bed49a..250ae820340c 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> 	do_exit((error_code&0xff)<<8);
>> }
>>
>> +void io_wq_dump(struct io_uring_task *);
>> +
>> /*
>>  * Take down every thread in the group.  This is called by fatal signals
>>  * as well as by sys_exit_group (below).
>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>> 		else if (sig->group_exec_task)
>> 			exit_code = 0;
>> 		else {
>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>> +			    current->io_uring)
>> +				io_wq_dump(current->io_uring);
>> 			sig->group_exit_code = exit_code;
>> 			sig->flags = SIGNAL_GROUP_EXIT;
>> 			zap_other_threads(current);
> 
> Unfortunately it's only returning work_items=0, cur=0, next=0, so that
> was a bit of a red herring.

Well that's probably a good thing, as it also didn't make a lot of sense
:-)

> I have been giving some thought to the CPU pinning of the workers, and
> one thing that may have been overlooked is that this could potentially
> force-serialize worker operations.  Did you just have to pin the io
> workers or did the workqueue also need to be pinned for the corruption
> to disappear?

Not sure I follow, the workers ARE the workqueue. For the pinning, I
just made sure that the workers are on the same CPU. I honestly don't
remember all the details there outside of what I can read back from the
emails I sent, it's been a while. My suspicion back then was that it was
some weird ppc cache aliasing effect with the copy into kernel memory
happening on cpu X, and then we immediately punt it to cpu Y for
processing.

I didn't see the corruption happening if I just forced the requests to
complete inline (eg always on the CPU being submitted, no punt to
io-wq), and I didn't see it if I ensured that the io-wq worker was
running on the same CPU as the submitter.
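
As an aside, if it's useful for narrowing this down on your end: the
"worker on the same CPU as the submitter" condition can be approximated
from userspace by restricting the ring's io-wq affinity, no kernel patch
needed.  A minimal sketch, assuming liburing 2.1+ and a test program that
is itself pinned to a single CPU (untested, just to show the knob):

#define _GNU_SOURCE
#include <sched.h>
#include <liburing.h>

/* Restrict this ring's io-wq workers to the CPU the submitter runs on. */
static int pin_iowq_to_current_cpu(struct io_uring *ring)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(sched_getcpu(), &mask);
	/* thin wrapper around io_uring_register(IORING_REGISTER_IOWQ_AFF) */
	return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
}

That only reproduces what the kernel-side pinning did if the submitting
task actually stays on that CPU, so it's a reproduction aid rather than a
fix, but it may make it easier to flip the behaviour on and off without
rebuilding the kernel.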

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 18:51                                             ` Timothy Pearson
@ 2023-11-08 19:08                                               ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 19:08 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 11:51 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 12:36:01 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>>> friends), this would also cancel inflight IO.
>>>>>>>>
>>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>>> that and see if the corruption goes away:
>>>>>>>
>>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>>
>>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>>> is different here.
>>>>>>
>>>>>>> Investigation continues.
>>>>>>
>>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>>> hitting this in get_signal():
>>>>>>
>>>>>> 		/* Has this task already been marked for death? */
>>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>>> 		     signal->group_exec_task) {
>>>>>> 			clear_siginfo(&ksig->info);
>>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>>> 			recalc_sigpending();
>>>>>> 			goto fatal;
>>>>>> 		}
>>>>>>
>>>>>> which is either exec (which we verified it is not), so I don't see
>>>>>> anything other than this being a signal sent to mtr/mariadb for exit.
>>>>>>
>>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>>> this point and they get canceled and hence never written. If this does
>>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>>> gets rid of it.
>>>>>
>>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>>> unstarted requests?
>>>>
>>>> Let me hack up a debug patch for this, give me a minute.
>>>
>>> This should do it - whenever this condition hits, you should see
>>> something ala:
>>>
>>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>>>
>>> in dmesg. work_items is the number of work items we found that haven't
>>> been scheduled yet. cur is what a worker is currently processing, and
>>> next is basically a way for cancel to find a work item before it gets
>>> assigned. work_items and next may get canceled, work_items should always
>>> finish for storage IO, since signals don't interrupt them.
>>>
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..643b8e9de518 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>>> 	struct io_wq *wq = worker->wq;
>>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>>>
>>> +	WARN_ON_ONCE(do_kill);
>>> +
>>> 	do {
>>> 		struct io_wq_work *work;
>>>
>>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>>> {
>>> 	do {
>>> +		WARN_ON_ONCE(1);
>>> 		work->flags |= IO_WQ_WORK_CANCEL;
>>> 		wq->do_work(work);
>>> 		work = wq->free_work(work);
>>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
>>> 	 */
>>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>>> +		WARN_ON_ONCE(1);
>>> 		io_run_cancel(work, wq);
>>> 		return;
>>> 	}
>>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>>> 	return 0;
>>> }
>>>
>>> +struct worker_lookup {
>>> +	int cur_work;
>>> +	int next_work;
>>> +};
>>> +
>>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>>> +{
>>> +	struct worker_lookup *l = data;
>>> +
>>> +	raw_spin_lock(&worker->lock);
>>> +	if (worker->cur_work)
>>> +		l->cur_work++;
>>> +	if (worker->next_work)
>>> +		l->next_work++;
>>> +	raw_spin_unlock(&worker->lock);
>>> +	return false;
>>> +}
>>> +
>>> +void io_wq_dump(struct io_uring_task *tctx)
>>> +{
>>> +	struct io_wq_work_node *node, *prev;
>>> +	struct io_wq *wq = tctx->io_wq;
>>> +	struct worker_lookup l = { };
>>> +	int i, work_items;
>>> +
>>> +	if (!wq) {
>>> +		printk("%s: no wq\n", __FUNCTION__);
>>> +		return;
>>> +	}
>>> +
>>> +	work_items = 0;
>>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>>> +
>>> +		raw_spin_lock(&acct->lock);
>>> +		wq_list_for_each(node, prev, &acct->work_list)
>>> +			work_items++;
>>> +		raw_spin_unlock(&acct->lock);
>>> +	}
>>> +
>>> +	rcu_read_lock();
>>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>>> +	rcu_read_unlock();
>>> +
>>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>>> +							l.cur_work, l.next_work);
>>> +}
>>> +
>>> static __init int io_wq_init(void)
>>> {
>>> 	int ret;
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index ed254076c723..c0bd35e5429a 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>>>
>>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>>> +		WARN_ON_ONCE(1);
>>> fail:
>>> +		WARN_ON_ONCE(1);
>>> 		io_req_task_queue_fail(req, err);
>>> 		return;
>>> 	}
>>> diff --git a/kernel/exit.c b/kernel/exit.c
>>> index ee9f43bed49a..250ae820340c 100644
>>> --- a/kernel/exit.c
>>> +++ b/kernel/exit.c
>>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>>> 	do_exit((error_code&0xff)<<8);
>>> }
>>>
>>> +void io_wq_dump(struct io_uring_task *);
>>> +
>>> /*
>>>  * Take down every thread in the group.  This is called by fatal signals
>>>  * as well as by sys_exit_group (below).
>>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>>> 		else if (sig->group_exec_task)
>>> 			exit_code = 0;
>>> 		else {
>>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>>> +			    current->io_uring)
>>> +				io_wq_dump(current->io_uring);
>>> 			sig->group_exit_code = exit_code;
>>> 			sig->flags = SIGNAL_GROUP_EXIT;
>>> 			zap_other_threads(current);
>>
>> Unfortunately it's only returning work_items=0, cur=0, next=0, so that was a bit
>> of a red herring.
> 
> Another data point on this...the tests run in a loop, approximately 2
> seconds per test run.  For runs where things are *not* corrupted, I do
> not see the io_wq_dump message printed.  For runs that do show
> corruption, I see a batch of io_wq_dump: work_items=0, cur=0, next=0
> messages.
> 
> I wonder if what we're actually seeing here is the corruption being
> detected by mariadb, and it self-terminating in userspace, hence the
> io_uring system seeing the sudden termination of the application and
> associated workers.

I ran it here on an x86-64 box, and I see the dump for every loop of
mtr. They are all zeroes here too, but I do see it for every one. I was
just assuming that SIGKILL is perhaps how it shuts a loop down, but this
is pure speculation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 19:06                                             ` Jens Axboe
@ 2023-11-08 22:05                                               ` Jens Axboe
  2023-11-08 22:15                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 22:05 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 12:06 PM, Jens Axboe wrote:
>> I have been giving some thought to the CPU pinning of the workers, and
>> one thing that may have been overlooked is that this could potentially
>> force-serialize worker operations.  Did you just have to pin the io
>> workers or did the workqueue also need to be pinned for the corruption
>> to disappear?
> 
> Not sure I follow, the workers ARE the workqueue. For the pinning, I
> just made sure that the workers are on the same CPU. I honestly don't
> remember all the details there outside of what I can read back from the
> emails I sent, it's been a while. My suspicion back then was that it was
> some weird ppc cache aliasing effect with the copy into kernel memory
> happening on cpu X, and then we immediately punt it to cpu Y for
> processing.

You could try something like this, though I still need to verify that we
never end up running it on the wrong CPU. But may be worth a shot, for
debug purposes.

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d3009d56af0b..3fc9912f6306 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -21,6 +21,7 @@ struct io_wq_work {
 	unsigned flags;
 	/* place it here instead of io_kiocb as it fills padding and saves 4B */
 	int cancel_seq;
+	int cpu;
 };
 
 struct io_fixed_file {
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..bdaae56c6517 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -59,6 +59,7 @@ struct io_worker {
 	unsigned long create_state;
 	struct callback_head create_work;
 	int create_index;
+	int last_cpu;
 
 	union {
 		struct rcu_head rcu;
@@ -286,6 +287,10 @@ static bool io_wq_activate_free_worker(struct io_wq *wq,
 			io_worker_release(worker);
 			continue;
 		}
+		if (worker->last_cpu != raw_smp_processor_id()) {
+			io_worker_release(worker);
+			continue;
+		}
 		/*
 		 * If the worker is already running, it's either already
 		 * starting work or finishing work. In either case, if it does
@@ -581,6 +586,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 		} else {
 			break;
 		}
+		if (work->cpu != worker->last_cpu)
+			printk("work cpu %d, me %d\n", work->cpu, worker->last_cpu);
 		io_assign_current_work(worker, work);
 		__set_current_state(TASK_RUNNING);
 
@@ -639,6 +646,7 @@ static int io_wq_worker(void *data)
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
+		worker->last_cpu = raw_smp_processor_id();
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		/*
@@ -664,7 +672,10 @@ static int io_wq_worker(void *data)
 		raw_spin_unlock(&wq->lock);
 		if (io_run_task_work())
 			continue;
+		worker->last_cpu = raw_smp_processor_id();
 		ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
+		if (worker->last_cpu != raw_smp_processor_id())
+			printk("was on %d, now %d\n", worker->last_cpu, raw_smp_processor_id());
 		if (signal_pending(current)) {
 			struct ksignal ksig;
 
@@ -725,9 +736,17 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 static void io_init_new_worker(struct io_wq *wq, struct io_worker *worker,
 			       struct task_struct *tsk)
 {
+	cpumask_var_t new_mask;
+
 	tsk->worker_private = worker;
 	worker->task = tsk;
-	set_cpus_allowed_ptr(tsk, wq->cpu_mask);
+
+	if (alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		cpumask_clear(new_mask);
+		cpumask_set_cpu(worker->last_cpu, new_mask);
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+	}
 
 	raw_spin_lock(&wq->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
@@ -835,6 +854,7 @@ static bool create_io_worker(struct io_wq *wq, int index)
 	refcount_set(&worker->ref, 1);
 	worker->wq = wq;
 	raw_spin_lock_init(&worker->lock);
+	worker->last_cpu = raw_smp_processor_id();
 	init_completion(&worker->ref_done);
 
 	if (index == IO_WQ_ACCT_BOUND)
@@ -928,6 +948,8 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	unsigned work_flags = work->flags;
 	bool do_create;
 
+	work->cpu = raw_smp_processor_id();
+
 	/*
 	 * If io-wq is exiting for this task, or if the request has explicitly
 	 * been marked as one that should not get executed, cancel it here.

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:05                                               ` Jens Axboe
@ 2023-11-08 22:15                                                 ` Timothy Pearson
  2023-11-08 22:18                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 22:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 4:05:50 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>> I have been giving some thought to the CPU pinning of the workers, and
>>> one thing that may have been overlooked is that this could potentially
>>> force-serialize worker operations.  Did you just have to pin the io
>>> workers or did the workqueue also need to be pinned for the corruption
>>> to disappear?
>> 
>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>> just made sure that the workers are on the same CPU. I honestly don't
>> remember all the details there outside of what I can read back from the
>> emails I sent, it's been a while. My suspicion back then was that it was
>> some weird ppc cache aliasing effect with the copy into kernel memory
>> happening on cpu X, and then we immediately punt it to cpu Y for
>> processing.
> 
> You could try something like this, though I still need to verify that we
> never end up running it on the wrong CPU. But may be worth a shot, for
> debug purposes.

Getting a bunch of:

[   25.268005] work cpu 5, me 23
[   25.269217] work cpu 5, me 23
[   25.269804] work cpu 5, me 23
[   25.269914] work cpu 5, me 23
[   25.306967] work cpu 22, me 41
[   25.308821] work cpu 22, me 41
[   25.310461] work cpu 22, me 39
[   25.310975] work cpu 22, me 39
[   25.310995] work cpu 39, me 41
[   25.311001] work cpu 39, me 22
[   25.313916] work cpu 39, me 22
[   25.320227] work cpu 39, me 22
[   25.328760] work cpu 33, me 39
[   25.331756] work cpu 33, me 39
[   28.159191] work cpu 27, me 39
[   28.160456] work cpu 27, me 39
[   28.160511] work cpu 27, me 39
[   28.160998] work cpu 39, me 27
[   28.162042] work cpu 39, me 27
[   34.422210] work cpu 0, me 17
[   34.423543] work cpu 0, me 17
[   38.144123] work cpu 14, me 8
[   38.144284] work cpu 14, me 8
[   38.146765] work cpu 14, me 8
[   38.146884] work cpu 14, me 8
[   38.148067] work cpu 14, me 8
[   38.154664] work cpu 14, me 8
[   38.159555] work cpu 35, me 14
[   38.159752] work cpu 35, me 14
[   38.161737] work cpu 35, me 14
[   38.162896] work cpu 35, me 8
[   38.164669] work cpu 35, me 14
[   38.167428] work cpu 35, me 14
[   38.167549] work cpu 35, me 14
[   38.169941] work cpu 35, me 14
[   38.170002] work cpu 35, me 14
[   38.175143] work cpu 35, me 14
[   41.859077] work cpu 5, me 28
[   41.859918] work cpu 5, me 28
[   41.860819] work cpu 28, me 5
[   41.862670] work cpu 28, me 5
[   41.864077] work cpu 28, me 5
[   41.867216] work cpu 28, me 5
[   41.867375] work cpu 28, me 5
[   41.873102] work cpu 28, me 5
[   41.879452] work cpu 28, me 5
[   49.012472] work cpu 5, me 26
[   49.014336] work cpu 5, me 26
[   49.015319] work cpu 5, me 26
[   62.911760] work cpu 18, me 0
[   62.913764] work cpu 18, me 0
[   62.915512] work cpu 22, me 18
[   62.917665] work cpu 22, me 18
[   62.917959] work cpu 22, me 0
[   70.677984] work cpu 20, me 2
[   70.679304] work cpu 20, me 2
[   70.679375] work cpu 20, me 2
[   70.682036] work cpu 20, me 2
[   70.684209] work cpu 2, me 20
[   70.685511] work cpu 2, me 20
[   70.686503] work cpu 2, me 20
[   74.446195] work cpu 7, me 2
[   74.446473] work cpu 7, me 2
[   74.448095] work cpu 7, me 2
[   74.457561] work cpu 7, me 2
[   74.471382] work cpu 25, me 7
[   74.471970] work cpu 25, me 7
[   74.474480] work cpu 25, me 7
[   77.486199] work cpu 33, me 3
[   77.488119] work cpu 33, me 3
[   77.551875] work cpu 33, me 5
[   77.553058] work cpu 5, me 33
[   77.554057] work cpu 5, me 33
[   77.555447] work cpu 5, me 33
[   87.688015] work cpu 11, me 26
[   87.738868] work cpu 2, me 23
[   87.740051] work cpu 2, me 23
[   87.740298] work cpu 23, me 2
[   87.741811] work cpu 23, me 2
[   90.848175] work cpu 34, me 1
[   90.848866] work cpu 34, me 1
[   90.850100] work cpu 34, me 1
[   90.852995] work cpu 34, me 1
[   94.206858] work cpu 26, me 32
[   94.207193] work cpu 26, me 32
[   94.212528] work cpu 38, me 26
[   94.213082] work cpu 38, me 26
[  101.536345] work cpu 36, me 7
[  101.537108] work cpu 36, me 7
[  101.538528] work cpu 36, me 7
[  101.538710] work cpu 36, me 7
[  101.539628] work cpu 36, me 7
[  101.541145] work cpu 36, me 7
[  101.543162] work cpu 36, me 7
[  101.543241] work cpu 36, me 7
[  101.545231] work cpu 26, me 7
[  101.545292] work cpu 26, me 7
[  101.546434] work cpu 26, me 7
[  101.547094] work cpu 26, me 7
[  101.547408] work cpu 26, me 7
[  101.548091] work cpu 26, me 36
[  104.732140] work cpu 4, me 26
[  104.732518] work cpu 4, me 26
[  104.734766] work cpu 4, me 26
[  104.734877] work cpu 4, me 26
[  104.736113] work cpu 4, me 26
[  104.736401] work cpu 4, me 26
[  107.962200] work cpu 3, me 4
[  107.962338] work cpu 3, me 4
[  107.963646] work cpu 3, me 4
[  107.963752] work cpu 3, me 4
[  107.965676] work cpu 3, me 4
[  107.966404] work cpu 3, me 4
[  107.967589] work cpu 3, me 4
[  107.967638] work cpu 3, me 4
[  107.970517] work cpu 13, me 3
[  107.973168] work cpu 13, me 4
[  107.975058] work cpu 13, me 4
[  107.975133] work cpu 13, me 4
[  107.977069] work cpu 13, me 3
[  107.977617] work cpu 13, me 4
[  107.979458] work cpu 13, me 4
[  107.980097] work cpu 13, me 4
[  107.980746] work cpu 13, me 4
[  107.982719] work cpu 13, me 4
[  107.983298] work cpu 13, me 4
[  107.984179] work cpu 13, me 3
[  107.986206] work cpu 13, me 4
[  107.987192] work cpu 27, me 13
[  107.988974] work cpu 27, me 13
[  107.990738] work cpu 27, me 13
[  107.991316] work cpu 27, me 13
[  107.991608] work cpu 27, me 4
[  107.992647] work cpu 27, me 3
[  107.993783] work cpu 27, me 3
[  107.995322] work cpu 27, me 4
[  107.995893] work cpu 27, me 3
[  107.996312] work cpu 27, me 13
[  111.608060] work cpu 2, me 27
[  125.790232] work cpu 37, me 1
[  125.790587] work cpu 37, me 1
[  125.791879] work cpu 37, me 1
[  125.794301] work cpu 37, me 1
[  132.545885] work cpu 22, me 37
[  132.548194] work cpu 22, me 37
[  132.549514] work cpu 22, me 37
[  132.550110] work cpu 22, me 37
[  132.550695] work cpu 22, me 37
[  132.551730] work cpu 22, me 37
[  132.552136] work cpu 22, me 37
[  132.564102] work cpu 31, me 22
[  132.564324] work cpu 31, me 22
[  138.957170] work cpu 29, me 39
[  138.958278] work cpu 29, me 39
[  138.959922] work cpu 29, me 39
[  138.961734] work cpu 29, me 39
[  138.963112] work cpu 37, me 29
[  138.963235] work cpu 37, me 29
[  138.965017] work cpu 37, me 29
[  138.965071] work cpu 37, me 39
[  138.967289] work cpu 37, me 29
[  138.967366] work cpu 37, me 29
[  138.970359] work cpu 37, me 29
[  138.970420] work cpu 37, me 29
[  138.972597] work cpu 29, me 37
[  138.973354] work cpu 29, me 37
[  138.973656] work cpu 29, me 37
[  138.974402] work cpu 29, me 37
[  138.975024] work cpu 29, me 37
[  153.025008] work cpu 32, me 10
[  153.026528] work cpu 32, me 10
[  153.027503] work cpu 32, me 10
[  153.029120] work cpu 32, me 10

They happen both for corrupt and non-corrupt runs.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:15                                                 ` Timothy Pearson
@ 2023-11-08 22:18                                                   ` Jens Axboe
  2023-11-08 22:28                                                     ` Timothy Pearson
  2023-11-08 23:58                                                     ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 22:18 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 3:15 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 4:05:50 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>>> I have been giving some thought to the CPU pinning of the workers, and
>>>> one thing that may have been overlooked is that this could potentially
>>>> force-serialize worker operations.  Did you just have to pin the io
>>>> workers or did the workqueue also need to be pinned for the corruption
>>>> to disappear?
>>>
>>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>>> just made sure that the workers are on the same CPU. I honestly don't
>>> remember all the details there outside of what I can read back from the
>>> emails I sent, it's been a while. My suspicion back then was that it was
>>> some weird ppc cache aliasing effect with the copy into kernel memory
>>> happening on cpu X, and then we immediately punt it to cpu Y for
>>> processing.
>>
>> You could try something like this, though I still need to verify that we
>> never end up running it on the wrong CPU. But may be worth a shot, for
>> debug purposes.
> 
> Getting a bunch of:
> 
> [   25.268005] work cpu 5, me 23
> [   25.269217] work cpu 5, me 23
> [   25.269804] work cpu 5, me 23

[snip]

> [  153.029120] work cpu 32, me 10
> 
> They happen both for corrupt and non-corrupt runs.

OK, let me actually test this thing and see if I can make it solid
first...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:18                                                   ` Jens Axboe
@ 2023-11-08 22:28                                                     ` Timothy Pearson
  2023-11-08 23:58                                                     ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-08 22:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Wednesday, November 8, 2023 4:18:20 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 3:15 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Wednesday, November 8, 2023 4:05:50 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 12:06 PM, Jens Axboe wrote:
>>>>> I have been giving some thought to the CPU pinning of the workers, and
>>>>> one thing that may have been overlooked is that this could potentially
>>>>> force-serialize worker operations.  Did you just have to pin the io
>>>>> workers or did the workqueue also need to be pinned for the corruption
>>>>> to disappear?
>>>>
>>>> Not sure I follow, the workers ARE the workqueue. For the pinning, I
>>>> just made sure that the workers are on the same CPU. I honestly don't
>>>> remember all the details there outside of what I can read back from the
>>>> emails I sent, it's been a while. My suspicion back then was that it was
>>>> some weird ppc cache aliasing effect with the copy into kernel memory
>>>> happening on cpu X, and then we immediately punt it to cpu Y for
>>>> processing.
>>>
>>> You could try something like this, though I still need to verify that we
>>> never end up running it on the wrong CPU. But may be worth a shot, for
>>> debug purposes.
>> 
>> Getting a bunch of:
>> 
>> [   25.268005] work cpu 5, me 23
>> [   25.269217] work cpu 5, me 23
>> [   25.269804] work cpu 5, me 23
> 
> [snip]
> 
>> [  153.029120] work cpu 32, me 10
>> 
>> They happen both for corrupt and non-corrupt runs.
> 
> OK, let me actually test this thing and see if I can make it solid
> first...

No problem.  I'll test as soon as you have it stable on your side.

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 22:18                                                   ` Jens Axboe
  2023-11-08 22:28                                                     ` Timothy Pearson
@ 2023-11-08 23:58                                                     ` Jens Axboe
  2023-11-09 15:12                                                       ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-08 23:58 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 3:18 PM, Jens Axboe wrote:
> OK, let me actually test this thing and see if I can make it solid
> first...

Here's a suitable hack - it just creates a new io worker for each item,
ensuring that that worker is run on the same CPU.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..3a14bc35f9ba 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -921,6 +921,67 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 	return work == data;
 }
 
+struct work_data {
+	struct io_wq_work *work;
+	struct io_wq *wq;
+	int cpu;
+};
+
+static int io_wq_single_worker(void *data)
+{
+	struct work_data *wd = data;
+	struct io_wq_work *work = wd->work;
+	struct io_wq *wq = wd->wq;
+
+	WARN_ON_ONCE(wd->cpu != raw_smp_processor_id());
+	kfree(wd);
+	do {
+		if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
+			work->flags |= IO_WQ_WORK_CANCEL;
+		wq->do_work(work);
+		work = wq->free_work(work);
+	} while (work);
+
+	do_exit(0);
+	return 0;
+}
+
+static int io_wq_create_single(struct io_wq *wq, struct io_wq_work *work)
+{
+	struct task_struct *tsk;
+	struct work_data *wd;
+	cpumask_var_t new_mask;
+
+	wd = kmalloc(sizeof(*wd), GFP_NOIO);
+	if (!wd)
+		return false;
+
+	if (!alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		kfree(wd);
+		return false;
+	}
+
+	wd->work = work;
+	wd->cpu = raw_smp_processor_id();
+	wd->wq = wq;
+
+	cpumask_clear(new_mask);
+	cpumask_set_cpu(wd->cpu, new_mask);
+
+	tsk = create_io_thread(io_wq_single_worker, wd, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		tsk->worker_private = tsk;
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+		wake_up_new_task(tsk);
+		return true;
+	}
+
+	free_cpumask_var(new_mask);
+	kfree(wd);
+	return false;
+}
+
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
 	struct io_wq_acct *acct = io_work_get_acct(wq, work);
@@ -938,6 +999,10 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 		return;
 	}
 
+	if (io_wq_create_single(wq, work))
+		return;
+
+	WARN_ON_ONCE(1);
 	raw_spin_lock(&acct->lock);
 	io_wq_insert_work(wq, work);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-08 23:58                                                     ` Jens Axboe
@ 2023-11-09 15:12                                                       ` Jens Axboe
  2023-11-09 17:00                                                         ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 15:12 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/8/23 4:58 PM, Jens Axboe wrote:
> On 11/8/23 3:18 PM, Jens Axboe wrote:
>> OK, let me actually test this thing and see if I can make it solid
>> first...
> 
> Here's a suitable hack - it just creates a new io worker for each item,
> ensuring that that worker is run on the same CPU.

Turns out I didn't send you the current one, this one was older and
untested. Please try this one instead.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..41b4e281db8c 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -178,6 +178,8 @@ bool io_wq_worker_stopped(void)
 {
 	struct io_worker *worker = current->worker_private;
 
+	if (current->worker_private == current)
+		return false;
 	if (WARN_ON_ONCE(!io_wq_current_is_worker()))
 		return true;
 
@@ -693,7 +695,7 @@ void io_wq_worker_running(struct task_struct *tsk)
 {
 	struct io_worker *worker = tsk->worker_private;
 
-	if (!worker)
+	if (!worker || tsk->worker_private == tsk)
 		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
@@ -711,7 +713,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 {
 	struct io_worker *worker = tsk->worker_private;
 
-	if (!worker)
+	if (!worker || tsk->worker_private == tsk)
 		return;
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
@@ -921,6 +923,67 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 	return work == data;
 }
 
+struct work_data {
+	struct io_wq_work *work;
+	struct io_wq *wq;
+	int cpu;
+};
+
+static int io_wq_single_worker(void *data)
+{
+	struct work_data *wd = data;
+	struct io_wq_work *work = wd->work;
+	struct io_wq *wq = wd->wq;
+
+	WARN_ON_ONCE(wd->cpu != raw_smp_processor_id());
+	kfree(wd);
+	do {
+		if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
+			work->flags |= IO_WQ_WORK_CANCEL;
+		wq->do_work(work);
+		work = wq->free_work(work);
+	} while (work);
+
+	do_exit(0);
+	return 0;
+}
+
+static int io_wq_create_single(struct io_wq *wq, struct io_wq_work *work)
+{
+	struct task_struct *tsk;
+	struct work_data *wd;
+	cpumask_var_t new_mask;
+
+	wd = kmalloc(sizeof(*wd), GFP_NOIO);
+	if (!wd)
+		return false;
+
+	if (!alloc_cpumask_var(&new_mask, GFP_NOIO)) {
+		kfree(wd);
+		return false;
+	}
+
+	wd->work = work;
+	wd->cpu = raw_smp_processor_id();
+	wd->wq = wq;
+
+	cpumask_clear(new_mask);
+	cpumask_set_cpu(wd->cpu, new_mask);
+
+	tsk = create_io_thread(io_wq_single_worker, wd, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		tsk->worker_private = tsk;
+		set_cpus_allowed_ptr(tsk, new_mask);
+		free_cpumask_var(new_mask);
+		wake_up_new_task(tsk);
+		return true;
+	}
+
+	free_cpumask_var(new_mask);
+	kfree(wd);
+	return false;
+}
+
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
 	struct io_wq_acct *acct = io_work_get_acct(wq, work);
@@ -938,6 +1001,10 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 		return;
 	}
 
+	if (io_wq_create_single(wq, work))
+		return;
+
+	WARN_ON_ONCE(1);
 	raw_spin_lock(&acct->lock);
 	io_wq_insert_work(wq, work);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 15:12                                                       ` Jens Axboe
@ 2023-11-09 17:00                                                         ` Timothy Pearson
  2023-11-09 17:17                                                           ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 9:12:07 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/8/23 4:58 PM, Jens Axboe wrote:
>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>> OK, let me actually test this thing and see if I can make it solid
>>> first...
>> 
>> Here's a suitable hack - it just creates a new io worker for each item,
>> ensuring that that worker is run on the same CPU.
> 
> Turns out I didn't send you the current one, this one was older and
> untested. Please try this one instead.

I'm still seeing an oops with this newer patchset applied:
[   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid: 0)
[   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
[   76.419887] Faulting instruction address: 0xc0000000008374d4
[   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
<snip>
[   76.429125] Call Trace:
[   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150 (unreliable)
[   76.429190] [c000000009a8f870] [c000000000ead8f0] schedule_timeout+0x170/0x1e0
[   76.429238] [c000000009a8f940] [c000000000ea3fc8] io_schedule_timeout+0x68/0xa0
[   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
[   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
[   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
[   76.429415] [c000000009a8fb90] [c008000000514eac] ext4_file_write_iter+0x814/0xdb8 [ext4]
[   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
[   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
[   76.429554] [c000000009a8fdd0] [c000000000818ab4] io_wq_submit_work+0x1c4/0x2c0
[   76.429601] [c000000009a8fe20] [c000000000835578] io_wq_single_worker+0x88/0xb0
[   76.429652] [c000000009a8fe50] [c00000000000df3c] ret_from_kernel_user_thread+0x14/0x1c

If you're not seeing an oops, that's an interesting difference that I can investigate further.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:00                                                         ` Timothy Pearson
@ 2023-11-09 17:17                                                           ` Jens Axboe
  2023-11-09 17:24                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:00 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:12:07 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>> OK, let me actually test this thing and see if I can make it solid
>>>> first...
>>>
>>> Here's a suitable hack - it just creates a new io worker for each item,
>>> ensuring that that worker is run on the same CPU.
>>
>> Turns out I didn't send you the current one, this one was older and
>> untested. Please try this one instead.
> 
> I'm still seeing an oops with this newer patchset applied:
> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid: 0)
> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
> [   76.419887] Faulting instruction address: 0xc0000000008374d4
> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
> <snip>
> [   76.429125] Call Trace:
> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150 (unreliable)
> [   76.429190] [c000000009a8f870] [c000000000ead8f0] schedule_timeout+0x170/0x1e0
> [   76.429238] [c000000009a8f940] [c000000000ea3fc8] io_schedule_timeout+0x68/0xa0
> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
> [   76.429415] [c000000009a8fb90] [c008000000514eac] ext4_file_write_iter+0x814/0xdb8 [ext4]
> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
> [   76.429554] [c000000009a8fdd0] [c000000000818ab4] io_wq_submit_work+0x1c4/0x2c0
> [   76.429601] [c000000009a8fe20] [c000000000835578] io_wq_single_worker+0x88/0xb0
> [   76.429652] [c000000009a8fe50] [c00000000000df3c] ret_from_kernel_user_thread+0x14/0x1c
> 
> If you're not seeing an oops, that's an interesting difference that I
> can investigate further.

Are you sure that's with the latest patch? Because that looks like
something that'd happen with the buggy one I sent out, as the sleep/wake
handling doesn't properly handle the specialized workers.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:17                                                           ` Jens Axboe
@ 2023-11-09 17:24                                                             ` Timothy Pearson
  2023-11-09 17:30                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:17:03 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>> first...
>>>>
>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>> ensuring that that worker is run on the same CPU.
>>>
>>> Turns out I didn't send you the current one, this one was older and
>>> untested. Please try this one instead.
>> 
>> I'm still seeing an oops with this newer patchset applied:
>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>> 0)
>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>> <snip>
>> [   76.429125] Call Trace:
>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>> (unreliable)
>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>> schedule_timeout+0x170/0x1e0
>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>> io_schedule_timeout+0x68/0xa0
>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>> io_wq_submit_work+0x1c4/0x2c0
>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>> io_wq_single_worker+0x88/0xb0
>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>> ret_from_kernel_user_thread+0x14/0x1c
>> 
>> If you're not seeing an oops, that's an interesting difference that I
>> can investigate further.
> 
> Are you sure that's with the latest patch? Because that looks like
> something that'd happen with the buggy one I sent out, as the sleep/wake
> handling doesn't properly handle the specialized workers.
> 
> --
> Jens Axboe

You're right, the new patch wasn't applied; somehow it didn't copy over from my clipboard.  Apologies for the noise.

That said, the new patch is still showing the data corruption.  Maybe the pinning was just introducing the same timing alterations that my udelay() does on specific kernel builds?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:24                                                             ` Timothy Pearson
@ 2023-11-09 17:30                                                               ` Jens Axboe
  2023-11-09 17:36                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:30 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:24 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 11:17:03 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>>> first...
>>>>>
>>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>>> ensuring that that worker is run on the same CPU.
>>>>
>>>> Turns out I didn't send you the current one, this one was older and
>>>> untested. Please try this one instead.
>>>
>>> I'm still seeing an oops with this newer patchset applied:
>>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>>> 0)
>>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>>> <snip>
>>> [   76.429125] Call Trace:
>>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>>> (unreliable)
>>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>>> schedule_timeout+0x170/0x1e0
>>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>>> io_schedule_timeout+0x68/0xa0
>>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>>> io_wq_submit_work+0x1c4/0x2c0
>>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>>> io_wq_single_worker+0x88/0xb0
>>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>>> ret_from_kernel_user_thread+0x14/0x1c
>>>
>>> If you're not seeing an oops, that's an interesting difference that I
>>> can investigate further.
>>
>> Are you sure that's with the latest patch? Because that looks like
>> something that'd happen with the buggy one I sent out, as the sleep/wake
>> handling doesn't properly handle the specialized workers.
>>
>> --
>> Jens Axboe
> 
> You're right, the new patch wasn't applied, somehow it didn't copy
> over from my clipboard.  Apologies for the noise.
> 
> That said, the new patch is still showing the data corruption.  Maybe
> the pinning was just introducing the same timing alterations that my
> udelay() does on specific kernel builds?

Hmm ok, that is odd. How quickly does it trigger for you?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:30                                                               ` Jens Axboe
@ 2023-11-09 17:36                                                                 ` Timothy Pearson
  2023-11-09 17:38                                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:30:31 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:24 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 11:17:03 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/9/23 10:00 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:12:07 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 4:58 PM, Jens Axboe wrote:
>>>>>> On 11/8/23 3:18 PM, Jens Axboe wrote:
>>>>>>> OK, let me actually test this thing and see if I can make it solid
>>>>>>> first...
>>>>>>
>>>>>> Here's a suitable hack - it just creates a new io worker for each item,
>>>>>> ensuring that that worker is run on the same CPU.
>>>>>
>>>>> Turns out I didn't send you the current one, this one was older and
>>>>> untested. Please try this one instead.
>>>>
>>>> I'm still seeing an oops with this newer patchset applied:
>>>> [   76.419788] Kernel attempted to read user page (9c) - exploit attempt? (uid:
>>>> 0)
>>>> [   76.419853] BUG: Kernel NULL pointer dereference on read at 0x0000009c
>>>> [   76.419887] Faulting instruction address: 0xc0000000008374d4
>>>> [   76.419919] Oops: Kernel access of bad area, sig: 11 [#1]
>>>> <snip>
>>>> [   76.429125] Call Trace:
>>>> [   76.429142] [c000000009a8f800] [c000000000ea4e7c] schedule+0xfc/0x150
>>>> (unreliable)
>>>> [   76.429190] [c000000009a8f870] [c000000000ead8f0]
>>>> schedule_timeout+0x170/0x1e0
>>>> [   76.429238] [c000000009a8f940] [c000000000ea3fc8]
>>>> io_schedule_timeout+0x68/0xa0
>>>> [   76.429285] [c000000009a8f970] [c0000000007adcbc] blk_io_schedule+0x3c/0x70
>>>> [   76.429326] [c000000009a8f990] [c00000000065da84] __iomap_dio_rw+0x8d4/0x9f0
>>>> [   76.429374] [c000000009a8fb70] [c00000000065dea0] iomap_dio_rw+0x20/0x80
>>>> [   76.429415] [c000000009a8fb90] [c008000000514eac]
>>>> ext4_file_write_iter+0x814/0xdb8 [ext4]
>>>> [   76.429475] [c000000009a8fc60] [c000000000833aa0] io_write+0x280/0x540
>>>> [   76.429514] [c000000009a8fd80] [c0000000008183bc] io_issue_sqe+0xdc/0x390
>>>> [   76.429554] [c000000009a8fdd0] [c000000000818ab4]
>>>> io_wq_submit_work+0x1c4/0x2c0
>>>> [   76.429601] [c000000009a8fe20] [c000000000835578]
>>>> io_wq_single_worker+0x88/0xb0
>>>> [   76.429652] [c000000009a8fe50] [c00000000000df3c]
>>>> ret_from_kernel_user_thread+0x14/0x1c
>>>>
>>>> If you're not seeing an oops, that's an interesting difference that I
>>>> can investigate further.
>>>
>>> Are you sure that's with the latest patch? Because that looks like
>>> something that'd happen with the buggy one I sent out, as the sleep/wake
>>> handling doesn't properly handle the specialized workers.
>>>
>>> --
>>> Jens Axboe
>> 
>> You're right, the new patch wasn't applied, somehow it didn't copy
>> over from my clipboard.  Apologies for the noise.
>> 
>> That said, the new patch is still showing the data corruption.  Maybe
>> the pinning was just introducing the same timing alterations that my
>> udelay() does on specific kernel builds?
> 
> Hmm ok, that is odd. How quickly does it trigger for you?

Almost immediately.  Third pass in this case, which is pretty typical -- without any delays etc. it will normally fail in under 10 iterations.  When I add other delays in e.g. the worker thread it becomes less likely to fail, but still fails.   Only adding the udelay in the one specific (IRQ locked) location allows it to pass all 200 iterations.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:36                                                                 ` Timothy Pearson
@ 2023-11-09 17:38                                                                   ` Jens Axboe
  2023-11-09 17:42                                                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:38 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:36 AM, Timothy Pearson wrote:
>> Hmm ok, that is odd. How quickly does it trigger for you?
> 
> Almost immediately.  Third pass in this case, which is pretty typical
> -- without any delays etc. it will normally fail in under 10
> iterations.  When I add other delays in e.g. the worker thread it
> becomes less likely to fail, but still fails.   Only adding the udelay
> in the one specific (IRQ locked) location allows it to pass all 200
> iterations.

What specific IRQ locked location? Hard to know where you are adding
these things without seeing the patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:38                                                                   ` Jens Axboe
@ 2023-11-09 17:42                                                                     ` Timothy Pearson
  2023-11-09 17:45                                                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-09 17:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 11:38:10 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>> Hmm ok, that is odd. How quickly does it trigger for you?
>> 
>> Almost immediately.  Third pass in this case, which is pretty typical
>> -- without any delays etc. it will normally fail in under 10
>> iterations.  When I add other delays in e.g. the worker thread it
>> becomes less likely to fail, but still fails.   Only adding the udelay
>> in the one specific (IRQ locked) location allows it to pass all 200
>> iterations.
> 
> What specific IRQ locked location? Hard to know where you are adding
> these things without seeing the patches.
> 
> --
> Jens Axboe

Sorry about that.  Here's the patch that automagically makes things work (bear in mind line numbers etc. are munged since I have a bunch of other debug stuff in here at this point):

 static void io_wqe_dec_running(struct io_worker *worker)
@@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker *worker)
        if (!(worker->flags & IO_WORKER_F_UP))
                return;

+       udelay(1000);
        if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
                atomic_inc(&acct->nr_running);
                atomic_inc(&wqe->wq->worker_refs);

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:42                                                                     ` Timothy Pearson
@ 2023-11-09 17:45                                                                       ` Jens Axboe
  2023-11-09 18:20                                                                         ` tpearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-09 17:45 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 10:42 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 11:38:10 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>>> Hmm ok, that is odd. How quickly does it trigger for you?
>>>
>>> Almost immediately.  Third pass in this case, which is pretty typical
>>> -- without any delays etc. it will normally fail in under 10
>>> iterations.  When I add other delays in e.g. the worker thread it
>>> becomes less likely to fail, but still fails.   Only adding the udelay
>>> in the one specific (IRQ locked) location allows it to pass all 200
>>> iterations.
>>
>> What specific IRQ locked location? Hard to know where you are adding
>> these things without seeing the patches.
>>
>> --
>> Jens Axboe
> 
> Sorry about that.  Here's the patch that automagically makes things
> work (bear in mind line numbers etc. are munged since I have a bunch
> of other debug stuff in here at this point):
> 
>  static void io_wqe_dec_running(struct io_worker *worker)
> @@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker *worker)
>         if (!(worker->flags & IO_WORKER_F_UP))
>                 return;
> 
> +       udelay(1000);
>         if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
>                 atomic_inc(&acct->nr_running);
>                 atomic_inc(&wqe->wq->worker_refs);

Ah I see, you're on older stable? What specifically?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 17:45                                                                       ` Jens Axboe
@ 2023-11-09 18:20                                                                         ` tpearson
  2023-11-10  3:51                                                                           ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: tpearson @ 2023-11-09 18:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

> On 11/9/23 10:42 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 11:38:10 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/9/23 10:36 AM, Timothy Pearson wrote:
>>>>> Hmm ok, that is odd. How quickly does it trigger for you?
>>>>
>>>> Almost immediately.  Third pass in this case, which is pretty typical
>>>> -- without any delays etc. it will normally fail in under 10
>>>> iterations.  When I add other delays in e.g. the worker thread it
>>>> becomes less likely to fail, but still fails.   Only adding the udelay
>>>> in the one specific (IRQ locked) location allows it to pass all 200
>>>> iterations.
>>>
>>> What specific IRQ locked location? Hard to know where you are adding
>>> these things without seeing the patches.
>>>
>>> --
>>> Jens Axboe
>>
>> Sorry about that.  Here's the patch that automagically makes things
>> work (bear in mind line numbers etc. are munged since I have a bunch
>> of other debug stuff in here at this point):
>>
>>  static void io_wqe_dec_running(struct io_worker *worker)
>> @@ -312,19 +326,39 @@ static void io_wqe_dec_running(struct io_worker
>> *worker)
>>         if (!(worker->flags & IO_WORKER_F_UP))
>>                 return;
>>
>> +       udelay(1000);
>>         if (atomic_dec_and_test(&acct->nr_running) &&
>> io_wqe_run_queue(wqe)) {
>>                 atomic_inc(&acct->nr_running);
>>                 atomic_inc(&wqe->wq->worker_refs);
>
> Ah I see, you're on older stable? What specifically?

I run two test branches, one right after the issues showed up (5.12-rc7+),
and one basically at GIT master (6.6+).  I tend to test on 5.12 just so
that we're not inadvertently compounding issues; once we figure out what
is wrong on 5.12, I can try to forward-port to 6.6.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-09 18:20                                                                         ` tpearson
@ 2023-11-10  3:51                                                                           ` Jens Axboe
  2023-11-10  4:35                                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-10  3:51 UTC (permalink / raw)
  To: tpearson; +Cc: regressions, Pavel Begunkov

Just to go back to basics, can you try this one? It'll do the exact same
retry that io-wq is doing, just from the same task itself. If this
fails, then something core is wrong. I don't think it will, or we'd see
this on other platforms too of course. If this works, then it validates
that it's some oddity on ppc with punting this operation to a thread off
this main task.

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 64390d4e20c1..1d760570df04 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
-int io_write(struct io_kiocb *req, unsigned int issue_flags)
+static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	struct io_rw_state __s, *s = &__s;
@@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+int io_write(struct io_kiocb *req, unsigned int issue_flags)
+{
+	int ret;
+
+	ret = __io_write(req, issue_flags);
+	if (ret != -EAGAIN)
+		return ret;
+
+	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
+	WARN_ON_ONCE(ret == -EAGAIN);
+	return ret;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  3:51                                                                           ` Jens Axboe
@ 2023-11-10  4:35                                                                             ` Timothy Pearson
  2023-11-10  6:48                                                                               ` Timothy Pearson
  2023-11-10 14:48                                                                               ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-10  4:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 9:51:09 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> Just to go back to basics, can you try this one? It'll do the exact same
> retry that io-wq is doing, just from the same task itself. If this
> fails, then something core is wrong. I don't think it will, or we'd see
> this on other platforms too of course. If this works, then it validates
> that it's some oddity on ppc with punting this operation to a thread off
> this main task.
> 
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 64390d4e20c1..1d760570df04 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
> issue_flags)
> 	return IOU_OK;
> }
> 
> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
> {
> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> 	struct io_rw_state __s, *s = &__s;
> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
> issue_flags)
> 	return ret;
> }
> 
> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	int ret;
> +
> +	ret = __io_write(req, issue_flags);
> +	if (ret != -EAGAIN)
> +		return ret;
> +
> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
> +	WARN_ON_ONCE(ret == -EAGAIN);
> +	return ret;
> +}
> +
> void io_rw_fail(struct io_kiocb *req)
> {
> 	int res;
> 

That does indeed "fix" the corruption issue.

Where is the punting actually taking place?  I can see at least one location but if it's a general issue with the punting process I should probably apply any test mitigations to all locations, and I'm not familiar enough with the codebase to be sure I've got them all...

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  4:35                                                                             ` Timothy Pearson
@ 2023-11-10  6:48                                                                               ` Timothy Pearson
  2023-11-10 14:52                                                                                 ` Jens Axboe
  2023-11-10 14:48                                                                               ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-10  6:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Thursday, November 9, 2023 10:35:08 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:51:09 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> Just to go back to basics, can you try this one? It'll do the exact same
>> retry that io-wq is doing, just from the same task itself. If this
>> fails, then something core is wrong. I don't think it will, or we'd see
>> this on other platforms too of course. If this works, then it validates
>> that it's some oddity on ppc with punting this operation to a thread off
>> this main task.
>> 
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 64390d4e20c1..1d760570df04 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return IOU_OK;
>> }
>> 
>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>> {
>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> 	struct io_rw_state __s, *s = &__s;
>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return ret;
>> }
>> 
>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	int ret;
>> +
>> +	ret = __io_write(req, issue_flags);
>> +	if (ret != -EAGAIN)
>> +		return ret;
>> +
>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>> +	WARN_ON_ONCE(ret == -EAGAIN);
>> +	return ret;
>> +}
>> +
>> void io_rw_fail(struct io_kiocb *req)
>> {
>> 	int res;
>> 
> 
> That does indeed "fix" the corruption issue.
> 
> Where is the punting actually taking place?  I can see at least one location but
> if it's a general issue with the punting process I should probably apply any
> test mitigations to all locations, and I'm not familiar enough with the
> codebase to be sure I've got them all...
> 
> Thanks!

I've been exploring a bunch of other possibilities, and one that has been slowly coalescing is whether we're triggering a bug somewhere else in the kernel.  Now that I know the io_write call is somehow related to this issue, I went back and went over some of the earlier logs, and might have found something.

When I enable KCSAN I sporadically see this type of race:

[ 1549.152381] ==================================================================
[ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
[ 1549.152609]
[ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on cpu 27:
[ 1549.153193]  dd_has_work+0x160/0x1b0
[ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
[ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
[ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
[ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
[ 1549.153556]  __blk_flush_plug+0x2bc/0x360
[ 1549.153622]  blk_finish_plug+0x60/0xa0
[ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
[ 1549.153759]  iomap_dio_rw+0x80/0xf0
[ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
[ 1549.154249]  io_write+0x4bc/0x900
[ 1549.154309]  io_issue_sqe+0x12c/0x5f0
[ 1549.154370]  io_submit_sqes+0xdd4/0x1050
[ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
[ 1549.154499]  system_call_exception+0x354/0x400
[ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
[ 1549.154651]
[ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
[ 1549.154757]  dd_insert_requests+0x81c/0xac0
[ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
[ 1549.154902]  __blk_flush_plug+0x2bc/0x360
[ 1549.154968]  blk_finish_plug+0x60/0xa0
[ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
[ 1549.155100]  iomap_dio_rw+0x80/0xf0
[ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
[ 1549.155563]  io_write+0x4bc/0x900
[ 1549.155606]  io_issue_sqe+0x12c/0x5f0
[ 1549.155648]  io_wq_submit_work+0x2e4/0x490
[ 1549.155692]  io_worker_handle_work+0xbac/0x1020
[ 1549.155745]  io_wq_worker+0x224/0x7b0
[ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
[ 1549.155841]
[ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
[ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
[ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[ 1549.156032] ==================================================================

Notably, the io_write calls are in the chain, and there were exactly two of these races and two test failures in this test run.

io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.

Do you see similar races on x86 with this workload if you enable KCSAN?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  4:35                                                                             ` Timothy Pearson
  2023-11-10  6:48                                                                               ` Timothy Pearson
@ 2023-11-10 14:48                                                                               ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-10 14:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 9:35 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 9:51:09 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> Just to go back to basics, can you try this one? It'll do the exact same
>> retry that io-wq is doing, just from the same task itself. If this
>> fails, then something core is wrong. I don't think it will, or we'd see
>> this on other platforms too of course. If this works, then it validates
>> that it's some oddity on ppc with punting this operation to a thread off
>> this main task.
>>
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 64390d4e20c1..1d760570df04 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return IOU_OK;
>> }
>>
>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>> {
>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> 	struct io_rw_state __s, *s = &__s;
>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>> issue_flags)
>> 	return ret;
>> }
>>
>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	int ret;
>> +
>> +	ret = __io_write(req, issue_flags);
>> +	if (ret != -EAGAIN)
>> +		return ret;
>> +
>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>> +	WARN_ON_ONCE(ret == -EAGAIN);
>> +	return ret;
>> +}
>> +
>> void io_rw_fail(struct io_kiocb *req)
>> {
>> 	int res;
>>
> 
> That does indeed "fix" the corruption issue.
> 
> Where is the punting actually taking place?  I can see at least one
> location but if it's a general issue with the punting process I should
> probably apply any test mitigations to all locations, and I'm not
> familiar enough with the codebase to be sure I've got them all...

Usually io_write() would return -EAGAIN if it cannot perform the
operation nonblocking, in which case we'd ultimately end up in
io_req_task_submit() -> io_queue_iowq() -> io_wq_enqueue() and the
latter would insert it into the pending list for io-wq to process. I
don't think it's a general issue with punting; this happens for reads
too, for example, and it seems things work fine if we just don't punt
writes. The wrong data is being written, which is why I keep suspecting
some page cache or cache aliasing issues here.
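
For illustration, the flow being described is roughly the following
(pseudocode of the control flow only, not the actual io_uring
implementation):

static void issue_write_sketch(struct io_kiocb *req)
{
	/* first attempt runs nonblocking, from the submitting task */
	int ret = io_write(req, IO_URING_F_NONBLOCK);

	if (ret != -EAGAIN)
		return;		/* completed (or failed) inline */

	/*
	 * -EAGAIN: punt to io-wq via io_queue_iowq() -> io_wq_enqueue();
	 * an io-wq worker later runs io_wq_submit_work() -> io_issue_sqe()
	 * -> io_write() again, this time without IO_URING_F_NONBLOCK so
	 * it is allowed to block.
	 */
	io_queue_iowq(req);
}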

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10  6:48                                                                               ` Timothy Pearson
@ 2023-11-10 14:52                                                                                 ` Jens Axboe
  2023-11-11 18:42                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-10 14:52 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/9/23 11:48 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Thursday, November 9, 2023 10:35:08 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> Just to go back to basics, can you try this one? It'll do the exact same
>>> retry that io-wq is doing, just from the same task itself. If this
>>> fails, then something core is wrong. I don't think it will, or we'd see
>>> this on other platforms too of course. If this works, then it validates
>>> that it's some oddity on ppc with punting this operation to a thread off
>>> this main task.
>>>
>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>> index 64390d4e20c1..1d760570df04 100644
>>> --- a/io_uring/rw.c
>>> +++ b/io_uring/rw.c
>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>> issue_flags)
>>> 	return IOU_OK;
>>> }
>>>
>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> {
>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>> 	struct io_rw_state __s, *s = &__s;
>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>> issue_flags)
>>> 	return ret;
>>> }
>>>
>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = __io_write(req, issue_flags);
>>> +	if (ret != -EAGAIN)
>>> +		return ret;
>>> +
>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>> +	return ret;
>>> +}
>>> +
>>> void io_rw_fail(struct io_kiocb *req)
>>> {
>>> 	int res;
>>>
>>
>> That does indeed "fix" the corruption issue.
>>
>> Where is the punting actually taking place?  I can see at least one location but
>> if it's a general issue with the punting process I should probably apply any
>> test mitigations to all locations, and I'm not familiar enough with the
>> codebase to be sure I've got them all...
>>
>> Thanks!
> 
> I've been exploring a bunch of other possibilities, and one that has
> been slowly coalescing is whether we're triggering a bug somewhere
> else in the kernel.  Now that I know the io_write call is somehow
> related to this issue, I went back and went over some of the earlier
> logs, and might have found something.
> 
> When I enable KCSAN I sporadically see this type of race:
> 
> [ 1549.152381] ==================================================================
> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
> [ 1549.152609]
> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on cpu 27:
> [ 1549.153193]  dd_has_work+0x160/0x1b0
> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
> [ 1549.153622]  blk_finish_plug+0x60/0xa0
> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
> [ 1549.154249]  io_write+0x4bc/0x900
> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
> [ 1549.154499]  system_call_exception+0x354/0x400
> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
> [ 1549.154651]
> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
> [ 1549.154968]  blk_finish_plug+0x60/0xa0
> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
> [ 1549.155563]  io_write+0x4bc/0x900
> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
> [ 1549.155745]  io_wq_worker+0x224/0x7b0
> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
> [ 1549.155841]
> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
> [ 1549.156032] ==================================================================
> 
> Notably, the io_write calls are in the chain, and there were exactly
> two of these races and two test failures in this test run.
> 
> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.

The above race would just cause hung IO in the IO scheduler; it would
not lead to corruption. The io_write() call would be call_write_iter(),
not sure where you get the other one from?

In any case, when I ran this test case last time, I just used /dev/shm/
as the backing store and it still hit. No io scheduler would be
involved there.

> Do you see similar races on x86 with this workload if you enable KCSAN?

I haven't tried.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-10 14:52                                                                                 ` Jens Axboe
@ 2023-11-11 18:42                                                                                   ` Timothy Pearson
  2023-11-11 18:58                                                                                     ` Jens Axboe
  2023-11-11 21:57                                                                                     ` Timothy Pearson
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 18:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov

----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Friday, November 10, 2023 8:52:05 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Jens Axboe" <axboe@kernel.dk>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>> retry that io-wq is doing, just from the same task itself. If this
>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>> this on other platforms too of course. If this works, then it validates
>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>> this main task.
>>>>
>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>> index 64390d4e20c1..1d760570df04 100644
>>>> --- a/io_uring/rw.c
>>>> +++ b/io_uring/rw.c
>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>> issue_flags)
>>>> 	return IOU_OK;
>>>> }
>>>>
>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> {
>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>> 	struct io_rw_state __s, *s = &__s;
>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>> issue_flags)
>>>> 	return ret;
>>>> }
>>>>
>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	ret = __io_write(req, issue_flags);
>>>> +	if (ret != -EAGAIN)
>>>> +		return ret;
>>>> +
>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> void io_rw_fail(struct io_kiocb *req)
>>>> {
>>>> 	int res;
>>>>
>>>
>>> That does indeed "fix" the corruption issue.
>>>
>>> Where is the punting actually taking place?  I can see at least one location but
>>> if it's a general issue with the punting process I should probably apply any
>>> test mitigations to all locations, and I'm not familiar enough with the
>>> codebase to be sure I've got them all...
>>>
>>> Thanks!
>> 
>> I've been exploring a bunch of other possibilities, and one that has
>> been slowly coalescing is whether we're triggering a bug somewhere
>> else in the kernel.  Now that I know the io_write call is somehow
>> related to this issue, I went back and went over some of the earlier
>> logs, and might have found something.
>> 
>> When I enable KCSAN I sporadically see this type of race:
>> 
>> [ 1549.152381]
>> ==================================================================
>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>> [ 1549.152609]
>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>> cpu 27:
>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>> [ 1549.154249]  io_write+0x4bc/0x900
>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>> [ 1549.154499]  system_call_exception+0x354/0x400
>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>> [ 1549.154651]
>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>> [ 1549.155563]  io_write+0x4bc/0x900
>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>> [ 1549.155841]
>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>> [ 1549.156032]
>> ==================================================================
>> 
>> Notably, the io_write calls are in the chain, and there were exactly
>> two of these races and two test failures in this test run.
>> 
>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
> 
> The above race would just cause hung IO in the IO scheduler; it would
> not lead to corruption. The io_write() call would be call_write_iter(),
> not sure where you get the other one from?
> 
> In any case, when I ran this test case last time, I just used /dev/shm/
> as the backing store and it still hit. No io scheduler would be
> involved there.

Fair enough.  Was grasping at straws a bit that night.

Quick update on-list, it seems MariaDB uses io_uring for write then tries to go back and do a standard synchronous read.  The data is valid on-disk at some point after the read (i.e. after the process exits, the data is confirmed valid on-disk), but the read itself returns corrupt / stale / garbage data.  MariaDB is the only application I've seen that tries to mix io_uring and standard I/O operations on the same file, and this may be playing into the issues observed.

Investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:42                                                                                   ` Timothy Pearson
@ 2023-11-11 18:58                                                                                     ` Jens Axboe
  2023-11-11 19:04                                                                                       ` Timothy Pearson
  2023-11-11 21:57                                                                                     ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 18:58 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 11:42 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Friday, November 10, 2023 8:52:05 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>> this on other platforms too of course. If this works, then it validates
>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>> this main task.
>>>>>
>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>> --- a/io_uring/rw.c
>>>>> +++ b/io_uring/rw.c
>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return IOU_OK;
>>>>> }
>>>>>
>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> {
>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +{
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags);
>>>>> +	if (ret != -EAGAIN)
>>>>> +		return ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>> {
>>>>> 	int res;
>>>>>
>>>>
>>>> That does indeed "fix" the corruption issue.
>>>>
>>>> Where is the punting actually taking place?  I can see at least one location but
>>>> if it's a general issue with the punting process I should probably apply any
>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>> codebase to be sure I've got them all...
>>>>
>>>> Thanks!
>>>
>>> I've been exploring a bunch of other possibilities, and one that has
>>> been slowly coalescing is whether we're triggering a bug somewhere
>>> else in the kernel.  Now that I know the io_write call is somehow
>>> related to this issue, I went back and went over some of the earlier
>>> logs, and might have found something.
>>>
>>> When I enable KCSAN I sporadically see this type of race:
>>>
>>> [ 1549.152381]
>>> ==================================================================
>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>> [ 1549.152609]
>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>> cpu 27:
>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.154249]  io_write+0x4bc/0x900
>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>> [ 1549.154651]
>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.155563]  io_write+0x4bc/0x900
>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>> [ 1549.155841]
>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>> [ 1549.156032]
>>> ==================================================================
>>>
>>> Notably, the io_write calls are in the chain, and there were exactly
>>> two of these races and two test failures in this test run.
>>>
>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>
>> The above race would just cause hung IO in the IO scheduler; it would
>> not lead to corruption. The io_write() call would be call_write_iter(),
>> not sure where you get the other one from?
>>
>> In any case, when I ran this test case last time, I just used /dev/shm/
>> as the backing store and it still hit. No io scheduler would be
>> involved there.
> 
> Fair enough.  Was grasping at straws a bit that night.
> 
> Quick update on-list, it seems MariaDB uses io_uring for write then
> tries to go back and do a standard synchronous read.  The data is
> valid on-disk at some point after the read (i.e. after the process
> exits, the data is confirmed valid on-disk), but the read itself
> returns corrupt / stale / garbage data.  MariaDB is the only
> application I've seen that tries to mix io_uring and standard I/O
> operations on the same file, and this may be playing into the issues
> observed.

Nope, it's fine to mix and match. You obviously cannot issue a read
(sync or otherwise) before the async write has completed and expect sane
results, and I strongly suspect this is what is going on here...

Either that, or a mix of buffered and O_DIRECT, which is also a recipe
for disaster if you expect consistency.
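
To make that ordering rule concrete, here is a minimal liburing sketch
(hypothetical file name and sizes, not taken from MariaDB): the async
write's CQE must be reaped before a synchronous read of the same range
is meaningful.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char wbuf[4096], rbuf[4096];
	int fd, ret;

	fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	memset(wbuf, 0xaa, sizeof(wbuf));

	/* queue an async 4k write at offset 0 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, wbuf, sizeof(wbuf), 0);
	io_uring_submit(&ring);

	/* reap the write's CQE before touching that range again */
	if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res != (int)sizeof(wbuf))
		return 1;
	io_uring_cqe_seen(&ring, cqe);

	/* only now is a plain synchronous read of the range well-defined */
	ret = pread(fd, rbuf, sizeof(rbuf), 0);
	printf("read %d bytes, match=%d\n", ret,
	       ret == (int)sizeof(wbuf) && !memcmp(wbuf, rbuf, sizeof(rbuf)));

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}

If the read is issued before that CQE is reaped, stale data is a legal
outcome rather than a kernel bug.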

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:58                                                                                     ` Jens Axboe
@ 2023-11-11 19:04                                                                                       ` Timothy Pearson
  2023-11-11 19:11                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 19:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 12:58:03 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>> this main task.
>>>>>>
>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>> --- a/io_uring/rw.c
>>>>>> +++ b/io_uring/rw.c
>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>> issue_flags)
>>>>>> 	return IOU_OK;
>>>>>> }
>>>>>>
>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> {
>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>> issue_flags)
>>>>>> 	return ret;
>>>>>> }
>>>>>>
>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +{
>>>>>> +	int ret;
>>>>>> +
>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>> +	if (ret != -EAGAIN)
>>>>>> +		return ret;
>>>>>> +
>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>> +	return ret;
>>>>>> +}
>>>>>> +
>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>> {
>>>>>> 	int res;
>>>>>>
>>>>>
>>>>> That does indeed "fix" the corruption issue.
>>>>>
>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>> if it's a general issue with the punting process I should probably apply any
>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>> codebase to be sure I've got them all...
>>>>>
>>>>> Thanks!
>>>>
>>>> I've been exploring a bunch of other possibilities, and one that has
>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>> related to this issue, I went back and went over some of the earlier
>>>> logs, and might have found something.
>>>>
>>>> When I enable KCSAN I sporadically see this type of race:
>>>>
>>>> [ 1549.152381]
>>>> ==================================================================
>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>> [ 1549.152609]
>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>> cpu 27:
>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>> [ 1549.154651]
>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>> [ 1549.155841]
>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>> [ 1549.156032]
>>>> ==================================================================
>>>>
>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>> two of these races and two test failures in this test run.
>>>>
>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>
>>> The above race would just cause hung IO in the IO scheduler; it would
>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>> not sure where you get the other one from?
>>>
>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>> as the backing store and it still hit. No io scheduler would be
>>> involved there.
>> 
>> Fair enough.  Was grasping at straws a bit that night.
>> 
>> Quick update on-list, it seems MariaDB uses io_uring for write then
>> tries to go back and do a standard synchronous read.  The data is
>> valid on-disk at some point after the read (i.e. after the process
>> exits, the data is confirmed valid on-disk), but the read itself
>> returns corrupt / stale / garbage data.  MariaDB is the only
>> application I've seen that tries to mix io_uring and standard I/O
>> operations on the same file, and this may be playing into the issues
>> observed.
> 
> Nope, it's fine to mix and match. You obviously cannot issue a read
> (sync or otherwise) before the async write has completed and expect sane
> results, and I strongly suspect this is what is going on here...

Yep, agreed it's fine, just that most apps don't do that so we're in potentially less-tested territory. :)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:04                                                                                       ` Timothy Pearson
@ 2023-11-11 19:11                                                                                         ` Jens Axboe
  2023-11-11 19:15                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 19:11 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 12:04 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 12:58:03 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>> this main task.
>>>>>>>
>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>> --- a/io_uring/rw.c
>>>>>>> +++ b/io_uring/rw.c
>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>> issue_flags)
>>>>>>> 	return IOU_OK;
>>>>>>> }
>>>>>>>
>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> {
>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>> issue_flags)
>>>>>>> 	return ret;
>>>>>>> }
>>>>>>>
>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> +{
>>>>>>> +	int ret;
>>>>>>> +
>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>> +	if (ret != -EAGAIN)
>>>>>>> +		return ret;
>>>>>>> +
>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>> +	return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>> {
>>>>>>> 	int res;
>>>>>>>
>>>>>>
>>>>>> That does indeed "fix" the corruption issue.
>>>>>>
>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>> codebase to be sure I've got them all...
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>> related to this issue, I went back and went over some of the earlier
>>>>> logs, and might have found something.
>>>>>
>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>
>>>>> [ 1549.152381]
>>>>> ==================================================================
>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>> [ 1549.152609]
>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>> cpu 27:
>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>> [ 1549.154651]
>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>> [ 1549.155841]
>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>> [ 1549.156032]
>>>>> ==================================================================
>>>>>
>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>> two of these races and two test failures in this test run.
>>>>>
>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>
>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>> not sure where you get the other one from?
>>>>
>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>> as the backing store and it still hit. No io scheduler would be
>>>> involved there.
>>>
>>> Fair enough.  Was grasping at straws a bit that night.
>>>
>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>> tries to go back and do a standard synchronous read.  The data is
>>> valid on-disk at some point after the read (i.e. after the process
>>> exits, the data is confirmed valid on-disk), but the read itself
>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>> application I've seen that tries to mix io_uring and standard I/O
>>> operations on the same file, and this may be playing into the issues
>>> observed.
>>
>> Nope, it's fine to mix and match. You obviously cannot issue a read
>> (sync or otherwise) before the async write has completed and expect sane
>> results, and I strongly suspect this is what is going on here...
> 
> Yep, agreed it's fine, just that most apps don't do that so we're in
> potentially less-tested territory. :)

I don't think this is true. If you're doing buffered IO, your
synchronization is the page cache. This is 100% true regardless of
whether you use read/pread or io_uring to do the read, only difference
is the delivery mechanism for the read. But if you do:

threadA				threadB
start via write(2)
				io_uring_enter()
					submit_read()
write(2) completes

or

io_uring_enter()
	submit_write()
				read(2)
	write completes

then it's obviously broken. This has nothing to do with io_uring, which
uses the page cache for IO just like any other buffered IO syscall that
would do IO, sync or async.

If I were to guess, mariadb considers the write stable when it's been
submitted. If the read, sync or async, is submitted right after that,
then it would be completely valid to return stale data as the write
isn't done yet. You're at the mercy of timing at that point, which may
be why this shows up as a regression from 5.10.158 to 5.10.162, as
timing likely changed with the switch from kthread to io-wq native
workers.
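
To make the required ordering concrete, here's a minimal liburing-style
sketch (placeholder file name and sizes, error handling trimmed, and not
taken from mariadb's code): the only safe pattern is to reap the write
CQE before issuing the read.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char wbuf[4096], rbuf[4096];
	int fd;

	memset(wbuf, 0xab, sizeof(wbuf));
	fd = open("testfile", O_RDWR | O_CREAT, 0644);	/* placeholder path */
	io_uring_queue_init(8, &ring, 0);

	/* queue and submit the async write */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, wbuf, sizeof(wbuf), 0);
	io_uring_submit(&ring);

	/* the write is only known complete once its CQE has been reaped */
	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res != (int) sizeof(wbuf))
		fprintf(stderr, "write failed or was short: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	/* only now is a sync read guaranteed to observe the new data */
	pread(fd, rbuf, sizeof(rbuf), 0);

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}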

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:11                                                                                         ` Jens Axboe
@ 2023-11-11 19:15                                                                                           ` Timothy Pearson
  2023-11-11 19:23                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 19:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 1:11:14 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/11/23 12:04 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Saturday, November 11, 2023 12:58:03 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>>> this main task.
>>>>>>>>
>>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>>> --- a/io_uring/rw.c
>>>>>>>> +++ b/io_uring/rw.c
>>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>>> issue_flags)
>>>>>>>> 	return IOU_OK;
>>>>>>>> }
>>>>>>>>
>>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> {
>>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>>> issue_flags)
>>>>>>>> 	return ret;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>> +{
>>>>>>>> +	int ret;
>>>>>>>> +
>>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>>> +	if (ret != -EAGAIN)
>>>>>>>> +		return ret;
>>>>>>>> +
>>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>>> {
>>>>>>>> 	int res;
>>>>>>>>
>>>>>>>
>>>>>>> That does indeed "fix" the corruption issue.
>>>>>>>
>>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>>> codebase to be sure I've got them all...
>>>>>>>
>>>>>>> Thanks!
>>>>>>
>>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>>> related to this issue, I went back and went over some of the earlier
>>>>>> logs, and might have found something.
>>>>>>
>>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>>
>>>>>> [ 1549.152381]
>>>>>> ==================================================================
>>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>>> [ 1549.152609]
>>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>>> cpu 27:
>>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>>> [ 1549.154651]
>>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>>> [ 1549.155841]
>>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>>> [ 1549.156032]
>>>>>> ==================================================================
>>>>>>
>>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>>> two of these races and two test failures in this test run.
>>>>>>
>>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>>
>>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>>> not sure where you get the other one from?
>>>>>
>>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>>> as the backing store and it still hit. No io scheduler would be
>>>>> involved there.
>>>>
>>>> Fair enough.  Was grasping at straws a bit that night.
>>>>
>>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>>> tries to go back and do a standard synchronous read.  The data is
>>>> valid on-disk at some point after the read (i.e. after the process
>>>> exits, the data is confirmed valid on-disk), but the read itself
>>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>>> application I've seen that tries to mix io_uring and standard I/O
>>>> operations on the same file, and this may be playing into the issues
>>>> observed.
>>>
>>> Nope, it's fine to mix and match. You obviously cannot issue a read
>>> (sync or otherwise) before the async write has completed and expect sane
>>> results, and I strongly suspect this is what is going on here...
>> 
>> Yep, agreed it's fine, just that most apps don't do that so we're in
>> potentially less-tested territory. :)
> 
> I don't think this is true. If you're doing buffered IO, your
> synchronization is the page cache. This is 100% true regardless of
> whether you use read/pread or io_uring to do the read, only difference
> is the delivery mechanism for the read. But if you do:
> 
> threadA				threadB
> start via write(2)
>				io_uring_enter()
>					submit_read()
> write(2) completes
> 
> or
> 
> io_uring_enter()
>	submit_write()
>				read(2)
>	write completes
> 
> then it's obviously broken.

Agreed.

> This has nothing to do with io_uring, which
> uses the page cache for IO just like any other buffered IO syscall that
> would do IO, sync or async.

Understood.  At this point I'm much more familiar with the io_uring write path than with the read path, having been up and down the former over the past few days.

> If I were to guess, mariadb considers the write stable when it's been
> submitted. If the read, sync or async, is submitted right after that,
> then it would be completely valid to return stale data as the write
> isn't done yet. You're at the mercy of timing at that point, which may
> be why this shows up as a regression from 5.10.158 to 5.10.162, as
> timing likely changed with the switch from kthread to io-wq native
> workers.

That's something I need to figure out, and also why it only seems to hit ppc64 (though that could just be ppc64 being more likely than other arches to trigger the race due to timing or similar).  From what I can tell MariaDB does try to do an fsync() before the read, but if I understand correctly that won't do much if the io_uring writes haven't actually completed first...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 19:15                                                                                           ` Timothy Pearson
@ 2023-11-11 19:23                                                                                             ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-11 19:23 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/11/23 12:15 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 1:11:14 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/11/23 12:04 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Saturday, November 11, 2023 12:58:03 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/11/23 11:42 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Friday, November 10, 2023 8:52:05 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>>>>>> this on other platforms too of course. If this works, then it validates
>>>>>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>>>>>> this main task.
>>>>>>>>>
>>>>>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>>>>>> --- a/io_uring/rw.c
>>>>>>>>> +++ b/io_uring/rw.c
>>>>>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>>>>>> issue_flags)
>>>>>>>>> 	return IOU_OK;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> {
>>>>>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>>>>>> issue_flags)
>>>>>>>>> 	return ret;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>>>> +{
>>>>>>>>> +	int ret;
>>>>>>>>> +
>>>>>>>>> +	ret = __io_write(req, issue_flags);
>>>>>>>>> +	if (ret != -EAGAIN)
>>>>>>>>> +		return ret;
>>>>>>>>> +
>>>>>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>>>>>> +	return ret;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>>>>>> {
>>>>>>>>> 	int res;
>>>>>>>>>
>>>>>>>>
>>>>>>>> That does indeed "fix" the corruption issue.
>>>>>>>>
>>>>>>>> Where is the punting actually taking place?  I can see at least one location but
>>>>>>>> if it's a general issue with the punting process I should probably apply any
>>>>>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>>>>>> codebase to be sure I've got them all...
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>
>>>>>>> I've been exploring a bunch of other possibilities, and one that has
>>>>>>> been slowly coalescing is whether we're triggering a bug somewhere
>>>>>>> else in the kernel.  Now that I know the io_write call is somehow
>>>>>>> related to this issue, I went back and went over some of the earlier
>>>>>>> logs, and might have found something.
>>>>>>>
>>>>>>> When I enable KCSAN I sporadically see this type of race:
>>>>>>>
>>>>>>> [ 1549.152381]
>>>>>>> ==================================================================
>>>>>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>>>>>> [ 1549.152609]
>>>>>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>>>>>> cpu 27:
>>>>>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>>>>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>>>>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>>>>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>>>>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>>>>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>>>>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>>>>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>>>>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>>>>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>>> [ 1549.154249]  io_write+0x4bc/0x900
>>>>>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>>>>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>>>>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>>>>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>>>>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>>>>>> [ 1549.154651]
>>>>>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>>>>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>>>>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>>>>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>>>>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>>>>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>>>>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>>>>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>>>>>> [ 1549.155563]  io_write+0x4bc/0x900
>>>>>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>>>>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>>>>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>>>>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>>>>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>>>>>> [ 1549.155841]
>>>>>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>>>>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>>>>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>>>>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>>>>>> [ 1549.156032]
>>>>>>> ==================================================================
>>>>>>>
>>>>>>> Notably, the io_write calls are in the chain, and there were exactly
>>>>>>> two of these races and two test failures in this test run.
>>>>>>>
>>>>>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>>>>>>
>>>>>> The above race would just cause hung IO in the IO scheduler, it would
>>>>>> not lead to corruption. The io_write() call would be call_write_iter(),
>>>>>> not sure where you get the other one from?
>>>>>>
>>>>>> In any case, when I ran this test case last time, I just used /dev/shm/
>>>>>> as the backing store and it still hit. No io scheduler would be
>>>>>> involved there.
>>>>>
>>>>> Fair enough.  Was grasping at straws a bit that night.
>>>>>
>>>>> Quick update on-list, it seems MariaDB uses io_uring for write then
>>>>> tries to go back and do a standard synchronous read.  The data is
>>>>> valid on-disk at some point after the read (i.e. after the process
>>>>> exits, the data is confirmed valid on-disk), but the read itself
>>>>> returns corrupt / stale / garbage data.  MariaDB is the only
>>>>> application I've seen that tries to mix io_uring and standard I/O
>>>>> operations on the same file, and this may be playing into the issues
>>>>> observed.
>>>>
>>>> Nope, it's fine to mix and match. You obviously cannot issue a read
>>>> (sync or otherwise) before the async write has completed and expect sane
>>>> results, and I strongly suspect this is what is going on here...
>>>
>>> Yep, agreed it's fine, just that most apps don't do that so we're in
>>> potentially less-tested territory. :)
>>
>> I don't think this is true. If you're doing buffered IO, your
>> synchronization is the page cache. This is 100% true regardless of
>> whether you use read/pread or io_uring to do the read, only difference
>> is the delivery mechanism for the read. But if you do:
>>
>> threadA				threadB
>> start via write(2)
>> 				io_uring_enter()
>> 					submit_read()
>> write(2) completes
>>
>> or
>>
>> io_uring_enter()
>> 	submit_write()
>> 				read(2)
>> 	write completes
>>
>> then it's obviously broken.
> 
> Agreed.
> 
>> This has nothing to do with io_uring, which
>> uses the page cache for IO just like any other buffered IO syscall that
>> would do IO, sync or async.
> 
> Understood.  At this point I'm much more familiar with the io_uring
> write path than with the read path, having been up and down the former
> over the past few days.

There's not much difference between them, actually. The write side will
call ->write_iter() to do the write, and the read side ->read_iter().
The only real difference is that the read side has some logic for
buffered reads, where it relies on read-ahead to kick off IO for pages
and get a callback via io_async_buf_func() when the pages are unlocked
(eg IO to them is done).

>> If I were to guess, mariadb considers the write stable when it's been
>> submitted. If the read, sync or async, is submitted right after that,
>> then it would be completely valid to return stale data as the write
>> isn't done yet. You're at the mercy of timing at that point, which may
>> be why this shows up as a regression from 5.10.158 to 5.10.162, as
>> timing likely changed with the switch from kthread to io-wq native
>> workers.
> 
> That's something I need to figure out, and also why it only seems to
> hit ppc64 (though that could just be ppc64 being more likely than
> other arches to trigger the race due to timing or similar).
> From what I can tell MariaDB does try to do an fsync() before the read,
> but if I understand correctly that won't do much if the io_uring
> writes haven't actually completed first...

It could just be timing which is causing it on ppc64, who knows. Maybe
once we fully understand the issue it'll become clear!

The fsync() may be fine as long as the write has actually been started. But
that is not guaranteed if mariadb assumes the write has started simply because
io_uring_enter() has submitted it. IOW, you could see:

threadA				threadB
io_uring_enter()
	submit_write(fd)
		queue io-wq
io_uring_enter() returns
				fsync(fd) <- does nothing
io-wq
	submit_write(fd)
	write completes

It again boils down to when to consider the write completed, and that is
when the CQE is visible for it. No assumptions should be made about the
write start before that, not even that it may have been started after
io_uring_enter() returns. It may very well be started and even complete
at that point, but the application has no way of knowing that. When the
CQE is posted for the write, then it knows for a fact that the write is
done.
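
If the goal is to keep the fsync() without waiting for the write CQE in
between, one way (a sketch below with assumed names and trimmed error
handling; I have no idea if this maps onto what mariadb actually does) is
to link the fsync behind the write with IOSQE_IO_LINK, so it cannot start
until the write has completed:

#include <liburing.h>

/* Sketch: queue a write and an fsync as a linked pair so the fsync cannot
 * run before the write completes. ring/fd/buf are assumed to be set up by
 * the caller. */
static int write_then_fsync(struct io_uring *ring, int fd,
			    const void *buf, unsigned len, __u64 off)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int i, ret;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;	/* fsync below only starts after this write completes */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, fd, 0);

	ret = io_uring_submit(ring);
	if (ret < 0)
		return ret;

	/* the data is only known stable once both CQEs have been reaped */
	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		ret = cqe->res < 0 ? cqe->res : 0;
		io_uring_cqe_seen(ring, cqe);
		if (ret < 0)
			return ret;
	}
	return 0;
}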

-- 
Jens Axboe
		

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 18:42                                                                                   ` Timothy Pearson
  2023-11-11 18:58                                                                                     ` Jens Axboe
@ 2023-11-11 21:57                                                                                     ` Timothy Pearson
  2023-11-13 17:06                                                                                       ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-11 21:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 12:42:39 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Friday, November 10, 2023 8:52:05 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/9/23 11:48 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Thursday, November 9, 2023 10:35:08 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Thursday, November 9, 2023 9:51:09 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> Just to go back to basics, can you try this one? It'll do the exact same
>>>>> retry that io-wq is doing, just from the same task itself. If this
>>>>> fails, then something core is wrong. I don't think it will, or we'd see
>>>>> this on other platforms too of course. If this works, then it validates
>>>>> that it's some oddity on ppc with punting this operation to a thread off
>>>>> this main task.
>>>>>
>>>>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>>>>> index 64390d4e20c1..1d760570df04 100644
>>>>> --- a/io_uring/rw.c
>>>>> +++ b/io_uring/rw.c
>>>>> @@ -968,7 +968,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return IOU_OK;
>>>>> }
>>>>>
>>>>> -int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +static int __io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> {
>>>>> 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>>>> 	struct io_rw_state __s, *s = &__s;
>>>>> @@ -1092,6 +1092,19 @@ int io_write(struct io_kiocb *req, unsigned int
>>>>> issue_flags)
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> +int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>>>> +{
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags);
>>>>> +	if (ret != -EAGAIN)
>>>>> +		return ret;
>>>>> +
>>>>> +	ret = __io_write(req, issue_flags & ~IO_URING_F_NONBLOCK);
>>>>> +	WARN_ON_ONCE(ret == -EAGAIN);
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> void io_rw_fail(struct io_kiocb *req)
>>>>> {
>>>>> 	int res;
>>>>>
>>>>
>>>> That does indeed "fix" the corruption issue.
>>>>
>>>> Where is the punting actually taking place?  I can see at least one location but
>>>> if it's a general issue with the punting process I should probably apply any
>>>> test mitigations to all locations, and I'm not familiar enough with the
>>>> codebase to be sure I've got them all...
>>>>
>>>> Thanks!
>>> 
>>> I've been exploring a bunch of other possibilities, and one that has
>>> been slowly coalescing is whether we're triggering a bug somewhere
>>> else in the kernel.  Now that I know the io_write call is somehow
>>> related to this issue, I went back and went over some of the earlier
>>> logs, and might have found something.
>>> 
>>> When I enable KCSAN I sporadically see this type of race:
>>> 
>>> [ 1549.152381]
>>> ==================================================================
>>> [ 1549.152515] BUG: KCSAN: data-race in dd_has_work / dd_insert_requests
>>> [ 1549.152609]
>>> [ 1549.152644] read (marked) to 0xc0000000080c2e98 of 8 bytes by task 3372 on
>>> cpu 27:
>>> [ 1549.153193]  dd_has_work+0x160/0x1b0
>>> [ 1549.153259]  __blk_mq_sched_dispatch_requests+0x42c/0xdf0
>>> [ 1549.153331]  blk_mq_sched_dispatch_requests+0xe4/0x120
>>> [ 1549.153403]  blk_mq_run_hw_queue+0x358/0x390
>>> [ 1549.153479]  blk_mq_flush_plug_list+0x8fc/0xea0
>>> [ 1549.153556]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.153622]  blk_finish_plug+0x60/0xa0
>>> [ 1549.153689]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.153759]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.153825]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.154249]  io_write+0x4bc/0x900
>>> [ 1549.154309]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.154370]  io_submit_sqes+0xdd4/0x1050
>>> [ 1549.154429]  sys_io_uring_enter+0x344/0x15d0
>>> [ 1549.154499]  system_call_exception+0x354/0x400
>>> [ 1549.154569]  system_call_vectored_common+0x15c/0x2ec
>>> [ 1549.154651]
>>> [ 1549.154685] write to 0xc0000000080c2e98 of 8 bytes by task 3667 on cpu 32:
>>> [ 1549.154757]  dd_insert_requests+0x81c/0xac0
>>> [ 1549.154825]  blk_mq_flush_plug_list+0x8ec/0xea0
>>> [ 1549.154902]  __blk_flush_plug+0x2bc/0x360
>>> [ 1549.154968]  blk_finish_plug+0x60/0xa0
>>> [ 1549.155034]  __iomap_dio_rw+0xd28/0x1140
>>> [ 1549.155100]  iomap_dio_rw+0x80/0xf0
>>> [ 1549.155166]  ext4_file_write_iter+0x9f8/0xff0 [ext4]
>>> [ 1549.155563]  io_write+0x4bc/0x900
>>> [ 1549.155606]  io_issue_sqe+0x12c/0x5f0
>>> [ 1549.155648]  io_wq_submit_work+0x2e4/0x490
>>> [ 1549.155692]  io_worker_handle_work+0xbac/0x1020
>>> [ 1549.155745]  io_wq_worker+0x224/0x7b0
>>> [ 1549.155792]  ret_from_kernel_user_thread+0x14/0x1c
>>> [ 1549.155841]
>>> [ 1549.155864] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1549.155904] CPU: 32 PID: 3667 Comm: iou-wrk-3372 Not tainted 6.6.0+ #10
>>> [ 1549.155961] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw)
>>> 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
>>> [ 1549.156032]
>>> ==================================================================
>>> 
>>> Notably, the io_write calls are in the chain, and there were exactly
>>> two of these races and two test failures in this test run.
>>> 
>>> io_write+0x4bc seems to be WRITE_ONCE(list->first, last->next) in wq_list_cut.
>> 
>> The above race would just cause hung IO in the IO scheduler, it would
>> not lead to corruption. The io_write() call would be call_write_iter(),
>> not sure where you get the other one from?
>> 
>> In any case, when I ran this test case last time, I just used /dev/shm/
>> as the backing store and it still hit. No io scheduler would be
>> involved there.
> 
> Fair enough.  Was grasping at straws a bit that night.
> 
> Quick update on-list, it seems MariaDB uses io_uring for write then tries to go
> back and do a standard synchronous read.  The data is valid on-disk at some
> point after the read (i.e. after the process exits, the data is confirmed valid
> on-disk), but the read itself returns corrupt / stale / garbage data.  MariaDB
> is the only application I've seen that tries to mix io_uring and standard I/O
> operations on the same file, and this may be playing into the issues observed.
> 
> Investigation continues...

Unfortunately I got led down a rabbit hole here.  With the tests I was running, MariaDB writes the encrypted data separately from the normal un-encrypted page header and checksums, and the internal encrypted data on disk was corrupt while the outer page checksum was still valid.

I have since switched to the main.xa_prepared_binlog_off test, which shows the corruption more easily in the on-disk format.  We are still apparently dealing with a write path issue, which makes more sense given the nature of the corruption observed on production systems.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-11 21:57                                                                                     ` Timothy Pearson
@ 2023-11-13 17:06                                                                                       ` Timothy Pearson
  2023-11-13 17:39                                                                                         ` Jens Axboe
  2023-11-13 20:47                                                                                         ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 17:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov

----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
> <asml.silence@gmail.com>
> Sent: Saturday, November 11, 2023 3:57:23 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
> MariaDB writes the encrypted data separately from the normal un-encrypted page
> header and checksums, and the internal encrypted data on disk was corrupt while
> the outer page checksum was still valid.
> 
> I have since switched to the main.xa_prepared_binlog_off test, which shows the
> corruption more easily in the on-disk format.  We are still apparently dealing
> with a write path issue, which makes more sense given the nature of the
> corruption observed on production systems.

Quick status update -- after considerable effort applied I've managed to narrow down what is going wrong, but still need to locate the root cause.  The bug is incredibly timing-dependent, therefore it is difficult to instrument the code paths I need without causing it to disappear.

What we're dealing with is a wild write to RAM of some sort, provoked by the exact timing of some of the encryption tests in mariadb.  I've caught the wild write a few times now, it is not in the standard io_uring write path but instead appears to be triggered (somehow) by the io worker punting process.

When the bug is hit, and if all other conditions are exactly correct, *something* (still to be identified) writes 32 bytes of gibberish into one of the mariadb in-RAM database pages at a random offset.  This wild write occurs right before the page is encrypted for write to disk via io_uring.  I have confirmed that the post-encryption data in RAM is written to disk without any additional corruption, and is then read back out from disk into the page verification routine also without any additional corruption.  The page verification routine decrypts the data from disk, thus restoring the decrypted data that contains the wild write data stamped somewhere on it, where we then hit the corruption warning and halt the test run.

Irritatingly, if I try to instrument the data flow in the application right before the encryption routine, the bug disappears (or, more precisely, is masked).  If I had to guess from these symptoms, I'd suspect the application io worker thread is waking up, grabbing wrong context from somewhere, and scribbling some kind of status data into memory, which rarely ends up being on top of one of the in-RAM database pages.  This could be an application issue or a kernel issue, I'm not sure yet, but given the precise timing requirements I'm less and less surprised this is only showing on ppc64 right now.
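
For illustration, the sort of minimal check I mean is sketched below (purely a sketch with placeholder names, not the actual instrumentation or real MariaDB symbols); anything much heavier than this shifts the timing enough to mask the bug:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Sketch only: checksum the page buffer when it is queued for flushing and
 * re-check immediately before the encryption-for-write step; a mismatch
 * means something scribbled on the buffer in that window. The page/page_size
 * names are placeholders, not actual MariaDB buffer pool symbols. */
static uint32_t page_canary(const unsigned char *page, size_t page_size)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < page_size; i++)
		sum = (sum * 31) + page[i];	/* trivial rolling hash, not a real CRC */
	return sum;
}

static void check_canary(const unsigned char *page, size_t page_size,
			 uint32_t expected)
{
	if (page_canary(page, page_size) != expected)
		fprintf(stderr, "page modified between queueing and encryption\n");
}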

As always, investigation continues...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:06                                                                                       ` Timothy Pearson
@ 2023-11-13 17:39                                                                                         ` Jens Axboe
  2023-11-13 19:02                                                                                           ` Timothy Pearson
  2023-11-13 20:47                                                                                         ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 17:39 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 AM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Saturday, November 11, 2023 3:57:23 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>> header and checksums, and the internal encrypted data on disk was corrupt while
>> the outer page checksum was still valid.
>>
>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>> corruption more easily in the on-disk format.  We are still apparently dealing
>> with a write path issue, which makes more sense given the nature of the
>> corruption observed on production systems.
> 
> Quick status update -- after considerable effort applied I've managed
> to narrow down what is going wrong, but still need to locate the root
> cause.  The bug is incredibly timing-dependent, therefore it is
> difficult to instrument the code paths I need without causing it to
> disappear.
> 
> What we're dealing with is a wild write to RAM of some sort, provoked
> by the exact timing of some of the encryption tests in mariadb.  I've
> caught the wild write a few times now, it is not in the standard
> io_uring write path but instead appears to be triggered (somehow) by
> the io worker punting process.
> 
> When the bug is hit, and if all other conditions are exactly correct,
> *something* (still to be identified) writes 32 bytes of gibberish into
> one of the mariadb in-RAM database pages at a random offset.  This
> wild write occurs right before the page is encrypted for write to disk
> via io_uring.  I have confirmed that the post-encryption data in RAM
> is written to disk without any additional corruption, and is then read
> back out from disk into the page verification routine also without any
> additional corruption.  The page verification routine decrypts the
> data from disk, thus restoring the decrypted data that contains the
> wild write data stamped somewhere on it, where we then hit the
> corruption warning and halt the test run.
> 
> Irritatingly, if I try to instrument the data flow in the application
> right before the encryption routine, the bug disappears (or, more
> precisely, is masked).  If I had to guess from these symptoms, I'd
> suspect the application io worker thread is waking up, grabbing wrong
> context from somewhere, and scribbling some kind of status data into
> memory, which rarely ends up being on top of one of the in-RAM
> database pages.  This could be an application issue or a kernel issue,
> I'm not sure yet, but given the precise timing requirements I'm less
> and less surprised this is only showing on ppc64 right now.
> 
> As always, investigation continues...

I wonder if this has to do with copy_thread() on powerpc - so not
necessarily ppc memory ordering related, but just something in the arch
specific copy section.

I took a look back, and the initial change actually forgot ppc. Since
then, there's been an attempt to make this generic:

commit 5bd2e97c868a8a44470950ed01846cab6328e540
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Apr 12 10:18:48 2022 -0500

    fork: Generalize PF_IO_WORKER handling

and later a powerpc change related to that too:

commit eed7c420aac7fde5e5915d2747c3ebbbda225835
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Sat Mar 25 22:29:01 2023 +1000

    powerpc: copy_thread differentiate kthreads and user mode threads

Just stabbing in the dark a bit here as I won't pretend to understand
the finer details of powerpc thread creation, but maybe try with this
and see if it makes any difference.

As you note in your reply, we could very well be corrupting some bytes
somewhere every time. We just only notice quickly when it happens to be
in that specific buffer.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..d4dec2fd091c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 
 	klp_init_thread_info(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 
 		/* Create initial minimum stack frame. */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:39                                                                                         ` Jens Axboe
@ 2023-11-13 19:02                                                                                           ` Timothy Pearson
  2023-11-13 19:29                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 19:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 11:39:30 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>> "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>> the outer page checksum was still valid.
>>>
>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>> with a write path issue, which makes more sense given the nature of the
>>> corruption observed on production systems.
>> 
>> Quick status update -- after considerable effort applied I've managed
>> to narrow down what is going wrong, but still need to locate the root
>> cause.  The bug is incredibly timing-dependent, therefore it is
>> difficult to instrument the code paths I need without causing it to
>> disappear.
>> 
>> What we're dealing with is a wild write to RAM of some sort, provoked
>> by the exact timing of some of the encryption tests in mariadb.  I've
>> caught the wild write a few times now, it is not in the standard
>> io_uring write path but instead appears to be triggered (somehow) by
>> the io worker punting process.
>> 
>> When the bug is hit, and if all other conditions are exactly correct,
>> *something* (still to be identified) writes 32 bytes of gibberish into
>> one of the mariadb in-RAM database pages at a random offset.  This
>> wild write occurs right before the page is encrypted for write to disk
>> via io_uring.  I have confirmed that the post-encryption data in RAM
>> is written to disk without any additional corruption, and is then read
>> back out from disk into the page verification routine also without any
>> additional corruption.  The page verification routine decrypts the
>> data from disk, thus restoring the decrypted data that contains the
>> wild write data stamped somewhere on it, where we then hit the
>> corruption warning and halt the test run.
>> 
>> Irritatingly, if I try to instrument the data flow in the application
>> right before the encryption routine, the bug disappears (or, more
>> precisely, is masked).  If I had to guess from these symptoms, I'd
>> suspect the application io worker thread is waking up, grabbing wrong
>> context from somewhere, and scribbling some kind of status data into
>> memory, which rarely ends up being on top of one of the in-RAM
>> database pages.  This could be an application issue or a kernel issue,
>> I'm not sure yet, but given the precise timing requirements I'm less
>> and less surprised this is only showing on ppc64 right now.
>> 
>> As always, investigation continues...
> 
> I wonder if this has to do with copy_thread() on powerpc - so not
> necessarily ppc memory ordering related, but just something in the arch
> specific copy section.
> 
> I took a look back, and the initial change actually forgot ppc. Since
> then, there's been an attempt to make this generic:
> 
> commit 5bd2e97c868a8a44470950ed01846cab6328e540
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Tue Apr 12 10:18:48 2022 -0500
> 
>    fork: Generalize PF_IO_WORKER handling
> 
> and later a powerpc change related to that too:
> 
> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Sat Mar 25 22:29:01 2023 +1000
> 
>    powerpc: copy_thread differentiate kthreads and user mode threads
> 
> Just stabbing in the dark a bit here as I won't pretend to understand
> the finer details of powerpc thread creation, but maybe try with this
> and see if it makes any difference.
> 
> As you note in your reply, we could very well be corrupting some bytes
> somewhere every time. We just only notice quickly when it happens to be
> in that specific buffer.
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 392404688cec..d4dec2fd091c 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
> kernel_clone_args *args)
> 
> 	klp_init_thread_info(p);
> 
> -	if (unlikely(p->flags & PF_KTHREAD)) {
> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> 		/* kernel thread */
> 
> 		/* Create initial minimum stack frame. */

Good idea, but didn't work unfortunately.  Any other suggestions welcome while I continue to debug...

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 19:02                                                                                           ` Timothy Pearson
@ 2023-11-13 19:29                                                                                             ` Jens Axboe
  2023-11-13 20:58                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 19:29 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 12:02 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 11:39:30 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>> "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>> the outer page checksum was still valid.
>>>>
>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>> with a write path issue, which makes more sense given the nature of the
>>>> corruption observed on production systems.
>>>
>>> Quick status update -- after considerable effort applied I've managed
>>> to narrow down what is going wrong, but still need to locate the root
>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>> difficult to instrument the code paths I need without causing it to
>>> disappear.
>>>
>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>> caught the wild write a few times now, it is not in the standard
>>> io_uring write path but instead appears to be triggered (somehow) by
>>> the io worker punting process.
>>>
>>> When the bug is hit, and if all other conditions are exactly correct,
>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>> one of the mariadb in-RAM database pages at a random offset.  This
>>> wild write occurs right before the page is encrypted for write to disk
>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>> is written to disk without any additional corruption, and is then read
>>> back out from disk into the page verification routine also without any
>>> additional corruption.  The page verification routine decrypts the
>>> data from disk, thus restoring the decrypted data that contains the
>>> wild write data stamped somewhere on it, where we then hit the
>>> corruption warning and halt the test run.
>>>
>>> Irritatingly, if I try to instrument the data flow in the application
>>> right before the encryption routine, the bug disappears (or, more
>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>> suspect the application io worker thread is waking up, grabbing wrong
>>> context from somewhere, and scribbling some kind of status data into
>>> memory, which rarely ends up being on top of one of the in-RAM
>>> database pages.  This could be an application issue or a kernel issue,
>>> I'm not sure yet, but given the precise timing requirements I'm less
>>> and less surprised this is only showing on ppc64 right now.
>>>
>>> As always, investigation continues...
>>
>> I wonder if this has to do with copy_thread() on powerpc - so not
>> necessarily ppc memory ordering related, but just something in the arch
>> specific copy section.
>>
>> I took a look back, and the initial change actually forgot ppc. Since
>> then, there's been an attempt to make this generic:
>>
>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>> Author: Eric W. Biederman <ebiederm@xmission.com>
>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>
>>    fork: Generalize PF_IO_WORKER handling
>>
>> and later a powerpc change related to that too:
>>
>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>> Author: Nicholas Piggin <npiggin@gmail.com>
>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>
>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>
>> Just stabbing in the dark a bit here as I won't pretend to understand
>> the finer details of powerpc thread creation, but maybe try with this
>> and see if it makes any difference.
>>
>> As you note in your reply, we could very well be corrupting some bytes
>> somewhere every time. We just only notice quickly when it happens to be
>> in that specific buffer.
>>
>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>> index 392404688cec..d4dec2fd091c 100644
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>> kernel_clone_args *args)
>>
>> 	klp_init_thread_info(p);
>>
>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>> 		/* kernel thread */
>>
>> 		/* Create initial minimum stack frame. */
> 
> Good idea, but didn't work unfortunately.  Any other suggestions
> welcome while I continue to debug...

I ponder if it's still in there... I don't see what else could be poking
and causing user memory corruption.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 17:06                                                                                       ` Timothy Pearson
  2023-11-13 17:39                                                                                         ` Jens Axboe
@ 2023-11-13 20:47                                                                                         ` Jens Axboe
  2023-11-13 21:08                                                                                           ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 20:47 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 AM, Timothy Pearson wrote:
> When the bug is hit, and if all other conditions are exactly correct,
> *something* (still to be identified) writes 32 bytes of gibberish into
> one of the mariadb in-RAM database pages at a random offset.

Do you have one or more examples of what those 32 bytes look like?
Ideally more than one. Would be nice if we could deduce something from
the content there. A long shot, but...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 19:29                                                                                             ` Jens Axboe
@ 2023-11-13 20:58                                                                                               ` Timothy Pearson
  2023-11-13 21:22                                                                                                 ` Timothy Pearson
  2023-11-13 22:15                                                                                                 ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 20:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 1:29:38 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>> "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>> the outer page checksum was still valid.
>>>>>
>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>> with a write path issue, which makes more sense given the nature of the
>>>>> corruption observed on production systems.
>>>>
>>>> Quick status update -- after considerable effort applied I've managed
>>>> to narrow down what is going wrong, but still need to locate the root
>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>> difficult to instrument the code paths I need without causing it to
>>>> disappear.
>>>>
>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>> caught the wild write a few times now, it is not in the standard
>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>> the io worker punting process.
>>>>
>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>> wild write occurs right before the page is encrypted for write to disk
>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>> is written to disk without any additional corruption, and is then read
>>>> back out from disk into the page verification routine also without any
>>>> additional corruption.  The page verification routine decrypts the
>>>> data from disk, thus restoring the decrypted data that contains the
>>>> wild write data stamped somewhere on it, where we then hit the
>>>> corruption warning and halt the test run.
>>>>
>>>> Irritatingly, if I try to instrument the data flow in the application
>>>> right before the encryption routine, the bug disappears (or, more
>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>> context from somewhere, and scribbling some kind of status data into
>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>> database pages.  This could be an application issue or a kernel issue,
>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>> and less surprised this is only showing on ppc64 right now.
>>>>
>>>> As always, investigation continues...
>>>
>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>> necessarily ppc memory ordering related, but just something in the arch
>>> specific copy section.
>>>
>>> I took a look back, and the initial change actually forgot ppc. Since
>>> then, there's been an attempt to make this generic:
>>>
>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>
>>>    fork: Generalize PF_IO_WORKER handling
>>>
>>> and later a powerpc change related to that too:
>>>
>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>
>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>
>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>> the finer details of powerpc thread creation, but maybe try with this
>>> and see if it makes any difference.
>>>
>>> As you note in your reply, we could very well be corrupting some bytes
>>> somewhere every time. We just only notice quickly when it happens to be
>>> in that specific buffer.
>>>
>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>> index 392404688cec..d4dec2fd091c 100644
>>> --- a/arch/powerpc/kernel/process.c
>>> +++ b/arch/powerpc/kernel/process.c
>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>> kernel_clone_args *args)
>>>
>>> 	klp_init_thread_info(p);
>>>
>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>> 		/* kernel thread */
>>>
>>> 		/* Create initial minimum stack frame. */
>> 
>> Good idea, but didn't work unfortunately.  Any other suggestions
>> welcome while I continue to debug...
> 
> I ponder if it's still in there... I don't see what else could be poking
> and causing user memory corruption.

Indeed, I'm really scratching my head on this one.  The corruption is definitely present, and quite sensitive to exact timing around page encryption start -- presumably if the wild write happens after the encryption routine finishes, it no longer matters for this particular test suite.

Trying to find sensitive areas in the kernel, I hacked it up to always punt at least once per write -- no change in how often the corruption occurred.  I also hacked it up to try to keep I/O workers around vs. constantly tearing them down and respawning them, with no real change observed in corruption frequency, possibly because even with that we still end up creating a new I/O worker every so often.

What did have a major effect was hacking the kernel to both punt at least once per write *and* to aggressively exit I/O worker threads, indicating that something in thread setup or teardown is stomping on memory.  When I say "aggressively exit I/O worker threads", I basically did this:

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..84bfb8b9f068 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -677,6 +688,7 @@ static int io_wq_worker(void *data)
                        exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
                                                        wq->cpu_mask);
                }
+last_timeout = true;
        }

That single change made it fail on the first or second pass vs somewhere around the 10th pass.

I do note we have our own io_uring-specific thread clone function, create_io_thread(), and I wonder if that clone function does something functionally different on ppc64 than the regular clone function?  Need to dig further...
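
For reference, create_io_thread() boils down to roughly this (paraphrasing kernel/fork.c, so the exact fields may differ from tree to tree):

struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
{
        unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                              CLONE_THREAD | CLONE_IO;
        struct kernel_clone_args args = {
                .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                    CLONE_UNTRACED) & ~CSIGNAL),
                .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                .fn             = fn,           /* io_wq_worker() for io-wq workers */
                .fn_arg         = arg,
                .io_thread      = 1,            /* copy_process() sets PF_IO_WORKER from this */
        };

        return copy_process(NULL, 0, node, &args);
}

The key bits are CLONE_VM -- the worker shares the submitting task's address space, so a stray write through a user address lands straight in application memory -- and io_thread = 1, which is what sets PF_IO_WORKER and steers copy_thread() down the arch-specific path discussed above.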

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:47                                                                                         ` Jens Axboe
@ 2023-11-13 21:08                                                                                           ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 21:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 2:47:13 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>> When the bug is hit, and if all other conditions are exactly correct,
>> *something* (still to be identified) writes 32 bytes of gibberish into
>> one of the mariadb in-RAM database pages at a random offset.
> 
> Do you have one or more examples of what those 32 bytes look like?
> Ideally more than one. Would be nice if we could deduce something from
> the content there. A long shot, but...

Assuming we're on byte boundaries, here are a few examples.  The problem is that, since the offset is random, it's hard to know for certain how many zeroes start (or end) the block, so I've trimmed the zeroes off both ends of these three examples, with the exception of a leading zero needed to get to a byte boundary:

45713881df1783472b45a32d373f59b6
3063d0146087fdfd7d03cc2ff3523588
0fbfcd0b37bc3b861c5bd51c4f0f1365

And I just realized I misspoke: it's 16 bytes (32 nibbles).  They don't really look like pointers, and they don't correlate with anything on disk.  I had already tried to figure out what this data is without much success; it's not even program code...
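
One way to make "don't really look like pointers" concrete: interpret the 16 bytes as two 64-bit words and check whether either one falls in a plausible address range, i.e. a ppc64 kernel linear-map address starting with 0xc..., or a typical below-2^47 userspace address.  A purely illustrative userspace sketch:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: does a 16-byte blob decode into two plausible addresses? */
static bool looks_like_pointers(const unsigned char blob[16])
{
        uint64_t w[2];
        int i;

        memcpy(w, blob, sizeof(w));     /* host byte order */
        for (i = 0; i < 2; i++) {
                bool kernelish = (w[i] >> 60) == 0xc;          /* 0xc000... linear map */
                bool userish = w[i] && (w[i] >> 47) == 0;      /* below the typical 47-bit user VA span */
                if (kernelish || userish)
                        return true;
        }
        return false;
}

None of the three samples above pass a check like this, at least at this alignment.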

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:58                                                                                               ` Timothy Pearson
@ 2023-11-13 21:22                                                                                                 ` Timothy Pearson
  2023-11-13 22:15                                                                                                 ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 21:22 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 2:58:30 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>> <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 1:29:38 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>> 
>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>> "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>> the outer page checksum was still valid.
>>>>>>
>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>> corruption observed on production systems.
>>>>>
>>>>> Quick status update -- after considerable effort applied I've managed
>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>> difficult to instrument the code paths I need without causing it to
>>>>> disappear.
>>>>>
>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>> caught the wild write a few times now, it is not in the standard
>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>> the io worker punting process.
>>>>>
>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>> is written to disk without any additional corruption, and is then read
>>>>> back out from disk into the page verification routine also without any
>>>>> additional corruption.  The page verification routine decrypts the
>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>> corruption warning and halt the test run.
>>>>>
>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>> right before the encryption routine, the bug disappears (or, more
>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>> context from somewhere, and scribbling some kind of status data into
>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>
>>>>> As always, investigation continues...
>>>>
>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>> necessarily ppc memory ordering related, but just something in the arch
>>>> specific copy section.
>>>>
>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>> then, there's been an attempt to make this generic:
>>>>
>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>
>>>>    fork: Generalize PF_IO_WORKER handling
>>>>
>>>> and later a powerpc change related to that too:
>>>>
>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>
>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>
>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>> the finer details of powerpc thread creation, but maybe try with this
>>>> and see if it makes any difference.
>>>>
>>>> As you note in your reply, we could very well be corrupting some bytes
>>>> somewhere every time. We just only notice quickly when it happens to be
>>>> in that specific buffer.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>> index 392404688cec..d4dec2fd091c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>> kernel_clone_args *args)
>>>>
>>>> 	klp_init_thread_info(p);
>>>>
>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>> 		/* kernel thread */
>>>>
>>>> 		/* Create initial minimum stack frame. */
>>> 
>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>> welcome while I continue to debug...
>> 
>> I ponder if it's still in there... I don't see what else could be poking
>> and causing user memory corruption.
> 
> Indeed, I'm really scratching my head on this one.  The corruption is definitely
> present, and quite sensitive to exact timing around page encryption start --
> presumably if the wild write happens after the encryption routine finishes, it
> no longer matters for this particular test suite.
> 
> Trying to find sensitive areas in the kernel, I hacked it up to always punt at
> least once per write -- no change in how often the corruption occurred.  I also
> hacked it up to try to keep I/O workers around vs. constantly tearing them down
> and respawning them, with no real change observed in corruption frequency,
> possibly because even with that we still end up creating a new I/O worker every
> so often.
> 
> What did have a major effect was hacking the kernel to both punt at least once
> per write *and* to aggressively exit I/O worker threads, indicating that
> something in thread setup or teardown is stomping on memory.  When I say
> "aggressively exit I/O worker threads", I basically did this:
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..84bfb8b9f068 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -677,6 +688,7 @@ static int io_wq_worker(void *data)
>                        exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
>                                                        wq->cpu_mask);
>                }
> +last_timeout = true;
>        }
> 
> That single change made it fail on the first or second pass vs somewhere around
> the 10th pass.
> 
> I do note we have our own io_uring-specific thread clone function,
> create_io_thread(), and I wonder if that clone function does something
> functionally different on ppc64 than the regular clone function?  Need to dig
> further...

Correction, that was an older patch.  The actual rapid-timeout patch needs another "last_timeout = true;" line added right before the "if (io_run_task_work())" check in the same function, to ensure the worker exits ASAP even with work sitting on the queue.
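
In other words, the extra assignment goes roughly here, on top of the earlier hunk (context lines approximate):

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ static int io_wq_worker(void *data)
 		last_timeout = false;
 		__io_worker_idle(wq, worker);
 		raw_spin_unlock(&wq->lock);
+		/* hack: force the timeout/exit path even when task_work makes us loop */
+		last_timeout = true;
 		if (io_run_task_work())
 			continue;
 		ret = schedule_timeout(WORKER_IDLE_TIMEOUT);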

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 20:58                                                                                               ` Timothy Pearson
  2023-11-13 21:22                                                                                                 ` Timothy Pearson
@ 2023-11-13 22:15                                                                                                 ` Jens Axboe
  2023-11-13 23:19                                                                                                   ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 22:15 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 1:58 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 1:29:38 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>> "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>> the outer page checksum was still valid.
>>>>>>
>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>> corruption observed on production systems.
>>>>>
>>>>> Quick status update -- after considerable effort applied I've managed
>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>> difficult to instrument the code paths I need without causing it to
>>>>> disappear.
>>>>>
>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>> caught the wild write a few times now, it is not in the standard
>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>> the io worker punting process.
>>>>>
>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>> is written to disk without any additional corruption, and is then read
>>>>> back out from disk into the page verification routine also without any
>>>>> additional corruption.  The page verification routine decrypts the
>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>> corruption warning and halt the test run.
>>>>>
>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>> right before the encryption routine, the bug disappears (or, more
>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>> context from somewhere, and scribbling some kind of status data into
>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>
>>>>> As always, investigation continues...
>>>>
>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>> necessarily ppc memory ordering related, but just something in the arch
>>>> specific copy section.
>>>>
>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>> then, there's been an attempt to make this generic:
>>>>
>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>
>>>>    fork: Generalize PF_IO_WORKER handling
>>>>
>>>> and later a powerpc change related to that too:
>>>>
>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>
>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>
>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>> the finer details of powerpc thread creation, but maybe try with this
>>>> and see if it makes any difference.
>>>>
>>>> As you note in your reply, we could very well be corrupting some bytes
>>>> somewhere every time. We just only notice quickly when it happens to be
>>>> in that specific buffer.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>> index 392404688cec..d4dec2fd091c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>> kernel_clone_args *args)
>>>>
>>>> 	klp_init_thread_info(p);
>>>>
>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>> 		/* kernel thread */
>>>>
>>>> 		/* Create initial minimum stack frame. */
>>>
>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>> welcome while I continue to debug...
>>
>> I ponder if it's still in there... I don't see what else could be poking
>> and causing user memory corruption.
> 
> Indeed, I'm really scratching my head on this one.  The corruption is
> definitely present, and quite sensitive to exact timing around page
> encryption start -- presumably if the wild write happens after the
> encryption routine finishes, it no longer matters for this particular
> test suite.
> 
> Trying to find sensitive areas in the kernel, I hacked it up to always
> punt at least once per write -- no change in how often the corruption
> occurred.  I also hacked it up to try to keep I/O workers around vs.
> constantly tearing them down and respawning them, with no real change
> observed in corruption frequency, possibly because even with that we
> still end up creating a new I/O worker every so often.
> 
> What did have a major effect was hacking the kernel to both punt at
> least once per write *and* to aggressively exit I/O worker threads,
> indicating that something in thread setup or teardown is stomping on
> memory.  When I say "aggressively exit I/O worker threads", I
> basically did this:

It's been my suspicion, as per previous email from today, that this is
related to worker creation on ppc. You can try this patch, which just
pre-creates workers and doesn't let them time out. That means that the max
number of workers for bounded work is pre-created before the ring is
used, so we'll never see any worker creation. If this works, then it's
certainly something related to worker creation.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..e87cabe5bbb7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -673,7 +673,7 @@ static int io_wq_worker(void *data)
 			break;
 		}
 		if (!ret) {
-			last_timeout = true;
+			// last_timeout = true;
 			exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
 							wq->cpu_mask);
 		}
@@ -947,8 +947,8 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	do_create = !io_wq_activate_free_worker(wq, acct);
 	rcu_read_unlock();
 
-	if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
-	    !atomic_read(&acct->nr_running))) {
+	if (0 && (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+	    !atomic_read(&acct->nr_running)))) {
 		bool did_create;
 
 		did_create = io_wq_create_worker(wq, acct);
@@ -1138,6 +1138,33 @@ static int io_wq_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 	return 1;
 }
 
+static void pre_create_workers(struct io_wq *wq)
+{
+	struct io_wq_acct *acct = &wq->acct[IO_WQ_ACCT_BOUND];
+	int i, ret, to_create = acct->max_workers;
+
+	raw_spin_lock(&wq->lock);
+	acct->nr_workers = to_create;
+	atomic_add(to_create, &acct->nr_running);
+	atomic_add(to_create, &wq->worker_refs);
+	raw_spin_unlock(&wq->lock);
+
+	for (i = 0; i < acct->max_workers; i++) {
+		ret = create_io_worker(wq, IO_WQ_ACCT_BOUND);
+		if (WARN_ON_ONCE(!ret))
+			break;
+	}
+
+	if (i != to_create) {
+		to_create -= i;
+		raw_spin_lock(&wq->lock);
+		acct->nr_workers -= to_create;
+		atomic_sub(to_create, &acct->nr_running);
+		atomic_sub(to_create, &wq->worker_refs);
+		raw_spin_unlock(&wq->lock);
+	}
+}
+
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
 	int ret, i;
@@ -1187,6 +1214,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	if (ret)
 		goto err;
 
+	pre_create_workers(wq);
 	return wq;
 err:
 	io_wq_put_hash(data->hash);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 22:15                                                                                                 ` Jens Axboe
@ 2023-11-13 23:19                                                                                                   ` Timothy Pearson
  2023-11-13 23:48                                                                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-13 23:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 4:15:44 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>> "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>> the outer page checksum was still valid.
>>>>>>>
>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>> corruption observed on production systems.
>>>>>>
>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>> disappear.
>>>>>>
>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>> the io worker punting process.
>>>>>>
>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>> is written to disk without any additional corruption, and is then read
>>>>>> back out from disk into the page verification routine also without any
>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>> corruption warning and halt the test run.
>>>>>>
>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>
>>>>>> As always, investigation continues...
>>>>>
>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>> specific copy section.
>>>>>
>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>> then, there's been an attempt to make this generic:
>>>>>
>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>
>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>
>>>>> and later a powerpc change related to that too:
>>>>>
>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>
>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>
>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>> and see if it makes any difference.
>>>>>
>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>> in that specific buffer.
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>> --- a/arch/powerpc/kernel/process.c
>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>> kernel_clone_args *args)
>>>>>
>>>>> 	klp_init_thread_info(p);
>>>>>
>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>> 		/* kernel thread */
>>>>>
>>>>> 		/* Create initial minimum stack frame. */
>>>>
>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>> welcome while I continue to debug...
>>>
>>> I ponder if it's still in there... I don't see what else could be poking
>>> and causing user memory corruption.
>> 
>> Indeed, I'm really scratching my head on this one.  The corruption is
>> definitely present, and quite sensitive to exact timing around page
>> encryption start -- presumably if the wild write happens after the
>> encryption routine finishes, it no longer matters for this particular
>> test suite.
>> 
>> Trying to find sensitive areas in the kernel, I hacked it up to always
>> punt at least once per write -- no change in how often the corruption
>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>> constantly tearing them down and respawning them, with no real change
>> observed in corruption frequency, possibly because even with that we
>> still end up creating a new I/O worker every so often.
>> 
>> What did have a major effect was hacking the kernel to both punt at
>> least once per write *and* to aggressively exit I/O worker threads,
>> indicating that something in thread setup or teardown is stomping on
>> memory.  When I say "aggressively exit I/O worker threads", I
>> basically did this:
> 
> It's been my suspicion, as per previous email from today, that this is
> related to worker creation on ppc. You can try this patch, which just
> pre-creates workers and doesn't let them time out. That means that the max
> number of workers for bounded work is pre-created before the ring is
> used, so we'll never see any worker creation. If this works, then it's
> certainly something related to worker creation.

Yep, that makes the issue disappear.  I wish I knew if it was always stepping on memory somewhere and it just hits unimportant process memory most of the time, or if it's only stepping on memory iff the tight timing conditions are met.

Technically it could be either worker creation or worker destruction.  Any quick way to distinguish between the two?  E.g. create threads, allow them to stop processing by timing out, but never tear them down somehow?  Obviously we'd eventually exhaust the system thread resource limits, but for a quick test it might be enough?

I'm also a bit perplexed as to how we can be stomping on user memory like this without some kind of page fault occurring.  If we can isolate things to thread creation vs. thread teardown, I'll go function by function and see what is going wrong.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 23:19                                                                                                   ` Timothy Pearson
@ 2023-11-13 23:48                                                                                                     ` Jens Axboe
  2023-11-14  0:04                                                                                                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-13 23:48 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 4:19 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 4:15:44 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>> "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>> the outer page checksum was still valid.
>>>>>>>>
>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>> corruption observed on production systems.
>>>>>>>
>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>> disappear.
>>>>>>>
>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>> the io worker punting process.
>>>>>>>
>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>> back out from disk into the page verification routine also without any
>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>> corruption warning and halt the test run.
>>>>>>>
>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>
>>>>>>> As always, investigation continues...
>>>>>>
>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>> specific copy section.
>>>>>>
>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>> then, there's been an attempt to make this generic:
>>>>>>
>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>
>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>
>>>>>> and later a powerpc change related to that too:
>>>>>>
>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>
>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>
>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>> and see if it makes any difference.
>>>>>>
>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>> in that specific buffer.
>>>>>>
>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>> kernel_clone_args *args)
>>>>>>
>>>>>> 	klp_init_thread_info(p);
>>>>>>
>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>> 		/* kernel thread */
>>>>>>
>>>>>> 		/* Create initial minimum stack frame. */
>>>>>
>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>> welcome while I continue to debug...
>>>>
>>>> I ponder if it's still in there... I don't see what else could be poking
>>>> and causing user memory corruption.
>>>
>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>> definitely present, and quite sensitive to exact timing around page
>>> encryption start -- presumably if the wild write happens after the
>>> encryption routine finishes, it no longer matters for this particular
>>> test suite.
>>>
>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>> punt at least once per write -- no change in how often the corruption
>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>> constantly tearing them down and respawning them, with no real change
>>> observed in corruption frequency, possibly because even with that we
>>> still end up creating a new I/O worker every so often.
>>>
>>> What did have a major effect was hacking the kernel to both punt at
>>> least once per write *and* to aggressively exit I/O worker threads,
>>> indicating that something in thread setup or teardown is stomping on
>>> memory.  When I say "aggressively exit I/O worker threads", I
>>> basically did this:
>>
>> It's been my suspicion, as per previous email from today, that this is
>> related to worker creation on ppc. You can try this patch, which just
>> pre-creates workers and doesn't let them time out. That means that the max
>> number of workers for bounded work is pre-created before the ring is
>> used, so we'll never see any worker creation. If this works, then it's
>> certainly something related to worker creation.
> 
> Yep, that makes the issue disappear.  I wish I knew if it was
> always stepping on memory somewhere and it just hits unimportant
> process memory most of the time, or if it's only stepping on memory
> iff the tight timing conditions are met.
> 
> Technically it could be either worker creation or worker destruction.
> Any quick way to distinguish between the two?  E.g. create threads,
> allow them to stop processing by timing out, but never tear them down
> somehow?  Obviously we'd eventually exhaust the system thread resource
> limits, but for a quick test it might be enough?

Sure, we could certainly do that. Something like the below should do
that: it goes through the normal teardown on timeout, but doesn't
actually call do_exit() until the wq is being torn down anyway. That
should ensure that we create workers as we need them, but when they time
out, they won't actually exit until we are tearing down anyway on ring
exit.

> I'm also a bit perplexed as to how we can be stomping on user memory
> like this without some kind of page fault occurring.  If we can
> isolate things to thread creation vs. thread teardown, I'll go
> function by function and see what is going wrong.

Agree, it's very odd.

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..51a82daaac36 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <linux/mmu_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/io_uring.h>
 
 #include "io-wq.h"
@@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
 
 	kfree_rcu(worker, rcu);
 	io_worker_ref_put(wq);
+
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
+		msleep(500);
 	do_exit(0);
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-13 23:48                                                                                                     ` Jens Axboe
@ 2023-11-14  0:04                                                                                                       ` Timothy Pearson
  2023-11-14  0:13                                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  0:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 5:48:12 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>> "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>
>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>
>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>> corruption observed on production systems.
>>>>>>>>
>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>> disappear.
>>>>>>>>
>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>> the io worker punting process.
>>>>>>>>
>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>> corruption warning and halt the test run.
>>>>>>>>
>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>
>>>>>>>> As always, investigation continues...
>>>>>>>
>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>> specific copy section.
>>>>>>>
>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>
>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>
>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>
>>>>>>> and later a powerpc change related to that too:
>>>>>>>
>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>
>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>
>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>> and see if it makes any difference.
>>>>>>>
>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>> in that specific buffer.
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>> kernel_clone_args *args)
>>>>>>>
>>>>>>> 	klp_init_thread_info(p);
>>>>>>>
>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>> 		/* kernel thread */
>>>>>>>
>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>
>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>> welcome while I continue to debug...
>>>>>
>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>> and causing user memory corruption.
>>>>
>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>> definitely present, and quite sensitive to exact timing around page
>>>> encryption start -- presumably if the wild write happens after the
>>>> encryption routine finishes, it no longer matters for this particular
>>>> test suite.
>>>>
>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>> punt at least once per write -- no change in how often the corruption
>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>> constantly tearing them down and respawning them, with no real change
>>>> observed in corruption frequency, possibly because even with that we
>>>> still end up creating a new I/O worker every so often.
>>>>
>>>> What did have a major effect was hacking the kernel to both punt at
>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>> indicating that something in thread setup or teardown is stomping on
>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>> basically did this:
>>>
>>> It's been my suspicion, as per previous email from today, that this is
>>> related to worker creation on ppc. You can try this patch, which just
>>> pre-creates workers and doesn't let them time out. That means that the max
>>> number of workers for bounded work is pre-created before the ring is
>>> used, so we'll never see any worker creation. If this works, then it's
>>> certainly something related to worker creation.
>> 
>> Yep, that makes the issue disappear.  I wish I knew if it was
>> always stepping on memory somewhere and it just hits unimportant
>> process memory most of the time, or if it's only stepping on memory
>> iff the tight timing conditions are met.
>> 
>> Technically it could be either worker creation or worker destruction.
>> Any quick way to distinguish between the two?  E.g. create threads,
>> allow them to stop processing by timing out, but never tear them down
>> somehow?  Obviously we'd eventually exhaust the system thread resource
>> limits, but for a quick test it might be enough?
> 
> Sure, we could certainly do that. Something like the below should do
> that, it goes through the normal teardown on timeout, but doesn't
> actually call do_exit() until the wq is being torn down anyway. That
> should ensure that we create workers as we need them, but when they time
> out, they won't actually exit until we are tearing down anyway on ring
> exit.
> 
>> I'm also a bit perplexed as to how we can be stomping on user memory
>> like this without some kind of page fault occurring.  If we can
>> isolate things to thread creation vs. thread teardown, I'll go
>> function by function and see what is going wrong.
> 
> Agree, it's very odd.
> 
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 522196dfb0ff..51a82daaac36 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -16,6 +16,7 @@
> #include <linux/task_work.h>
> #include <linux/audit.h>
> #include <linux/mmu_context.h>
> +#include <linux/delay.h>
> #include <uapi/linux/io_uring.h>
> 
> #include "io-wq.h"
> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
> 
> 	kfree_rcu(worker, rcu);
> 	io_worker_ref_put(wq);
> +
> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
> +		msleep(500);
> 	do_exit(0);
> }

Thanks, was working up something similar but neglected the workqueue exit so got a hang.  With this patch, I still see the corruption, but all that's really telling me is that the core code inside do_exit() is OK (including, hopefully, the arch-specific stuff).  I'd really like to rule out the rest of the code in io_worker_exit(), is there a way to (easily) tell the workqueue to ignore a worker entirely without going through all the teardown in io_worker_exit() (and specifically the cancellation / release code)?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:04                                                                                                       ` Timothy Pearson
@ 2023-11-14  0:13                                                                                                         ` Jens Axboe
  2023-11-14  0:52                                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14  0:13 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 5:04 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 5:48:12 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>> <asml.silence@gmail.com>
>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>
>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>
>>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>
>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>
>>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>>> disappear.
>>>>>>>>>
>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>> the io worker punting process.
>>>>>>>>>
>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>
>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>
>>>>>>>>> As always, investigation continues...
>>>>>>>>
>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>> specific copy section.
>>>>>>>>
>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>
>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>
>>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>>
>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>
>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>
>>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>
>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>> and see if it makes any difference.
>>>>>>>>
>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>> in that specific buffer.
>>>>>>>>
>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>>> kernel_clone_args *args)
>>>>>>>>
>>>>>>>> 	klp_init_thread_info(p);
>>>>>>>>
>>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>> 		/* kernel thread */
>>>>>>>>
>>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>>
>>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>>> welcome while I continue to debug...
>>>>>>
>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>> and causing user memory corruption.
>>>>>
>>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>>> definitely present, and quite sensitive to exact timing around page
>>>>> encryption start -- presumably if the wild write happens after the
>>>>> encryption routine finishes, it no longer matters for this particular
>>>>> test suite.
>>>>>
>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>> punt at least once per write -- no change in how often the corruption
>>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>>> constantly tearing them down and respawning them, with no real change
>>>>> observed in corruption frequency, possibly because even with that we
>>>>> still end up creating a new I/O worker every so often.
>>>>>
>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>> indicating that something in thread setup or teardown is stomping on
>>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>>> basically did this:
>>>>
>>>> It's been my suspicion, as per previous email from today, that this is
>>>> related to worker creation on ppc. You can try this patch, which just
>>>> pre-creates workers and doesn't let them time out. That means that the max
>>>> number of workers for bounded work is pre-created before the ring is
>>>> used, so we'll never see any worker creation. If this works, then it's
>>>> certainly something related to worker creation.
>>>
>>> Yep, that makes the issue disappear.  I wish I knew if it was
>>> always stepping on memory somewhere and it just hits unimportant
>>> process memory most of the time, or if it's only stepping on memory
>>> iff the tight timing conditions are met.
>>>
>>> Technically it could be either worker creation or worker destruction.
>>> Any quick way to distinguish between the two?  E.g. create threads,
>>> allow them to stop processing by timing out, but never tear them down
>>> somehow?  Obviously we'd eventually exhaust the system thread resource
>>> limits, but for a quick test it might be enough?
>>
>> Sure, we could certainly do that. Something like the below should do
>> that, it goes through the normal teardown on timeout, but doesn't
>> actually call do_exit() until the wq is being torn down anyway. That
>> should ensure that we create workers as we need them, but when they time
>> out, they won't actually exit until we are tearing down anyway on ring
>> exit.
>>
>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>> like this without some kind of page fault occurring.  If we can
>>> isolate things to thread creation vs. thread teardown, I'll go
>>> function by function and see what is going wrong.
>>
>> Agree, it's very odd.
>>
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..51a82daaac36 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -16,6 +16,7 @@
>> #include <linux/task_work.h>
>> #include <linux/audit.h>
>> #include <linux/mmu_context.h>
>> +#include <linux/delay.h>
>> #include <uapi/linux/io_uring.h>
>>
>> #include "io-wq.h"
>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>
>> 	kfree_rcu(worker, rcu);
>> 	io_worker_ref_put(wq);
>> +
>> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>> +		msleep(500);
>> 	do_exit(0);
>> }
> 
> Thanks, was working up something similar but neglected the workqueue
> exit so got a hang.  With this patch, I still see the corruption, but
> all that's really telling me is that the core code inside do_exit() is
> OK (including, hopefully, the arch-specific stuff).  I'd really like
> to rule out the rest of the code in io_worker_exit(), is there a way
> to (easily) tell the workqueue to ignore a worker entirely without
> going through all the teardown in io_worker_exit() (and specifically
> the cancellation / release code)?

You could try this one - totally untested...


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..a72e5b6eb980 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <linux/mmu_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/io_uring.h>
 
 #include "io-wq.h"
@@ -193,6 +194,10 @@ static void io_worker_cancel_cb(struct io_worker *worker)
 	raw_spin_lock(&wq->lock);
 	acct->nr_workers--;
 	raw_spin_unlock(&wq->lock);
+
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
+		msleep(500);
+
 	io_worker_ref_put(wq);
 	clear_bit_unlock(0, &worker->create_state);
 	io_worker_release(worker);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:13                                                                                                         ` Jens Axboe
@ 2023-11-14  0:52                                                                                                           ` Timothy Pearson
  2023-11-14  5:06                                                                                                             ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  0:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:13:10 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 5:04 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 5:48:12 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>>
>>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>>
>>>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>>>> disappear.
>>>>>>>>>>
>>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>>> the io worker punting process.
>>>>>>>>>>
>>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>>
>>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>>
>>>>>>>>>> As always, investigation continues...
>>>>>>>>>
>>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>>> specific copy section.
>>>>>>>>>
>>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>>
>>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>>
>>>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>>>
>>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>>
>>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>>
>>>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>>
>>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>>> and see if it makes any difference.
>>>>>>>>>
>>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>>> in that specific buffer.
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>>>>>>> kernel_clone_args *args)
>>>>>>>>>
>>>>>>>>> 	klp_init_thread_info(p);
>>>>>>>>>
>>>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>>> 		/* kernel thread */
>>>>>>>>>
>>>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>>>
>>>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>>>> welcome while I continue to debug...
>>>>>>>
>>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>>> and causing user memory corruption.
>>>>>>
>>>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>>>> definitely present, and quite sensitive to exact timing around page
>>>>>> encryption start -- presumably if the wild write happens after the
>>>>>> encryption routine finishes, it no longer matters for this particular
>>>>>> test suite.
>>>>>>
>>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>>> punt at least once per write -- no change in how often the corruption
>>>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>>>> constantly tearing them down and respawning them, with no real change
>>>>>> observed in corruption frequency, possibly because even with that we
>>>>>> still end up creating a new I/O worker every so often.
>>>>>>
>>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>>> indicating that something in thread setup or teardown is stomping on
>>>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>>>> basically did this:
>>>>>
>>>>> It's been my suspicion, as per previous email from today, that this is
>>>>> related to worker creation on ppc. You can try this patch, which just
>>>>> pre-creates workers and doesn't let them time out. That means that the max
>>>>> number of workers for bounded work is pre-created before the ring is
>>>>> used, so we'll never see any worker creation. If this works, then it's
>>>>> certainly something related to worker creation.
>>>>
>>>> Yep, that makes the issue disappear.  I wish I knew if it was
>>>> always stepping on memory somewhere and it just hits unimportant
>>>> process memory most of the time, or if it's only stepping on memory
>>>> iff the tight timing conditions are met.
>>>>
>>>> Technically it could be either worker creation or worker destruction.
>>>> Any quick way to distinguish between the two?  E.g. create threads,
>>>> allow them to stop processing by timing out, but never tear them down
>>>> somehow?  Obviously we'd eventually exhaust the system thread resource
>>>> limits, but for a quick test it might be enough?
>>>
>>> Sure, we could certainly do that. Something like the below should do
>>> that, it goes through the normal teardown on timeout, but doesn't
>>> actually call do_exit() until the wq is being torn down anyway. That
>>> should ensure that we create workers as we need them, but when they time
>>> out, they won't actually exit until we are tearing down anyway on ring
>>> exit.
>>>
>>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>>> like this without some kind of page fault occurring.  If we can
>>>> isolate things to thread creation vs. thread teardown, I'll go
>>>> function by function and see what is going wrong.
>>>
>>> Agree, it's very odd.
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..51a82daaac36 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -16,6 +16,7 @@
>>> #include <linux/task_work.h>
>>> #include <linux/audit.h>
>>> #include <linux/mmu_context.h>
>>> +#include <linux/delay.h>
>>> #include <uapi/linux/io_uring.h>
>>>
>>> #include "io-wq.h"
>>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>>
>>> 	kfree_rcu(worker, rcu);
>>> 	io_worker_ref_put(wq);
>>> +
>>> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>>> +		msleep(500);
>>> 	do_exit(0);
>>> }
>> 
>> Thanks, was working up something similar but neglected the workqueue
>> exit so got a hang.  With this patch, I still see the corruption, but
>> all that's really telling me is that the core code inside do_exit() is
>> OK (including, hopefully, the arch-specific stuff).  I'd really like
>> to rule out the rest of the code in io_worker_exit(), is there a way
>> to (easily) tell the workqueue to ignore a worker entirely without
>> going through all the teardown in io_worker_exit() (and specifically
>> the cancellation / release code)?
> 
> You could try this one - totally untested...

Seemed to do what I wanted, however corruption remains.  Was worth a try...

I guess I'll proceed under the assumption that somehow the thread setup is stomping on userspace for now.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  0:52                                                                                                           ` Timothy Pearson
@ 2023-11-14  5:06                                                                                                             ` Timothy Pearson
  2023-11-14 13:17                                                                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14  5:06 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:52:51 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> Seemed to do what I wanted, however corruption remains.  Was worth a try...
> 
> I guess I'll proceed under the assumption that somehow the thread setup is
> stomping on userspace for now.

Finally found it!  Patch here:

https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u

We were missing the appropriate barrier instruction inside the IPI issued by the task_work_add() call, which was in turn issued during queued io worker creation.  The explanation for how userspace was getting corrupted boils down to an inconsistent view of main memory as seen by the two cores involved in the worker handoff and new worker creation.

Who doesn't love a one line fix after a week and a half of pulling one's hair out? ;)
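
For anyone who has not chased this class of bug before, here is a minimal, self-contained userspace sketch of the message-passing pattern described above.  It is purely illustrative -- the file name, compile line, and variable names are invented for the example, and it is not the kernel fix itself -- but it shows why missing ordering between "publish the data" and "signal the other CPU" can leave the observer with a stale view of memory on a weakly ordered machine like POWER9:

/* mp.c -- illustrative sketch only, not the io_uring patch.
 * Build with: gcc -O2 -pthread mp.c -o mp
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data;		/* the "work item" being published */
static atomic_int flag;		/* the "IPI" / wakeup signal */

static void *producer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&data, 42, memory_order_relaxed);
	/* The bug class under discussion: no barrier here.  Storing the
	 * flag with memory_order_release (or a full barrier) would force
	 * the data store to be visible before the "signal". */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
	return NULL;
}

static void *consumer(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&flag, memory_order_relaxed))
		;	/* spin until "signalled" */
	/* With relaxed ordering this may legally print 0 on POWER; with a
	 * release store paired with an acquire load it must print 42. */
	printf("data = %d\n",
	       atomic_load_explicit(&data, memory_order_relaxed));
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, consumer, NULL);
	pthread_create(&b, NULL, producer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}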

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14  5:06                                                                                                             ` Timothy Pearson
@ 2023-11-14 13:17                                                                                                               ` Jens Axboe
  2023-11-14 16:59                                                                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 13:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/13/23 10:06 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "Jens Axboe" <axboe@kernel.dk>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 6:52:51 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>
>> I guess I'll proceed under the assumption that somehow the thread setup is
>> stomping on userspace for now.
> 
> Finally found it!  Patch here:
> 
> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
> 
> We were missing the appropriate barrier instruction inside the IPI
> issued by the task_work_add() call, which was in turn issued during
> queued io worker creation.  The explanation for how userspace was
> getting corrupted boils down to an inconsistent view of main memory as
> seen by the two cores involved in the worker handoff and new worker
> creation.
> 
> Who doesn't love a one line fix after a week and a half of pulling
> one's hair out? ;)

Hate to be a debbie downer, but it still fails for me with that patch:

debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1 --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0 --repeat=500
[...]
encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
        Test ended at 2023-11-14 13:02:52

CURRENT_TEST: encryption.innodb_encryption
mysqltest: At line 11: query 'SET @start_global_value = @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193): Unknown system variable 'innodb_encryption_threads'

The result from queries just before the failure was:
SET @start_global_value = @@global.innodb_encryption_threads;

 - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
***Warnings generated in error logs during shutdown after running tests: encryption.innodb_encryption

2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity level experimental while the server is stable
2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './ibdata1' page [page id: space=0, page number=220]. You may have to recover from a backup.
2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error Page read from tablespace is corrupted.
2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.

after about 40 loops. How consistent is it on your end? I know I've
suspected the signaling before and tried the NO_IPI variant, but that
still hit it.

If it makes a difference for you, it must be close. But don't think it's
quite there yet, unfortunately :-(

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 13:17                                                                                                               ` Jens Axboe
@ 2023-11-14 16:59                                                                                                                 ` Timothy Pearson
  2023-11-14 17:04                                                                                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 16:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 7:17:22 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> To: "Jens Axboe" <axboe@kernel.dk>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>
>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>> stomping on userspace for now.
>> 
>> Finally found it!  Patch here:
>> 
>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>> 
>> We were missing the appropriate barrier instruction inside the IPI
>> issued by the task_work_add() call, which was in turn issued during
>> queued io worker creation.  The explanation for how userspace was
>> getting corrupted boils down to an inconsistent view of main memory as
>> seen by the two cores involved in the worker handoff and new worker
>> creation.
>> 
>> Who doesn't love a one line fix after a week and a half of pulling
>> one's hair out? ;)
> 
> Hate to be a debbie downer, but it still fails for me with that patch:
> 
> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
> --repeat=500
> [...]
> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>        Test ended at 2023-11-14 13:02:52
> 
> CURRENT_TEST: encryption.innodb_encryption
> mysqltest: At line 11: query 'SET @start_global_value =
> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
> Unknown system variable 'innodb_encryption_threads'
> 
> The result from queries just before the failure was:
> SET @start_global_value = @@global.innodb_encryption_threads;
> 
> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
> ***Warnings generated in error logs during shutdown after running tests:
> encryption.innodb_encryption
> 
> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
> level experimental while the server is stable
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
> may have to recover from a backup.
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
> Page read from tablespace is corrupted.
> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
> failed.
> 
> after about 40 loops. How consistent is it on your end? I know I've
> suspected the signaling before and tried the NO_IPI variant, but that
> still hit it.
> 
> If it makes a difference for you, it must be close. But don't think it's
> quite there yet, unfortunately :-(

Interesting!  I've run it for thousands of loops on my test VM, no failures.  Let me try on the shared test box.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 16:59                                                                                                                 ` Timothy Pearson
@ 2023-11-14 17:04                                                                                                                   ` Jens Axboe
  2023-11-14 17:14                                                                                                                     ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 17:04 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 9:59 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Tuesday, November 14, 2023 7:17:22 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>>
>>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>>> stomping on userspace for now.
>>>
>>> Finally found it!  Patch here:
>>>
>>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>>>
>>> We were missing the appropriate barrier instruction inside the IPI
>>> issued by the task_work_add() call, which was in turn issued during
>>> queued io worker creation.  The explanation for how userspace was
>>> getting corrupted boils down to an inconsistent view of main memory as
>>> seen by the two cores involved in the worker handoff and new worker
>>> creation.
>>>
>>> Who doesn't love a one line fix after a week and a half of pulling
>>> one's hair out? ;)
>>
>> Hate to be a debbie downer, but it still fails for me with that patch:
>>
>> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
>> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
>> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
>> --repeat=500
>> [...]
>> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
>> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>>        Test ended at 2023-11-14 13:02:52
>>
>> CURRENT_TEST: encryption.innodb_encryption
>> mysqltest: At line 11: query 'SET @start_global_value =
>> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
>> Unknown system variable 'innodb_encryption_threads'
>>
>> The result from queries just before the failure was:
>> SET @start_global_value = @@global.innodb_encryption_threads;
>>
>> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
>> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
>> ***Warnings generated in error logs during shutdown after running tests:
>> encryption.innodb_encryption
>>
>> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
>> level experimental while the server is stable
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
>> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
>> may have to recover from a backup.
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
>> Page read from tablespace is corrupted.
>> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
>> failed.
>>
>> after about 40 loops. How consistent is it on your end? I know I've
>> suspected the signaling before and tried the NO_IPI variant, but that
>> still hit it.
>>
>> If it makes a difference for you, it must be close. But don't think it's
>> quite there yet, unfortunately :-(
> 
> Interesting!  I've run it for thousands of loops on my test VM, no
> failures.  Let me try on the shared test box.

So odd. I guess it must be close if it's enough to make yours sane. The
other vm I've been using is just a small vm somewhere else, it has 8
threads of:

debian@linux-kernel--io-uring:~$ cat /proc/cpuinfo 
processor	: 0
cpu		: POWER9 (architected), altivec supported
clock		: 2200.000000MHz
revision	: 2.2 (pvr 004e 1202)
[...]
timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)
MMU		: Radix

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:04                                                                                                                   ` Jens Axboe
@ 2023-11-14 17:14                                                                                                                     ` Timothy Pearson
  2023-11-14 17:17                                                                                                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:04:03 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 9:59 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Tuesday, November 14, 2023 7:17:22 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 10:06 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> To: "Jens Axboe" <axboe@kernel.dk>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 6:52:51 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>> Seemed to do what I wanted, however corruption remains.  Was worth a try...
>>>>>
>>>>> I guess I'll proceed under the assumption that somehow the thread setup is
>>>>> stomping on userspace for now.
>>>>
>>>> Finally found it!  Patch here:
>>>>
>>>> https://lore.kernel.org/regressions/19221908.47168775.1699937769845.JavaMail.zimbra@raptorengineeringinc.com/T/#u
>>>>
>>>> We were missing the appropriate barrier instruction inside the IPI
>>>> issued by the task_work_add() call, which was in turn issued during
>>>> queued io worker creation.  The explanation for how userspace was
>>>> getting corrupted boils down to an inconsistent view of main memory as
>>>> seen by the two cores involved in the worker handoff and new worker
>>>> creation.
>>>>
>>>> Who doesn't love a one line fix after a week and a half of pulling
>>>> one's hair out? ;)
>>>
>>> Hate to be a debbie downer, but it still fails for me with that patch:
>>>
>>> debian@linux-kernel--io-uring:~/git/server/mysql-test$ ./mtr
>>> --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1
>>> --vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0
>>> --repeat=500
>>> [...]
>>> encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3101
>>> encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
>>>        Test ended at 2023-11-14 13:02:52
>>>
>>> CURRENT_TEST: encryption.innodb_encryption
>>> mysqltest: At line 11: query 'SET @start_global_value =
>>> @@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193):
>>> Unknown system variable 'innodb_encryption_threads'
>>>
>>> The result from queries just before the failure was:
>>> SET @start_global_value = @@global.innodb_encryption_threads;
>>>
>>> - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to
>>> '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
>>> ***Warnings generated in error logs during shutdown after running tests:
>>> encryption.innodb_encryption
>>>
>>> 2023-11-14 13:02:51 0 [Warning] Plugin 'example_key_management' is of maturity
>>> level experimental while the server is stable
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Database page corruption on disk or a
>>> failed read of file './ibdata1' page [page id: space=0, page number=220]. You
>>> may have to recover from a backup.
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: File './ibdata1' is corrupted
>>> 2023-11-14 13:02:51 0 [ERROR] InnoDB: Plugin initialization aborted with error
>>> Page read from tablespace is corrupted.
>>> 2023-11-14 13:02:52 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE
>>> failed.
>>>
>>> after about 40 loops. How consistent is it on your end? I know I've
>>> suspected the signaling before and tried the NO_IPI variant, but that
>>> still hit it.
>>>
>>> If it makes a difference for you, it must be close. But don't think it's
>>> quite there yet, unfortunately :-(
>> 
>> Interesting!  I've run it for thousands of loops on my test VM, no
>> failures.  Let me try on the shared test box.
> 
> So odd. I guess it must be close if it's enough to make yours sane. The
> other vm I've been using is just a small vm somewhere else, it has 8
> threads of:
> 
> debian@linux-kernel--io-uring:~$ cat /proc/cpuinfo
> processor	: 0
> cpu		: POWER9 (architected), altivec supported
> clock		: 2200.000000MHz
> revision	: 2.2 (pvr 004e 1202)
> [...]
> timebase	: 512000000
> platform	: pSeries
> model		: IBM pSeries (emulated by qemu)
> machine		: CHRP IBM pSeries (emulated by qemu)
> MMU		: Radix

At this point my concern is there might be two bugs, one related to the IPI and one related to the direct creation path.  Since I can't reproduce easily on my end, can you see if the box you have access to is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?
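
A one-off debug print would be enough to tell, e.g. something like this (illustrative only; do_create/wq/acct are placeholders for whatever the surrounding code in your tree actually uses):

	/* sketch only; condition and variable names are placeholders */
	if (do_create) {
		trace_printk("io_wq_enqueue: direct io_wq_create_worker()\n");
		io_wq_create_worker(wq, acct);
	}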

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:14                                                                                                                     ` Timothy Pearson
@ 2023-11-14 17:17                                                                                                                       ` Jens Axboe
  2023-11-14 17:21                                                                                                                         ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 17:17 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 10:14 AM, Timothy Pearson wrote:
> [...]
> At this point my concern is there might be two bugs, one related to
> the IPI and one related to the direct creation path.  Since I can't
> reproduce easily on my end, can you see if the box you have access to
> is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?

It's hitting it, and doing the worker pre-create like in the patch I
sent you makes the bug go away here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:17                                                                                                                       ` Jens Axboe
@ 2023-11-14 17:21                                                                                                                         ` Timothy Pearson
  2023-11-14 17:57                                                                                                                           ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:17:14 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 10:14 AM, Timothy Pearson wrote:
>> [...]
>> At this point my concern is there might be two bugs, one related to
>> the IPI and one related to the direct creation path.  Since I can't
>> reproduce easily on my end, can you see if the box you have access to
>> is ever hitting the io_wq_create_worker() call inside io_wq_enqueue()?
> 
> It's hitting it, and doing the worker pre-create like in the patch I
> sent you makes the bug go away here.

That's what I thought.  OK, let me chew on this for a bit.  I'm not hitting it at all; my worker creation only goes through the deferred path, which is why adding the barrier to the IPI fully resolved the problem on my end.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:21                                                                                                                         ` Timothy Pearson
@ 2023-11-14 17:57                                                                                                                           ` Timothy Pearson
  2023-11-14 18:02                                                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 17:57 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Jens Axboe, regressions, Pavel Begunkov



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 11:21:32 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> [...]
>> It's hitting it, and doing the worker pre-create like in the patch I
>> sent you makes the bug go away here.
> 
> That's what I thought.  OK, let me chew on this for a bit.  I'm not hitting it
> at all, my worker creation goes through the deferred path only hence why adding
> the barrier to the IPI fully resolved the problem on my end.

It's a stab in the dark, but does adding smp_mb() right before the io_wq_create_worker() call fix anything?
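
Concretely I mean something like this, right at the direct-create call site (sketch only; the arguments are placeholders, not an exact hunk against any particular tree):

	/* sketch only; argument names are placeholders */
	smp_mb();	/* full barrier before the new worker can observe prior stores */
	io_wq_create_worker(wq, acct);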

What mariadb version are you testing with?  I can't seem to get into the io_wq_enqueue() path at all.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 17:57                                                                                                                           ` Timothy Pearson
@ 2023-11-14 18:02                                                                                                                             ` Jens Axboe
  2023-11-14 18:12                                                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 18:02 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 10:57 AM, Timothy Pearson wrote:
> [...]
> 
> It's a stab in the dark, but does adding smp_mb() right before the
> io_wq_create_worker() call fix anything?

Sure, let's try it.

> What mariadb version are you testing with?  I can't seem to get into
> the io_wq_enqueue() path at all.

11.0.4-MariaDB here. You most certainly should be hitting that, unless
you're still using some of my previous patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:02                                                                                                                             ` Jens Axboe
@ 2023-11-14 18:12                                                                                                                               ` Timothy Pearson
  2023-11-14 18:26                                                                                                                                 ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-14 18:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Timothy Pearson, regressions, Pavel Begunkov



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Tuesday, November 14, 2023 12:02:56 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/14/23 10:57 AM, Timothy Pearson wrote:
>> [...]
>> What mariadb version are you testing with?  I can't seem to get into
>> the io_wq_enqueue() path at all.
> 
> 11.0.4-MariaDB here. You most certainly should be hitting that, unless
> you're still using some of my previous patches.

Let me check.  I'm pretty sure I had an old patch in-tree on the branch I was using to create the diff against.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:12                                                                                                                               ` Timothy Pearson
@ 2023-11-14 18:26                                                                                                                                 ` Jens Axboe
  2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-14 18:26 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov

On 11/14/23 11:12 AM, Timothy Pearson wrote:
> [...]
>>> It's a stab in the dark, but does adding smp_mb() right before the
>>> io_wq_create_worker() call fix anything?
>>
>> Sure, let's try it.

I tested this, still happens with that change added and the IPI lwsync
retained.

>>> What mariadb version are you testing with?  I can't seem to get into
>>> the io_wq_enqueue() path at all.
>>
>> 11.0.4-MariaDB here. You most certainly should be hitting that, unless
>> you're still using some of my previous patches.
> 
> Let me check.  I'm pretty sure I had an old patch in-tree on the
> branch I was using to create the diff against.

OK

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-14 18:26                                                                                                                                 ` Jens Axboe
@ 2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
  2023-11-15 16:46                                                                                                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 11:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

I haven't had much success in getting the IPI path to work properly, but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at least able to narrow down one of the areas that's going wrong.  Bear in mind that as soon as I re-enable IPI the corruption returns with a vengeance, so this is not the correct fix yet by any means -- I am currently soliciting feedback on what else might be going wrong at this point since I've already spent a couple of weeks on this and am not sure how much more time I can spend before we just have to shut io_uring down on ppc64 for the foreseeable future.
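
For reference, the experiment above is just forcing the no-IPI notify mode in the io-wq worker-create path, roughly like the following (field names approximate, not a real patch):

	/* was TWA_SIGNAL; approximate, for illustration only */
	task_work_add(wq->task, &worker->create_work, TWA_SIGNAL_NO_IPI);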

Whatever the root cause actually is, something is *very* sensitive to timing in both the worker thread creation path and the io_queue_sqe() / io_queue_async() paths.  I can make the corruption disappear by adding a udelay(1000) before io_queue_async() in the io_queue_sqe() function; however, no amount of memory barriers in the io_queue_async() path (including in the kbuf recycling code) will fully resolve the problem.
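
The hack in question is literally just this, at the tail of io_queue_sqe() (debug only, not a proposed fix; needs <linux/delay.h>, and the surrounding context is approximate):

	if (unlikely(ret)) {
		udelay(1000);	/* 1ms busy wait; corruption no longer reproduces */
		io_queue_async(req, ret);
	}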

Jens, would a small delay like that in io_queue_sqe() reduce the number of workers being created overall?  I know with some of the other delay locations worker allocation was changing; from what I see this one wouldn't seem to have much effect, but I'm still looking for a sanity check.  If we're needing to wait for a millisecond for some other thread to complete before moving on, that might be valuable information -- it would also potentially tie in to the IPI path still malfunctioning, as the worker would immediately start executing.

On a related note, how is inter-thread safety of the io_kiocb buffer list guaranteed, especially on weak memory model systems?  As I understand it, different workers running on different cores could potentially be interacting with the same kiocb request and the same buffer list, and that does dovetail with the fact that punting to a different I/O worker (usually on another core) seems to provoke the problem.  I tried adding memory barriers to some of the basic recycle functions without too much success -- it seemed to help somewhat, but nowhere near complete resolution, and the buffers are used in a number of other places I didn't even try to poke at.  I wanted to get some feedback on this concept before going down yet another rabbit hole...

Thoughts very much welcome, I don't have many hairs left to pull out! ;)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
@ 2023-11-15 16:46                                                                                                                                     ` Jens Axboe
  2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 16:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 4:03 AM, Timothy Pearson wrote:
> I haven't had much success in getting the IPI path to work properly,
> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
> least able to narrow down one of the areas that's going wrong.  Bear
> in mind that as soon as I reenable IPI the corruption returns with a
> vengeance, so this is not the correct fix yet by any means -- I am
> currently soliciting feedback on what else might be going wrong at
> this point since I've already spent a couple of weeks on this and am
> not sure how much more time I can spend before we just have to shut
> io_uring down on ppc64 for the forseeable future.
> 
> Whatever the root cause actually is, something is *very* sensitive to
> timing in both the worker thread creation path and the io_queue_sqe()
> / io_queue_async() paths.  I can make the corruption disappear by
> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
> function, however no amount of memory barriers in the io_queue_async()
> path (including in the kbuf recycling code) will fully resolve the
> problem.
> 
> Jens, would a small delay like that in io_queue_sqe() reduce the
> amount of workers being created overall?  I know with some of the
> other delay locations worker allocation was changing, from what I see
> this one wouldn't seem to have much effect, but I'm still looking for
> a sanity check.  If we're needing to wait for a millisecond for some
> other thread to complete before moving on that might be valuable
> information -- would also potentially tie in to the IPI path still
> malfunctioning as the worker would immediately start executing.

If io_queue_sqe() ultimately ends up punting to io-wq for this request,
then yes, doing a 1ms delay in there would ultimately mean a 1ms
delay before we either pass to an existing worker or create a new one.

> On a related note, how is inter-thread safety of the io_kiocb buffer
> list guaranteed, especially on weak memory model systems?  As I
> understand it, different workers running on different cores could
> potentially be interacting with the same kiocb request and the same
> buffer list, and that does dovetail with the fact that punting to a
> different I/O worker (usually on another core) seems to provoke the
> problem.  I tried adding memory barriers to some of the basic recycle
> functions without too much success -- it seemed to help somewhat, but
> nowhere near complete resolution, and the buffers are used in a number
> of other places I didn't even try to poke at.  I wanted to get some
> feedback on this concept before going down yet another rabbit hole...

This relies on the fact that we grab the wq lock before inserting this
work, and the unlocking will be a barrier. It's important to note that
this isn't any different from before io-wq was using native
workers; the only difference is that it used to be kthreads before, and
now it's native threads to the application. The kthreads did a bunch of
work to assume the necessary identity to do the read or write operation
(which is ultimately why that approach went away, as it was just
inherently unsafe), whereas the native threads do not as they already
have what they need.
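
In other words, the ordering comes from the queueing itself; schematically (lock and field names approximate):

	raw_spin_lock(&acct->lock);
	wq_list_add_tail(&work->list, &acct->work_list);
	raw_spin_unlock(&acct->lock);	/* release; pairs with the worker taking the lock */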

I had a patch that just punted to a kthread and did the necessary
kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
that point. Within the existing code...
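
The rough shape of that experiment was basically (not the actual patch, just to illustrate):

	kthread_use_mm(mm);		/* mm of the submitting task */
	ret = do_the_op(req);		/* placeholder for the actual read/write */
	kthread_unuse_mm(mm);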

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 16:46                                                                                                                                     ` Jens Axboe
@ 2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Timothy Pearson, regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 10:46:58 AM
> Subject: Re: Regression in io_uring, leading to data corruption

> [...]
> I had a patch that just punted to a kthread and did the necessary
> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
> that point. Within the existing code...

Would you happen to have that patch still?  It would provide a possible starting point for figuring out the exact difference.  If not I guess I could hack something similar up.

Thanks!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
@ 2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 18:30 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 10:03 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>> I haven't had much success in getting the IPI path to work properly,
>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>> least able to narrow down one of the areas that's going wrong.  Bear
>>> in mind that as soon as I reenable IPI the corruption returns with a
>>> vengeance, so this is not the correct fix yet by any means -- I am
>>> currently soliciting feedback on what else might be going wrong at
>>> this point since I've already spent a couple of weeks on this and am
>>> not sure how much more time I can spend before we just have to shut
>>> io_uring down on ppc64 for the forseeable future.
>>>
>>> Whatever the root cause actually is, something is *very* sensitive to
>>> timing in both the worker thread creation path and the io_queue_sqe()
>>> / io_queue_async() paths.  I can make the corruption disappear by
>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>> function, however no amount of memory barriers in the io_queue_async()
>>> path (including in the kbuf recycling code) will fully resolve the
>>> problem.
>>>
>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>> amount of workers being created overall?  I know with some of the
>>> other delay locations worker allocation was changing, from what I see
>>> this one wouldn't seem to have much effect, but I'm still looking for
>>> a sanity check.  If we're needing to wait for a millisecond for some
>>> other thread to complete before moving on that might be valuable
>>> information -- would also potentially tie in to the IPI path still
>>> malfunctioning as the worker would immediately start executing.
>>
>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>> delay before we either pass to an existing worker or create a new one.
>>
>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>> list guaranteed, especially on weak memory model systems?  As I
>>> understand it, different workers running on different cores could
>>> potentially be interacting with the same kiocb request and the same
>>> buffer list, and that does dovetail with the fact that punting to a
>>> different I/O worker (usually on another core) seems to provoke the
>>> problem.  I tried adding memory barriers to some of the basic recycle
>>> functions without too much success -- it seemed to help somewhat, but
>>> nowhere near complete resolution, and the buffers are used in a number
>>> of other places I didn't even try to poke at.  I wanted to get some
>>> feedback on this concept before going down yet another rabbit hole...
>>
>> This relies on the fact that we grab the wq lock before inserting this
>> work, and the unlocking will be a barrier. It's important to note that
>> this isn't any different than from before io-wq was using native
>> workers, the only difference is that it used to be kthreads before, and
>> now it's native threads to the application. The kthreads did a bunch of
>> work to assume the necessary identity to do the read or write operation
>> (which is ultimately why that approach went away, as it was just
>> inherently unsafe), whereas the native threads do not as they already
>> have what they need.
>>
>> I had a patch that just punted to a kthread and did the necessary
>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>> that point. Within the existing code...
> 
> Would you happen to have that patch still?  It would provide a
> possible starting point for figuring out the exact difference.  If not
> I guess I could hack something similar up.

Let me see if I can find it, and make sure it applies on the current
tree. I'll send you one in a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
@ 2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
  2023-11-15 18:37                                                                                                                                             ` Jens Axboe
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 18:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 12:30:15 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>> I haven't had much success in getting the IPI path to work properly,
>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>> currently soliciting feedback on what else might be going wrong at
>>>> this point since I've already spent a couple of weeks on this and am
>>>> not sure how much more time I can spend before we just have to shut
>>>> io_uring down on ppc64 for the forseeable future.
>>>>
>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>> function, however no amount of memory barriers in the io_queue_async()
>>>> path (including in the kbuf recycling code) will fully resolve the
>>>> problem.
>>>>
>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>> amount of workers being created overall?  I know with some of the
>>>> other delay locations worker allocation was changing, from what I see
>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>> other thread to complete before moving on that might be valuable
>>>> information -- would also potentially tie in to the IPI path still
>>>> malfunctioning as the worker would immediately start executing.
>>>
>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>> delay before we either pass to an existing worker or create a new one.
>>>
>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>> list guaranteed, especially on weak memory model systems?  As I
>>>> understand it, different workers running on different cores could
>>>> potentially be interacting with the same kiocb request and the same
>>>> buffer list, and that does dovetail with the fact that punting to a
>>>> different I/O worker (usually on another core) seems to provoke the
>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>> functions without too much success -- it seemed to help somewhat, but
>>>> nowhere near complete resolution, and the buffers are used in a number
>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>> feedback on this concept before going down yet another rabbit hole...
>>>
>>> This relies on the fact that we grab the wq lock before inserting this
>>> work, and the unlocking will be a barrier. It's important to note that
>>> this isn't any different than from before io-wq was using native
>>> workers, the only difference is that it used to be kthreads before, and
>>> now it's native threads to the application. The kthreads did a bunch of
>>> work to assume the necessary identity to do the read or write operation
>>> (which is ultimately why that approach went away, as it was just
>>> inherently unsafe), whereas the native threads do not as they already
>>> have what they need.
>>>
>>> I had a patch that just punted to a kthread and did the necessary
>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>> that point. Within the existing code...
>> 
>> Would you happen to have that patch still?  It would provide a
>> possible starting point for figuring out the exact difference.  If not
>> I guess I could hack something similar up.
> 
> Let me see if I can find it, and make sure it applies on the current
> tree. I'll send you one in a bit.

Much appreciated.

New question -- should a user of liburing be able to oops the kernel under any circumstances (userspace access / NULL pointer dereference)?  I've started putting together a torture test application and hit ... something ... almost right away.  If this isn't supposed to happen, I'll send more details off-list for a double check.
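
The test itself isn't included here; as a rough sketch of the general pattern only (the file name, queue depth, buffer size and iteration count below are arbitrary placeholders, not the actual application), a tight liburing write submit/complete loop looks something like this:

/* Minimal illustrative liburing write loop, NOT the actual torture test.
 * File name, queue depth, buffer size and iteration count are placeholders
 * chosen purely for the sketch. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	char buf[4096];
	int fd, i;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;
	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	memset(buf, 0xaa, sizeof(buf));

	for (i = 0; i < 10000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		if (!sqe)
			break;
		io_uring_prep_write(sqe, fd, buf, sizeof(buf),
				    (unsigned long long)i * sizeof(buf));
		io_uring_submit(&ring);
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->res < 0)
			fprintf(stderr, "write failed: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	close(fd);
	io_uring_queue_exit(&ring);
	return 0;
}

(Builds with gcc -luring.)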

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
@ 2023-11-15 18:37                                                                                                                                             ` Jens Axboe
  2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 18:37 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 11:35 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 12:30:15 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>> currently soliciting feedback on what else might be going wrong at
>>>>> this point since I've already spent a couple of weeks on this and am
>>>>> not sure how much more time I can spend before we just have to shut
>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>
>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>> problem.
>>>>>
>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>> amount of workers being created overall?  I know with some of the
>>>>> other delay locations worker allocation was changing, from what I see
>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>> other thread to complete before moving on that might be valuable
>>>>> information -- would also potentially tie in to the IPI path still
>>>>> malfunctioning as the worker would immediately start executing.
>>>>
>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>> delay before we either pass to an existing worker or create a new one.
>>>>
>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>> understand it, different workers running on different cores could
>>>>> potentially be interacting with the same kiocb request and the same
>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>
>>>> This relies on the fact that we grab the wq lock before inserting this
>>>> work, and the unlocking will be a barrier. It's important to note that
>>>> this isn't any different than from before io-wq was using native
>>>> workers, the only difference is that it used to be kthreads before, and
>>>> now it's native threads to the application. The kthreads did a bunch of
>>>> work to assume the necessary identity to do the read or write operation
>>>> (which is ultimately why that approach went away, as it was just
>>>> inherently unsafe), whereas the native threads do not as they already
>>>> have what they need.
>>>>
>>>> I had a patch that just punted to a kthread and did the necessary
>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>> that point. Within the existing code...
>>>
>>> Would you happen to have that patch still?  It would provide a
>>> possible starting point for figuring out the exact difference.  If not
>>> I guess I could hack something similar up.
>>
>> Let me see if I can find it, and make sure it applies on the current
>> tree. I'll send you one in a bit.
> 
> Much appreciated.
> 
> New question -- should a user of liburing be able to oops the kernel
> under any circumstances (userspace access / NULL pointer dereference)?
> I've started putting together a torture test application and hit ...
> something ... almost right away .  If this isn't supposed to happen
> I'll send more details off-list for a double check.

No, certainly not. Please do send me the details. I'm assuming this is
on an unmodified kernel; some of the patches that have been flung around
in this thread are definitely not generally sane, and it's only knowing
the context in which they're being used that makes them fine for test
purposes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:37                                                                                                                                             ` Jens Axboe
@ 2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-15 18:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 12:37:31 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 11:35 AM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 12:30:15 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>
>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>> problem.
>>>>>>
>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>> amount of workers being created overall?  I know with some of the
>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>> other thread to complete before moving on that might be valuable
>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>
>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>
>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>> understand it, different workers running on different cores could
>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>
>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>> this isn't any different than from before io-wq was using native
>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>> work to assume the necessary identity to do the read or write operation
>>>>> (which is ultimately why that approach went away, as it was just
>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>> have what they need.
>>>>>
>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>> that point. Within the existing code...
>>>>
>>>> Would you happen to have that patch still?  It would provide a
>>>> possible starting point for figuring out the exact difference.  If not
>>>> I guess I could hack something similar up.
>>>
>>> Let me see if I can find it, and make sure it applies on the current
>>> tree. I'll send you one in a bit.
>> 
>> Much appreciated.
>> 
>> New question -- should a user of liburing be able to oops the kernel
>> under any circumstances (userspace access / NULL pointer dereference)?
>> I've started putting together a torture test application and hit ...
>> something ... almost right away .  If this isn't supposed to happen
>> I'll send more details off-list for a double check.
> 
> No, certainly not. Please do send me the details. I'm assuming this is
> on an unmodified kernel, some of the patches that have been flung around
> in this thread are definitely not generally sane, it's just that knowing
> the context of what is being used makes that fine for test purposes.

I need to verify the kernel is in fact functionally unmodified; I've been caught out by my multiple test trees before.  I just wanted to confirm that an oops shouldn't be possible before I check the kernel and try to put together a more reliable reproducer.

The location is exactly where I'd expect a problem with thread setup, and it's timing-dependent... sound familiar? ;)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 18:30                                                                                                                                         ` Jens Axboe
  2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
@ 2023-11-15 19:00                                                                                                                                           ` Jens Axboe
  2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-15 19:00 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 11:30 AM, Jens Axboe wrote:
> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>> I haven't had much success in getting the IPI path to work properly,
>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>> currently soliciting feedback on what else might be going wrong at
>>>> this point since I've already spent a couple of weeks on this and am
>>>> not sure how much more time I can spend before we just have to shut
>>>> io_uring down on ppc64 for the forseeable future.
>>>>
>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>> function, however no amount of memory barriers in the io_queue_async()
>>>> path (including in the kbuf recycling code) will fully resolve the
>>>> problem.
>>>>
>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>> amount of workers being created overall?  I know with some of the
>>>> other delay locations worker allocation was changing, from what I see
>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>> other thread to complete before moving on that might be valuable
>>>> information -- would also potentially tie in to the IPI path still
>>>> malfunctioning as the worker would immediately start executing.
>>>
>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>> delay before we either pass to an existing worker or create a new one.
>>>
>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>> list guaranteed, especially on weak memory model systems?  As I
>>>> understand it, different workers running on different cores could
>>>> potentially be interacting with the same kiocb request and the same
>>>> buffer list, and that does dovetail with the fact that punting to a
>>>> different I/O worker (usually on another core) seems to provoke the
>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>> functions without too much success -- it seemed to help somewhat, but
>>>> nowhere near complete resolution, and the buffers are used in a number
>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>> feedback on this concept before going down yet another rabbit hole...
>>>
>>> This relies on the fact that we grab the wq lock before inserting this
>>> work, and the unlocking will be a barrier. It's important to note that
>>> this isn't any different than from before io-wq was using native
>>> workers, the only difference is that it used to be kthreads before, and
>>> now it's native threads to the application. The kthreads did a bunch of
>>> work to assume the necessary identity to do the read or write operation
>>> (which is ultimately why that approach went away, as it was just
>>> inherently unsafe), whereas the native threads do not as they already
>>> have what they need.
>>>
>>> I had a patch that just punted to a kthread and did the necessary
>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>> that point. Within the existing code...
>>
>> Would you happen to have that patch still?  It would provide a
>> possible starting point for figuring out the exact difference.  If not
>> I guess I could hack something similar up.
> 
> Let me see if I can find it, and make sure it applies on the current
> tree. I'll send you one in a bit.

Wrote a new one. This one has two different ways it can work:

1) By default, it uses the native io workers still, but rather than add
it to a list of pending items, it creates a new worker for each work
item. This means all writes that would've gone to io-wq will now just
fork a native worker and perform the write in a blocking fashion.

2) The fallback path for the above is that we punt it to a kthread,
which does the mm dance. This is similar to what we did before the
native workers. The fallback path is only hit if the worker creation
fails, but you can make it happen every time by just uncommenting that
return 1 in io_rewrite_io_thread().

The interesting thing about approach 1 is that while it still uses the
native workers, it will not need to be processing task_work and hence
signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
forks off a new worker every time, which does the work, then exits.

First try it as-is and see if that reproduces the issue. If it does,
then try uncommenting that return 1 mentioned in #2 above.


diff --git a/io_uring/rw.c b/io_uring/rw.c
index 64390d4e20c1..77e408bdb169 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -968,6 +968,8 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static int io_rewrite_queue(struct io_kiocb *req);
+
 int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -1071,7 +1073,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return ret ? ret : -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 done:
 		ret = kiocb_done(req, ret2, issue_flags);
@@ -1082,7 +1086,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (!ret) {
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 		return ret;
 	}
@@ -1092,6 +1098,79 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+struct koffload {
+	struct work_struct work;
+	struct io_kiocb *req;
+	struct mm_struct *mm;
+};
+
+static void io_rewrite(struct work_struct *work)
+{
+	struct koffload *k = container_of(work, struct koffload, work);
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	kthread_use_mm(k->mm);
+	ret = io_write(k->req, issue_flags);
+	kthread_unuse_mm(k->mm);
+	mmput(k->mm);
+
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(k->req, issue_flags);
+	kfree(k);
+}
+
+static int io_write_io_thread(void *data)
+{
+	struct io_kiocb *req = data;
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	ret = io_write(req, issue_flags);
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(req, issue_flags);
+
+	do_exit(0);
+}
+
+static int io_rewrite_io_thread(struct io_kiocb *req)
+{
+	struct task_struct *tsk;
+
+	/*
+	 * Uncomment this one to ALWAYS punt to a kthread
+	 */
+	// return 1;
+
+	tsk = create_io_thread(io_write_io_thread, req, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wake_up_new_task(tsk);
+		return 0;
+	}
+
+	printk("%s: err=%ld\n", __FUNCTION__, PTR_ERR(tsk));
+	return 1;
+}
+
+static int io_rewrite_queue(struct io_kiocb *req)
+{
+	struct koffload *k;
+
+	if (!io_rewrite_io_thread(req))
+		return 0;
+
+	k = kmalloc(sizeof(*k), GFP_NOIO);
+	if (!k)
+		return 1;
+
+	INIT_WORK(&k->work, io_rewrite);
+	k->req = req;
+	mmget(current->mm);
+	k->mm = current->mm;
+	queue_work(system_wq, &k->work);
+	return 0;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-15 19:00                                                                                                                                           ` Jens Axboe
@ 2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
  2023-11-16  3:46                                                                                                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-16  3:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 1:00:53 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 11:30 AM, Jens Axboe wrote:
>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>> currently soliciting feedback on what else might be going wrong at
>>>>> this point since I've already spent a couple of weeks on this and am
>>>>> not sure how much more time I can spend before we just have to shut
>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>
>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>> problem.
>>>>>
>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>> amount of workers being created overall?  I know with some of the
>>>>> other delay locations worker allocation was changing, from what I see
>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>> other thread to complete before moving on that might be valuable
>>>>> information -- would also potentially tie in to the IPI path still
>>>>> malfunctioning as the worker would immediately start executing.
>>>>
>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>> delay before we either pass to an existing worker or create a new one.
>>>>
>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>> understand it, different workers running on different cores could
>>>>> potentially be interacting with the same kiocb request and the same
>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>
>>>> This relies on the fact that we grab the wq lock before inserting this
>>>> work, and the unlocking will be a barrier. It's important to note that
>>>> this isn't any different than from before io-wq was using native
>>>> workers, the only difference is that it used to be kthreads before, and
>>>> now it's native threads to the application. The kthreads did a bunch of
>>>> work to assume the necessary identity to do the read or write operation
>>>> (which is ultimately why that approach went away, as it was just
>>>> inherently unsafe), whereas the native threads do not as they already
>>>> have what they need.
>>>>
>>>> I had a patch that just punted to a kthread and did the necessary
>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>> that point. Within the existing code...
>>>
>>> Would you happen to have that patch still?  It would provide a
>>> possible starting point for figuring out the exact difference.  If not
>>> I guess I could hack something similar up.
>> 
>> Let me see if I can find it, and make sure it applies on the current
>> tree. I'll send you one in a bit.
> 
> Wrote a new one. This one has two different ways it can work:
> 
> 1) By default, it uses the native io workers still, but rather than add
> it to a list of pending items, it creates a new worker for each work
> item. This means all writes that would've gone to io-wq will now just
> fork a native worker and perform the write in a blocking fashion.
> 
> 2) The fallback path for the above is that we punt it to a kthread,
> which does the mm dance. This is similar to what we did before the
> native workers. The fallback path is only hit if the worker creation
> fails, but you can make it happen every time by just uncommenting that
> return 1 in io_rewrite_io_thread().
> 
> The interesting thing about approach 1 is that while it still uses the
> native workers, it will not need to be processing task_work and hence
> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
> forks off a new worker every time, which does the work, then exits.
> 
> First try it as-is and see if that reproduces the issue. If it does,
> then try uncommenting that return 1 mentioned in #2 above.

OK, so those two test cases worked, *but* I would have expected that, since when we go through the system workqueue (with its associated delays), it is functionally the same as inserting the udelay(1000) at the start of io_write().  I suspect the root issue predates the move to worker-managed I/O threads and that it was exposed simply by (inadvertently) shortening the delay between the initial io_write() call and the subsequent thread start.  Then, when the code started issuing IPIs to kick thread start even faster, we ended up racing with this other mysterious process almost every time instead of somewhat rarely, basically compounding the problem and making analysis next to impossible until the IPI was skipped via the TWA_SIGNAL_NO_IPI flag.

I'm unsure if this is related to the oops I sent off-list, but I suspect it very well might be, given that a delay in the same location works around the corruption.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
@ 2023-11-16  3:46                                                                                                                                               ` Jens Axboe
  2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2023-11-16  3:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin

On 11/15/23 8:28 PM, Timothy Pearson wrote:
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>> Sent: Wednesday, November 15, 2023 1:00:53 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/15/23 11:30 AM, Jens Axboe wrote:
>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>
>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>> problem.
>>>>>>
>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>> amount of workers being created overall?  I know with some of the
>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>> other thread to complete before moving on that might be valuable
>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>
>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>
>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>> understand it, different workers running on different cores could
>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>
>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>> this isn't any different than from before io-wq was using native
>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>> work to assume the necessary identity to do the read or write operation
>>>>> (which is ultimately why that approach went away, as it was just
>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>> have what they need.
>>>>>
>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>> that point. Within the existing code...
>>>>
>>>> Would you happen to have that patch still?  It would provide a
>>>> possible starting point for figuring out the exact difference.  If not
>>>> I guess I could hack something similar up.
>>>
>>> Let me see if I can find it, and make sure it applies on the current
>>> tree. I'll send you one in a bit.
>>
>> Wrote a new one. This one has two different ways it can work:
>>
>> 1) By default, it uses the native io workers still, but rather than add
>> it to a list of pending items, it creates a new worker for each work
>> item. This means all writes that would've gone to io-wq will now just
>> fork a native worker and perform the write in a blocking fashion.
>>
>> 2) The fallback path for the above is that we punt it to a kthread,
>> which does the mm dance. This is similar to what we did before the
>> native workers. The fallback path is only hit if the worker creation
>> fails, but you can make it happen every time by just uncommenting that
>> return 1 in io_rewrite_io_thread().
>>
>> The interesting thing about approach 1 is that while it still uses the
>> native workers, it will not need to be processing task_work and hence
>> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
>> forks off a new worker every time, which does the work, then exits.
>>
>> First try it as-is and see if that reproduces the issue. If it does,
>> then try uncommenting that return 1 mentioned in #2 above.
> 
> OK, so those two test cases worked, *but* I would have expected that,
> since when we go through the system workqueue (with its associated
> delays) it is functionally the same as inserting the udelay(1000) at

No, we don't! For case 1, we use the exact same mechanism as the stock
kernel; the only difference is that we'll always hand it to a new
worker. Will that slow down some writes? Certainly, because before that
patch we'd potentially hand it over to an existing worker immediately.
E.g. as soon as we did the spin_unlock() in io_wq_enqueue(), a new
worker could grab that same lock and start processing the work. We're
not talking a 1ms delay here, it's way shorter than that. This again
suggests it might be an ordering or barrier issue, but at the same time,
I've run with smp_mb() before and after insertion AND on retrieval of
the work item on the other side, and it triggers the issue even then.

On top of that, if we pre-create the workers, then we've already
established that the issue does not occur. With pre-created workers,
there's no extra delay between handing off the write and issuing it,
like we have with the test patch. In fact, it works the _exact_ same way
that the stock kernel does, except you don't have workers exiting or
being created. To me, this tells me that it cannot be a memory ordering
issue. If it was, we'd 100% see it for that case too, as we have all the
same handoff and execution as we did before. The only difference is that
we don't have an IPI for worker creation, and we don't have workers
exiting when they time out.
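
As a userspace analogy of the locking argument above (this is not the
io-wq code; the queue, names and structure here are invented purely for
illustration), the point is that a lock/unlock pair around insertion
already orders the payload stores for whichever thread later takes the
lock and removes the item, so sprinkling extra barriers on top would not
be expected to change the outcome:

/* Userspace analogy only, not io-wq. Shows that a mutex-protected insert
 * already publishes the work item's contents to the thread that later
 * takes the lock and removes it. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_ITEMS 1000

struct work_item {
	struct work_item *next;
	int payload;			/* written before the locked insert */
};

static struct work_item *head;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NR_ITEMS; i++) {
		struct work_item *w = malloc(sizeof(*w));

		w->payload = i;			/* plain store, no barrier */
		pthread_mutex_lock(&lock);	/* acquire */
		w->next = head;
		head = w;
		pthread_mutex_unlock(&lock);	/* release: publishes payload too */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int consumed = 0;

	pthread_create(&tid, NULL, producer, NULL);

	while (consumed < NR_ITEMS) {
		struct work_item *w;

		pthread_mutex_lock(&lock);
		w = head;
		if (w)
			head = w->next;
		pthread_mutex_unlock(&lock);
		if (!w)
			continue;
		/* The lock pairing guarantees the payload store is visible here. */
		if (w->payload < 0 || w->payload >= NR_ITEMS)
			fprintf(stderr, "bogus payload %d\n", w->payload);
		free(w);
		consumed++;
	}
	pthread_join(tid, NULL);
	printf("consumed %d items\n", consumed);
	return 0;
}

(Compile with gcc -pthread. The mutex release on the producer side pairs
with the acquire on the consumer side, so the consumer always observes
the payload stores that preceded the insert, which is the same property
the wq lock provides for io-wq work items.)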

> the start of io_write().  I suspect the root issue predates the move
> to worker-managed I/O threads and that it was exposed simply by
> (inadvertently) shortening the delay between the initial io_write()
> call and the subsequent thread start.  Then, when the code started

Maybe? So far very little is known for certain about this issue,
unfortunately, other than that it seems to be some very weird arch
interaction.

> issuing IPIs to kick thread start even faster we ended up racing with
> this other mysterious process almost every time instead of somewhat
> rarely, basically compounding the problem and making analysis next to
> impossible until the IPI was skipped via the TWA_SIGNAL_NO_IPI flag.
> 
> I'm unsure if this is related to the oops I sent off-list, but I
> suspect it very well might be given a delay in the same location works
> around the corruption.

I don't think it's related to that at all; that's a worker creation vs
shutdown issue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:46                                                                                                                                               ` Jens Axboe
@ 2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
  2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
  0 siblings, 1 reply; 95+ messages in thread
From: Timothy Pearson @ 2023-11-16  3:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 9:46:01 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 8:28 PM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 1:00:53 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 11:30 AM, Jens Axboe wrote:
>>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>>> io_uring down on ppc64 for the forseeable future.
>>>>>>>
>>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>>> problem.
>>>>>>>
>>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>>> amount of workers being created overall?  I know with some of the
>>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>>> other thread to complete before moving on that might be valuable
>>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>>
>>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>>> then yes doing a 1ms delay in there would ultimately then need to a 1ms
>>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>>
>>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>>> understand it, different workers running on different cores could
>>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>>
>>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>>> this isn't any different than from before io-wq was using native
>>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>>> work to assume the necessary identity to do the read or write operation
>>>>>> (which is ultimately why that approach went away, as it was just
>>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>>> have what they need.
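
As a rough sketch of the ordering argument above, the hand-off can be modelled with a plain spinlock-protected list.  The demo_* names are placeholders rather than the real io-wq structures, but the point is the same: the unlock/lock pair already provides RELEASE/ACQUIRE ordering for the work item, even on weakly ordered hardware:

#include <linux/spinlock.h>
#include <linux/list.h>

struct demo_work {
	struct list_head list;
	/* payload would live here */
};

struct demo_wq {
	spinlock_t lock;
	struct list_head work_list;
};

/* Producer: everything written to *work before the unlock is visible
 * to whichever worker later takes the lock and dequeues it. */
static void demo_enqueue(struct demo_wq *wq, struct demo_work *work)
{
	spin_lock(&wq->lock);
	list_add_tail(&work->list, &wq->work_list);
	spin_unlock(&wq->lock);		/* RELEASE: publishes *work */
}

/* Consumer: the lock acquisition pairs with the producer's unlock. */
static struct demo_work *demo_dequeue(struct demo_wq *wq)
{
	struct demo_work *work = NULL;

	spin_lock(&wq->lock);		/* ACQUIRE */
	if (!list_empty(&wq->work_list)) {
		work = list_first_entry(&wq->work_list, struct demo_work, list);
		list_del(&work->list);
	}
	spin_unlock(&wq->lock);
	return work;
}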
>>>>>>
>>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>>> that point. Within the existing code...
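
The "mm dance" mentioned here looks roughly like the following.  This is a hypothetical helper, not the actual test patch: demo_punt, demo_req and demo_do_rw() are made-up names, while kthread_use_mm()/kthread_unuse_mm() are the real kernel interfaces for temporarily adopting another task's address space:

#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/completion.h>

struct demo_punt {
	struct mm_struct	*mm;	/* submitter's address space */
	struct demo_req		*req;	/* the deferred read/write */
	int			ret;
	struct completion	done;
};

/*
 * Hypothetical kthread body: the kthread has no user address space of
 * its own, so it adopts the submitter's mm before touching user buffers
 * and drops it again afterwards.
 */
static int demo_kthread_issue(void *data)
{
	struct demo_punt *p = data;

	kthread_use_mm(p->mm);
	p->ret = demo_do_rw(p->req);	/* perform the blocking operation */
	kthread_unuse_mm(p->mm);

	complete(&p->done);		/* wake the submitting task */
	return 0;
}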
>>>>>
>>>>> Would you happen to have that patch still?  It would provide a
>>>>> possible starting point for figuring out the exact difference.  If not
>>>>> I guess I could hack something similar up.
>>>>
>>>> Let me see if I can find it, and make sure it applies on the current
>>>> tree. I'll send you one in a bit.
>>>
>>> Wrote a new one. This one has two different ways it can work:
>>>
>>> 1) By default, it uses the native io workers still, but rather than add
>>> it to a list of pending items, it creates a new worker for each work
>>> item. This means all writes that would've gone to io-wq will now just
>>> fork a native worker and perform the write in a blocking fashion.
>>>
>>> 2) The fallback path for the above is that we punt it to a kthread,
>>> which does the mm dance. This is similar to what we did before the
>>> native workers. The fallback path is only hit if the worker creation
>>> fails, but you can make it happen every time by just uncommenting that
>>> return 1 in io_rewrite_io_thread().
>>>
>>> The interesting thing about approach 1 is that while it still uses the
>>> native workers, it will not need to be processing task_work and hence
>>> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
>>> forks off a new worker every time, which does the work, then exits.
>>>
>>> First try it as-is and see if that reproduces the issue. If it does,
>>> then try uncommenting that return 1 mentioned in #2 above.
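
For reference, the difference between the two notification modes mentioned above comes down to the mode argument passed to task_work_add(); the callback and helper below are toy placeholders, not io_uring's actual usage:

#include <linux/sched.h>
#include <linux/task_work.h>

static void demo_tw_cb(struct callback_head *cb)
{
	/* runs later, in the context of the target task */
}

/*
 * With TWA_SIGNAL the target task is kicked right away (which can mean
 * an IPI if it is running on another CPU); with TWA_SIGNAL_NO_IPI the
 * work is queued and the notification flag set, but the task only
 * notices it at its next natural transition point.
 */
static int demo_notify(struct task_struct *task, struct callback_head *cb,
		       bool kick_now)
{
	init_task_work(cb, demo_tw_cb);
	return task_work_add(task, cb,
			     kick_now ? TWA_SIGNAL : TWA_SIGNAL_NO_IPI);
}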
>> 
>> OK, so those two test cases worked, *but* I would have expected that,
>> since when we go through the system workqueue (with its associated
>> delays) it is functionally the same as inserting the udelay(1000) at
> 
> No we don't! For case 1, we use the exact same mechanism as the stock
> kernel, the only difference is that we'll always hand it to a new
> worker. Will that slow down some writes? Certainly. Because before that
> patch we'd potentially hand it over to an existing worker immediately.
> E.g. as soon as we did the spin_unlock() in io_wq_enqueue(), a new worker
> could grab that same lock and start processing the work. We're not
> talking a 1ms delay here; it's way shorter than that. This again suggests
> to me that it might be an ordering or barrier issue, but at the same time,
> I've run with smp_mb() before and after insertion AND on retrieval of
> the work item on the other side, and it triggers the issue even then.
> 
> On top of that, if we pre-create the workers, then we've already
> established that the issue does not occur. With pre-created workers,
> there's no extra delay between handing off the write and issuing it,
> like we have with the test patch. In fact, it works the _exact_ same way
> that the stock kernel does, except you don't have workers exiting or
> being created. To me, this tells me that it cannot be a memory ordering
> issue. If it was, we'd 100% see it for that case too, as we have all the
> same handoff and execution as we did before. The only difference is that
> we don't have an IPI for worker creation, and we don't have workers
> exiting when they time out.
<snip>
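
The barrier experiment described a few lines up amounts to roughly the following instrumentation, shown against the simplified demo queue sketched earlier in this thread rather than the real io_wq_enqueue() and worker loop.  That full barriers on both sides made no difference is part of the argument against a plain memory-ordering bug:

/* Builds on the demo_wq/demo_work sketch above; smp_mb() comes in via
 * the usual kernel barrier headers. */
static void demo_enqueue_paranoid(struct demo_wq *wq, struct demo_work *work)
{
	smp_mb();			/* before insertion */
	demo_enqueue(wq, work);		/* lock, list_add_tail, unlock */
	smp_mb();			/* after insertion */
}

static struct demo_work *demo_dequeue_paranoid(struct demo_wq *wq)
{
	struct demo_work *work;

	smp_mb();			/* before retrieval */
	work = demo_dequeue(wq);
	smp_mb();			/* after retrieval */
	return work;
}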

OK, fair enough, I've been at this long enough that I forgot about the precreated workers "fixing" things.  Let me step back a bit and meta-analyze everything we've learned; even though a lot of it doesn't make sense yet, maybe there's a pattern somewhere in all the noise.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Regression in io_uring, leading to data corruption
  2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
@ 2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
  0 siblings, 0 replies; 95+ messages in thread
From: Timothy Pearson @ 2023-11-19  0:16 UTC (permalink / raw)
  To: Timothy Pearson
  Cc: Jens Axboe, regressions, Pavel Begunkov, Michael Ellerman, npiggin



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Jens Axboe" <axboe@kernel.dk>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 9:54:50 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> 
> OK, fair enough, I've been at this long enough that I forgot about the
> precreated workers "fixing" things.  Let me step back a bit and meta-analyze
> everything we've learned; even though a lot of it doesn't make sense yet,
> maybe there's a pattern somewhere in all the noise.

To close the loop on this long thread, I finally found the root cause and submitted a correct patch here:

https://lore.kernel.org/linuxppc-dev/1105090647.48374193.1700351103830.JavaMail.zimbra@raptorengineeringinc.com/T/#u

500+ loops and counting on this boot alone.  I think we're good now.  It was a ppc64-specific bug in the end: corruption of a specific FPU register.  It was definitely provoked by specific signals being sent, but it also had a somewhat wider scope than just that, which made debugging quite difficult.

^ permalink raw reply	[flat|nested] 95+ messages in thread
