From: Timothy Pearson <tpearson@raptorengineering.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: regressions <regressions@lists.linux.dev>,
	 Pavel Begunkov <asml.silence@gmail.com>,
	 Michael Ellerman <mpe@ellerman.id.au>,
	npiggin <npiggin@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Wed, 15 Nov 2023 21:54:50 -0600 (CST)
Message-ID: <368367067.47661779.1700106890793.JavaMail.zimbra@raptorengineeringinc.com>
In-Reply-To: <5fb3fb19-cb9b-42d0-8949-7a9be8750682@kernel.dk>



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Sent: Wednesday, November 15, 2023 9:46:01 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/15/23 8:28 PM, Timothy Pearson wrote:
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 1:00:53 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/15/23 11:30 AM, Jens Axboe wrote:
>>>> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>, "Michael Ellerman"
>>>>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>>>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>
>>>>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>>>>> I haven't had much success in getting the IPI path to work properly,
>>>>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>>>>> currently soliciting feedback on what else might be going wrong at
>>>>>>> this point since I've already spent a couple of weeks on this and am
>>>>>>> not sure how much more time I can spend before we just have to shut
>>>>>>> io_uring down on ppc64 for the foreseeable future.
>>>>>>>
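For concreteness, the experiment described above comes down to the
notification mode passed to task_work_add(). A minimal sketch, using a
generic callback_head and a hypothetical helper rather than the actual
io_uring call site:

    #include <linux/task_work.h>

    /* Sketch only: queue a (hypothetical) work item for @task, with or
     * without the IPI kick.  TWA_SIGNAL sets the signal flag AND sends
     * an IPI so the task runs the work immediately; with it, the
     * corruption reproduces.  TWA_SIGNAL_NO_IPI only sets the flag, so
     * the task picks the work up on its next natural kernel exit; with
     * it, the corruption no longer reproduces.
     */
    static int demo_queue_tw(struct task_struct *task,
    			     struct callback_head *cb, bool use_ipi)
    {
    	return task_work_add(task, cb,
    			     use_ipi ? TWA_SIGNAL : TWA_SIGNAL_NO_IPI);
    }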
>>>>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>>>>> function, however no amount of memory barriers in the io_queue_async()
>>>>>>> path (including in the kbuf recycling code) will fully resolve the
>>>>>>> problem.
>>>>>>>
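And the udelay() experiment, roughly; the io_queue_sqe() body below is
paraphrased from mainline io_uring and may not match the tree under test
exactly:

    #include <linux/delay.h>

    static inline void io_queue_sqe(struct io_kiocb *req)
    {
    	int ret;

    	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK | IO_URING_F_COMPLETE_DEFER);
    	if (unlikely(ret)) {
    		udelay(1000);	/* the 1ms delay that makes the corruption disappear */
    		io_queue_async(req, ret);
    	}
    }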
>>>>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>>>>> amount of workers being created overall?  I know with some of the
>>>>>>> other delay locations worker allocation was changing, from what I see
>>>>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>>>>> other thread to complete before moving on that might be valuable
>>>>>>> information -- would also potentially tie in to the IPI path still
>>>>>>> malfunctioning as the worker would immediately start executing.
>>>>>>
>>>>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>>>>> then yes, doing a 1ms delay in there would ultimately lead to a 1ms
>>>>>> delay before we either pass to an existing worker or create a new one.
>>>>>>
>>>>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>>>>> list guaranteed, especially on weak memory model systems?  As I
>>>>>>> understand it, different workers running on different cores could
>>>>>>> potentially be interacting with the same kiocb request and the same
>>>>>>> buffer list, and that does dovetail with the fact that punting to a
>>>>>>> different I/O worker (usually on another core) seems to provoke the
>>>>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>>>>> functions without too much success -- it seemed to help somewhat, but
>>>>>>> nowhere near complete resolution, and the buffers are used in a number
>>>>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>>>>> feedback on this concept before going down yet another rabbit hole...
>>>>>>
>>>>>> This relies on the fact that we grab the wq lock before inserting this
>>>>>> work, and the unlocking will be a barrier. It's important to note that
>>>>>> this isn't any different than from before io-wq was using native
>>>>>> workers, the only difference is that it used to be kthreads before, and
>>>>>> now it's native threads to the application. The kthreads did a bunch of
>>>>>> work to assume the necessary identity to do the read or write operation
>>>>>> (which is ultimately why that approach went away, as it was just
>>>>>> inherently unsafe), whereas the native threads do not as they already
>>>>>> have what they need.
>>>>>>
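In other words, this is standard lock-based publication. A self-contained
sketch with illustrative names (not the actual io-wq structures):

    #include <linux/spinlock.h>
    #include <linux/list.h>

    struct demo_wq {
    	spinlock_t lock;
    	struct list_head work_list;
    };

    /* Producer: the UNLOCK is a release barrier, so everything written
     * to the work item before it is visible to whoever takes the lock
     * next.
     */
    static void demo_enqueue(struct demo_wq *wq, struct list_head *work)
    {
    	spin_lock(&wq->lock);
    	list_add_tail(work, &wq->work_list);
    	spin_unlock(&wq->lock);
    }

    /* Consumer: the LOCK is the pairing acquire barrier. */
    static struct list_head *demo_dequeue(struct demo_wq *wq)
    {
    	struct list_head *work = NULL;

    	spin_lock(&wq->lock);
    	if (!list_empty(&wq->work_list)) {
    		work = wq->work_list.next;
    		list_del_init(work);
    	}
    	spin_unlock(&wq->lock);
    	return work;
    }

This holds on weakly ordered machines like ppc64 as well: the
acquire/release semantics of the spinlock are exactly what makes the
cross-CPU hand-off of the request legal without extra smp_mb() calls.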
>>>>>> I had a patch that just punted to a kthread and did the necessary
>>>>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>>>>> that point. Within the existing code...
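The pattern being referred to is the usual one for doing user-buffer I/O
from a kernel thread; a minimal sketch with hypothetical names:

    #include <linux/kthread.h>
    #include <linux/sched/mm.h>

    struct punt_ctx {
    	struct mm_struct *mm;	/* submitter's mm, reference held by caller */
    	struct io_kiocb *req;	/* the op to perform; hypothetical hand-off */
    };

    static int demo_punt_fn(void *data)
    {
    	struct punt_ctx *ctx = data;

    	kthread_use_mm(ctx->mm);	/* adopt the submitter's address space */
    	/* ... perform the blocking read/write for ctx->req here ... */
    	kthread_unuse_mm(ctx->mm);	/* and drop it again */
    	return 0;
    }

    /* usage, e.g.: kthread_run(demo_punt_fn, ctx, "io-punt"); */

This is roughly what the pre-native-worker io-wq had to do (along with
assuming the submitter's identity, which is why that approach was
retired), and what the fallback path in the test patch below resurrects.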
>>>>>
>>>>> Would you happen to have that patch still?  It would provide a
>>>>> possible starting point for figuring out the exact difference.  If not
>>>>> I guess I could hack something similar up.
>>>>
>>>> Let me see if I can find it, and make sure it applies on the current
>>>> tree. I'll send you one in a bit.
>>>
>>> Wrote a new one. This one has two different ways it can work:
>>>
>>> 1) By default, it uses the native io workers still, but rather than add
>>> it to a list of pending items, it creates a new worker for each work
>>> item. This means all writes that would've gone to io-wq will now just
>>> fork a native worker and perform the write in a blocking fashion.
>>>
>>> 2) The fallback path for the above is that we punt it to a kthread,
>>> which does the mm dance. This is similar to what we did before the
>>> native workers. The fallback path is only hit if the worker creation
>>> fails, but you can make it happen every time by just uncommenting that
>>> return 1 in io_rewrite_io_thread().
>>>
>>> The interesting thing about approach 1 is that while it still uses the
>>> native workers, it will not need to be processing task_work and hence
>>> signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get it done. It simply
>>> forks off a new worker every time, which does the work, then exits.
>>>
>>> First try it as-is and see if that reproduces the issue. If it does,
>>> then try uncommenting that return 1 mentioned in #2 above.
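Since the patch itself isn't quoted in this message, here is a hedged
reconstruction of the dispatch logic it describes; apart from
io_rewrite_io_thread(), the names are guesses:

    /* Mode 1: spawn a fresh native io worker for this single work item;
     * it performs the write in blocking fashion, then exits.  No pending
     * list, no task_work, no TWA_SIGNAL/TWA_SIGNAL_NO_IPI.
     */
    static int io_rewrite_io_thread(struct io_kiocb *req)
    {
    	/* return 1; */		/* uncomment to force the kthread fallback */
    	return demo_spawn_one_shot_worker(req);	/* hypothetical helper */
    }

    static void io_rewrite_queue(struct io_kiocb *req)
    {
    	if (io_rewrite_io_thread(req))
    		demo_punt_to_kthread(req);	/* mode 2: kthread + mm dance */
    }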
>> 
>> OK, so those two test cases worked, *but* I would have expected that,
>> since when we go through the system workqueue (with its associated
>> delays) it is functionally the same as inserting the udelay(1000) at
> 
> No we don't! For case 1, we use the exact same mechanism as the stock
> kernel, the only difference is that we'll always hand it to a new
> worker. Will that slow down some writes, certainly. Because before that
> patch we'd potentially hand it over to an existing worker immediately.
> Eg as soon as we did the spin_unlock() in io_wq_enqueue(), a new worker
> could grab that same lock and start processing the work. We're not
> talking a 1ms delay here, it's way shorter than that. This again
> suggests to me that it might be an ordering or barrier issue, but at the
> same time, I've run with smp_mb() before and after insertion AND on
> retrieval of the work item on the other side, and it triggers the issue
> even then.
> 
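The barrier experiment mentioned above, sketched against the demo queue
from earlier; if a missing barrier were the whole story, this brute-force
version should have fixed it, but the issue still triggered:

    static void demo_enqueue_paranoid(struct demo_wq *wq, struct list_head *work)
    {
    	spin_lock(&wq->lock);
    	smp_mb();			/* full barrier before insertion ... */
    	list_add_tail(work, &wq->work_list);
    	smp_mb();			/* ... and after */
    	spin_unlock(&wq->lock);
    }

    /* plus matching smp_mb() calls around the list read in
     * demo_dequeue() on the worker side
     */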
> On top of that, if we pre-create the workers, then we've already
> established that the issue does not occur. With pre-created workers,
> there's no extra delay between handing off the write and issuing it,
> like we have with the test patch. In fact, it works the _exact_ same way
> that the stock kernel does, except you don't have workers exiting or
> being created. This tells me that it cannot be a memory ordering
> issue. If it was, we'd 100% see it for that case too, as we have all the
> same handoff and execution as we did before. The only difference is that
> we don't have an IPI for worker creation, and we don't have workers
> exiting when they time out.
<snip>

OK, fair enough. I've been at this long enough that I forgot about the pre-created workers "fixing" things. Let me step back a bit and meta-analyze everything we've learned; even though a lot of it doesn't make sense, maybe there's a pattern somewhere in all the noise.
