regressions.lists.linux.dev archive mirror
From: Timothy Pearson <tpearson@raptorengineering.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: regressions <regressions@lists.linux.dev>,
	 Pavel Begunkov <asml.silence@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Mon, 13 Nov 2023 18:52:51 -0600 (CST)
Message-ID: <1270025902.47126081.1699923171139.JavaMail.zimbra@raptorengineeringinc.com>
In-Reply-To: <7f74cf55-d1ca-482b-95c3-27b02996d999@kernel.dk>



----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:13:10 PM
> Subject: Re: Regression in io_uring, leading to data corruption

> On 11/13/23 5:04 PM, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 5:48:12 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>> 
>>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>>
>>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>>
>>>>>>>>>> Quick status update -- after considerable effort I've managed to
>>>>>>>>>> narrow down what is going wrong, but I still need to locate the root
>>>>>>>>>> cause.  The bug is incredibly timing-dependent, so it is difficult to
>>>>>>>>>> instrument the code paths I need without causing it to disappear.
>>>>>>>>>>
>>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>>> the io worker punting process.
>>>>>>>>>>
>>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>>> additional corruption.  The page verification routine decrypts the
>>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>>
>>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>>
>>>>>>>>>> As always, investigation continues...
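
[Illustration only, not taken from the thread and not MariaDB code: the kind of "check the page right before encryption" instrumentation described above. PAGE_SZ and the function names are assumptions for the sketch; as noted, adding a check like this perturbs timing enough that the wild write is masked rather than caught.]

/*
 * Sketch: snapshot a page when it is produced, then compare it again
 * immediately before encryption to detect anything that scribbled on it
 * in between.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SZ 16384			/* InnoDB default page size */

static uint8_t snapshot[PAGE_SZ];

void page_produced(const uint8_t *page)
{
	memcpy(snapshot, page, PAGE_SZ);
}

int page_intact_before_encrypt(const uint8_t *page)
{
	for (size_t i = 0; i < PAGE_SZ; i++) {
		if (page[i] != snapshot[i]) {
			fprintf(stderr, "page changed at offset %zu\n", i);
			return 0;
		}
	}
	return 1;
}
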
>>>>>>>>>
>>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>>> specific copy section.
>>>>>>>>>
>>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>>
>>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>>
>>>>>>>>>    fork: Generalize PF_IO_WORKER handling
>>>>>>>>>
>>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>>
>>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>>
>>>>>>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>>
>>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>>> and see if it makes any difference.
>>>>>>>>>
>>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>>> in that specific buffer.
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>>>>>>>>>
>>>>>>>>> 	klp_init_thread_info(p);
>>>>>>>>>
>>>>>>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>>> 		/* kernel thread */
>>>>>>>>>
>>>>>>>>> 		/* Create initial minimum stack frame. */
>>>>>>>>
>>>>>>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>>>>>>> welcome while I continue to debug...
>>>>>>>
>>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>>> and causing user memory corruption.
>>>>>>
>>>>>> Indeed, I'm really scratching my head on this one.  The corruption is
>>>>>> definitely present, and quite sensitive to exact timing around page
>>>>>> encryption start -- presumably if the wild write happens after the
>>>>>> encryption routine finishes, it no longer matters for this particular
>>>>>> test suite.
>>>>>>
>>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>>> punt at least once per write -- no change in how often the corruption
>>>>>> occurred.  I also hacked it up to try to keep I/O workers around vs.
>>>>>> constantly tearing them down and respawning them, with no real change
>>>>>> observed in corruption frequency, possibly because even with that we
>>>>>> still end up creating a new I/O worker every so often.
>>>>>>
>>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>>> indicating that something in thread setup or teardown is stomping on
>>>>>> memory.  When I say "aggressively exit I/O worker threads", I
>>>>>> basically did this:
>>>>>
>>>>> It's been my suspicion, as per previous email from today, that this is
>>>>> related to worker creation on ppc. You can try this patch, which just
>>>>> pre-creates workers and doesn't let them time out. That means that the max
>>>>> number of workers for bounded work is pre-created before the ring is
>>>>> used, so we'll never see any worker creation. If this works, then it's
>>>>> certainly something related to worker creation.
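
[Editorial aside, not part of the exchange above: the closest stock userspace knob is liburing's io_uring_register_iowq_max_workers(), which caps the number of bounded/unbounded io-wq workers per ring. It does not pre-create workers or stop them from timing out, so it is not equivalent to the test patch being discussed -- a minimal sketch, assuming liburing >= 2.1:]

#include <liburing.h>
#include <stdio.h>

int cap_workers(struct io_uring *ring)
{
	/* vals[0] = bounded (regular file I/O) workers, vals[1] = unbounded */
	unsigned int vals[2] = { 4, 4 };
	int ret = io_uring_register_iowq_max_workers(ring, vals);

	if (ret < 0) {
		fprintf(stderr, "register_iowq_max_workers: %d\n", ret);
		return ret;
	}
	/* on success, vals[] is updated with the previous limits */
	printf("previous limits: bounded=%u unbounded=%u\n", vals[0], vals[1]);
	return 0;
}
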
>>>>
>>>> Yep, that makes the issue disappear.  I wish I knew if it was
>>>> always stepping on memory somewhere and it just hits unimportant
>>>> process memory most of the time, or if it's only stepping on memory
>>>> iff the tight timing conditions are met.
>>>>
>>>> Technically it could be either worker creation or worker destruction.
>>>> Any quick way to distinguish between the two?  E.g. create threads,
>>>> allow them to stop processing by timing out, but never tear them down
>>>> somehow?  Obviously we'd eventually exhaust the system thread resource
>>>> limits, but for a quick test it might be enough?
>>>
>>> Sure, we could certainly do that. Something like the below should do
>>> that, it goes through the normal teardown on timeout, but doesn't
>>> actually call do_exit() until the wq is being torn down anyway. That
>>> should ensure that we create workers as we need them, but when they time
>>> out, they won't actually exit until we are tearing down anyway on ring
>>> exit.
>>>
>>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>>> like this without some kind of page fault occurring.  If we can
>>>> isolate things to thread creation vs. thread teardown, I'll go
>>>> function by function and see what is going wrong.
>>>
>>> Agree, it's very odd.
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..51a82daaac36 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -16,6 +16,7 @@
>>> #include <linux/task_work.h>
>>> #include <linux/audit.h>
>>> #include <linux/mmu_context.h>
>>> +#include <linux/delay.h>
>>> #include <uapi/linux/io_uring.h>
>>>
>>> #include "io-wq.h"
>>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>>
>>> 	kfree_rcu(worker, rcu);
>>> 	io_worker_ref_put(wq);
>>> +
>>> +	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>>> +		msleep(500);
>>> 	do_exit(0);
>>> }
>> 
>> Thanks, was working up something similar but neglected the workqueue
>> exit so got a hang.  With this patch, I still see the corruption, but
>> all that's really telling me is that the core code inside do_exit() is
>> OK (including, hopefully, the arch-specific stuff).  I'd really like
>> to rule out the rest of the code in io_worker_exit(), is there a way
>> to (easily) tell the workqueue to ignore a worker entirely without
>> going through all the teardown in io_worker_exit() (and specifically
>> the cancellation / release code)?
> 
> You could try this one - totally untested...

Seemed to do what I wanted; however, the corruption remains.  Was worth a try...

For now I'll proceed under the assumption that thread setup is somehow stomping on userspace memory.
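
[Editorial aside, illustration only: the kernel hacks discussed above force punting from inside the kernel, but the io-wq worker create/exit path can also be exercised from plain userspace by submitting requests with IOSQE_ASYNC, which punts each request to an io worker instead of issuing it inline. A minimal liburing example; the file name and buffer size are arbitrary:]

/* Build: gcc -O2 -o punt punt.c -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	int fd, ret;

	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0xaa, sizeof(buf));

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	/* force this request through an io-wq worker rather than inline issue */
	io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0) {
		fprintf(stderr, "wait_cqe: %d\n", ret);
		return 1;
	}
	printf("write result: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}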


Thread overview: 95+ messages
2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
2023-11-07 16:49 ` Jens Axboe
2023-11-07 16:57   ` Timothy Pearson
2023-11-07 17:14     ` Jens Axboe
2023-11-07 21:22 ` Jens Axboe
2023-11-07 21:39   ` Timothy Pearson
2023-11-07 21:46     ` Jens Axboe
2023-11-07 22:07       ` Timothy Pearson
2023-11-07 22:16         ` Jens Axboe
2023-11-07 22:29           ` Timothy Pearson
2023-11-07 22:44             ` Jens Axboe
2023-11-07 23:12               ` Timothy Pearson
2023-11-07 23:16                 ` Jens Axboe
2023-11-07 23:34                   ` Timothy Pearson
2023-11-07 23:52                     ` Jens Axboe
2023-11-08  0:02                       ` Timothy Pearson
2023-11-08  0:09                         ` Jens Axboe
2023-11-08  3:27                           ` Timothy Pearson
2023-11-08  3:30                             ` Timothy Pearson
2023-11-08  4:00                           ` Timothy Pearson
2023-11-08 15:10                             ` Jens Axboe
2023-11-08 15:14                               ` Jens Axboe
2023-11-08 17:10                                 ` Timothy Pearson
2023-11-08 17:26                                   ` Jens Axboe
2023-11-08 17:40                                     ` Timothy Pearson
2023-11-08 17:49                                       ` Jens Axboe
2023-11-08 17:57                                         ` Jens Axboe
2023-11-08 18:36                                           ` Timothy Pearson
2023-11-08 18:51                                             ` Timothy Pearson
2023-11-08 19:08                                               ` Jens Axboe
2023-11-08 19:06                                             ` Jens Axboe
2023-11-08 22:05                                               ` Jens Axboe
2023-11-08 22:15                                                 ` Timothy Pearson
2023-11-08 22:18                                                   ` Jens Axboe
2023-11-08 22:28                                                     ` Timothy Pearson
2023-11-08 23:58                                                     ` Jens Axboe
2023-11-09 15:12                                                       ` Jens Axboe
2023-11-09 17:00                                                         ` Timothy Pearson
2023-11-09 17:17                                                           ` Jens Axboe
2023-11-09 17:24                                                             ` Timothy Pearson
2023-11-09 17:30                                                               ` Jens Axboe
2023-11-09 17:36                                                                 ` Timothy Pearson
2023-11-09 17:38                                                                   ` Jens Axboe
2023-11-09 17:42                                                                     ` Timothy Pearson
2023-11-09 17:45                                                                       ` Jens Axboe
2023-11-09 18:20                                                                         ` tpearson
2023-11-10  3:51                                                                           ` Jens Axboe
2023-11-10  4:35                                                                             ` Timothy Pearson
2023-11-10  6:48                                                                               ` Timothy Pearson
2023-11-10 14:52                                                                                 ` Jens Axboe
2023-11-11 18:42                                                                                   ` Timothy Pearson
2023-11-11 18:58                                                                                     ` Jens Axboe
2023-11-11 19:04                                                                                       ` Timothy Pearson
2023-11-11 19:11                                                                                         ` Jens Axboe
2023-11-11 19:15                                                                                           ` Timothy Pearson
2023-11-11 19:23                                                                                             ` Jens Axboe
2023-11-11 21:57                                                                                     ` Timothy Pearson
2023-11-13 17:06                                                                                       ` Timothy Pearson
2023-11-13 17:39                                                                                         ` Jens Axboe
2023-11-13 19:02                                                                                           ` Timothy Pearson
2023-11-13 19:29                                                                                             ` Jens Axboe
2023-11-13 20:58                                                                                               ` Timothy Pearson
2023-11-13 21:22                                                                                                 ` Timothy Pearson
2023-11-13 22:15                                                                                                 ` Jens Axboe
2023-11-13 23:19                                                                                                   ` Timothy Pearson
2023-11-13 23:48                                                                                                     ` Jens Axboe
2023-11-14  0:04                                                                                                       ` Timothy Pearson
2023-11-14  0:13                                                                                                         ` Jens Axboe
2023-11-14  0:52                                                                                                           ` Timothy Pearson [this message]
2023-11-14  5:06                                                                                                             ` Timothy Pearson
2023-11-14 13:17                                                                                                               ` Jens Axboe
2023-11-14 16:59                                                                                                                 ` Timothy Pearson
2023-11-14 17:04                                                                                                                   ` Jens Axboe
2023-11-14 17:14                                                                                                                     ` Timothy Pearson
2023-11-14 17:17                                                                                                                       ` Jens Axboe
2023-11-14 17:21                                                                                                                         ` Timothy Pearson
2023-11-14 17:57                                                                                                                           ` Timothy Pearson
2023-11-14 18:02                                                                                                                             ` Jens Axboe
2023-11-14 18:12                                                                                                                               ` Timothy Pearson
2023-11-14 18:26                                                                                                                                 ` Jens Axboe
2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
2023-11-15 16:46                                                                                                                                     ` Jens Axboe
2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
2023-11-15 18:30                                                                                                                                         ` Jens Axboe
2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
2023-11-15 18:37                                                                                                                                             ` Jens Axboe
2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
2023-11-15 19:00                                                                                                                                           ` Jens Axboe
2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
2023-11-16  3:46                                                                                                                                               ` Jens Axboe
2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
2023-11-13 20:47                                                                                         ` Jens Axboe
2023-11-13 21:08                                                                                           ` Timothy Pearson
2023-11-10 14:48                                                                               ` Jens Axboe
