From: Timothy Pearson <tpearson@raptorengineering.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: regressions <regressions@lists.linux.dev>,
Pavel Begunkov <asml.silence@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Mon, 13 Nov 2023 18:52:51 -0600 (CST) [thread overview]
Message-ID: <1270025902.47126081.1699923171139.JavaMail.zimbra@raptorengineeringinc.com> (raw)
In-Reply-To: <7f74cf55-d1ca-482b-95c3-27b02996d999@kernel.dk>
----- Original Message -----
> From: "Jens Axboe" <axboe@kernel.dk>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
> Sent: Monday, November 13, 2023 6:13:10 PM
> Subject: Re: Regression in io_uring, leading to data corruption
> On 11/13/23 5:04 PM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>> <asml.silence@gmail.com>
>>> Sent: Monday, November 13, 2023 5:48:12 PM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/13/23 4:19 PM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Monday, November 13, 2023 4:15:44 PM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/13/23 1:58 PM, Timothy Pearson wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Monday, November 13, 2023 1:29:38 PM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>
>>>>>>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>>>>>>> "Pavel Begunkov"
>>>>>>>>>>> <asml.silence@gmail.com>
>>>>>>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately I got led down a rabbit hole here. With the tests I was running,
>>>>>>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>>>>>>> header and checksums, and the internal encrypted data on disk was corrupt while
>>>>>>>>>>> the outer page checksum was still valid.
>>>>>>>>>>>
>>>>>>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>>>>>>> corruption more easily in the on-disk format. We are still apparently dealing
>>>>>>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>>>>>>> corruption observed on production systems.
>>>>>>>>>>
>>>>>>>>>> Quick status update -- after considerable effort applied I've managed
>>>>>>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>>>>>>> cause. The bug is incredibly timing-dependent; therefore it is
>>>>>>>>>> difficult to instrument the code paths I need without causing it to
>>>>>>>>>> disappear.
>>>>>>>>>>
>>>>>>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>>>>>>> by the exact timing of some of the encryption tests in mariadb. I've
>>>>>>>>>> caught the wild write a few times now, it is not in the standard
>>>>>>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>>>>>>> the io worker punting process.
>>>>>>>>>>
>>>>>>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>>>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>>>>>>> one of the mariadb in-RAM database pages at a random offset. This
>>>>>>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>>>>>>> via io_uring. I have confirmed that the post-encryption data in RAM
>>>>>>>>>> is written to disk without any additional corruption, and is then read
>>>>>>>>>> back out from disk into the page verification routine also without any
>>>>>>>>>> additional corruption. The page verification routine decrypts the
>>>>>>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>>>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>>>>>>> corruption warning and halt the test run.
>>>>>>>>>>
>>>>>>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>>>>>>> right before the encryption routine, the bug disappears (or, more
>>>>>>>>>> precisely, is masked). If I had to guess from these symptoms, I'd
>>>>>>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>>>>>>> context from somewhere, and scribbling some kind of status data into
>>>>>>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>>>>>>> database pages. This could be an application issue or a kernel issue,
>>>>>>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>>>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>>>>>>
>>>>>>>>>> As always, investigation continues...
>>>>>>>>>
>>>>>>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>>>>>>> necessarily ppc memory ordering related, but just something in the arch
>>>>>>>>> specific copy section.
>>>>>>>>>
>>>>>>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>>>>>>> then, there's been an attempt to make this generic:
>>>>>>>>>
>>>>>>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>>>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>>>> Date: Tue Apr 12 10:18:48 2022 -0500
>>>>>>>>>
>>>>>>>>> fork: Generalize PF_IO_WORKER handling
>>>>>>>>>
>>>>>>>>> and later a powerpc change related to that too:
>>>>>>>>>
>>>>>>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>>>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>>>>>> Date: Sat Mar 25 22:29:01 2023 +1000
>>>>>>>>>
>>>>>>>>> powerpc: copy_thread differentiate kthreads and user mode threads
>>>>>>>>>
>>>>>>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>>>>>>> the finer details of powerpc thread creation, but maybe try with this
>>>>>>>>> and see if it makes any difference.
>>>>>>>>>
>>>>>>>>> As you note in your reply, we could very well be corrupting some bytes
>>>>>>>>> somewhere every time. We just only notice quickly when it happens to be
>>>>>>>>> in that specific buffer.
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>>>>>>> index 392404688cec..d4dec2fd091c 100644
>>>>>>>>> --- a/arch/powerpc/kernel/process.c
>>>>>>>>> +++ b/arch/powerpc/kernel/process.c
>>>>>>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>>>>>>>>>
>>>>>>>>> klp_init_thread_info(p);
>>>>>>>>>
>>>>>>>>> - if (unlikely(p->flags & PF_KTHREAD)) {
>>>>>>>>> + if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>>>>>>> /* kernel thread */
>>>>>>>>>
>>>>>>>>> /* Create initial minimum stack frame. */
>>>>>>>>
>>>>>>>> Good idea, but didn't work unfortunately. Any other suggestions
>>>>>>>> welcome while I continue to debug...
>>>>>>>
>>>>>>> I ponder if it's still in there... I don't see what else could be poking
>>>>>>> and causing user memory corruption.
>>>>>>
>>>>>> Indeed, I'm really scratching my head on this one. The corruption is
>>>>>> definitely present, and quite sensitive to exact timing around page
>>>>>> encryption start -- presumably if the wild write happens after the
>>>>>> encryption routine finishes, it no longer matters for this particular
>>>>>> test suite.
>>>>>>
>>>>>> Trying to find sensitive areas in the kernel, I hacked it up to always
>>>>>> punt at least once per write -- no change in how often the corruption
>>>>>> occurred. I also hacked it up to try to keep I/O workers around vs.
>>>>>> constantly tearing them down and respawning them, with no real change
>>>>>> observed in corruption frequency, possibly because even with that we
>>>>>> still end up creating a new I/O worker every so often.
>>>>>>
>>>>>> What did have a major effect was hacking the kernel to both punt at
>>>>>> least once per write *and* to aggressively exit I/O worker threads,
>>>>>> indicating that something in thread setup or teardown is stomping on
>>>>>> memory. When I say "aggressively exit I/O worker threads", I
>>>>>> basically did this:
>>>>>
>>>>> It's been my suspicion, as per previous email from today, that this is
>>>>> related to worker creation on ppc. You can try this patch, which just
>>>>> pre-creates workers and don't let them time out. That means that the max
>>>>> number of workers for bounded work is pre-created before the ring is
>>>>> used, so we'll never see any worker creation. If this works, then it's
>>>>> certainly something related to worker creation.
>>>>
>>>> Yep, that makes the issue disappear. I wish I knew if it was
>>>> always stepping on memory somewhere and it just hits unimportant
>>>> process memory most of the time, or if it's only stepping on memory
>>>> iff the tight timing conditions are met.
>>>>
>>>> Technically it could be either worker creation or worker destruction.
>>>> Any quick way to distinguish between the two? E.g. create threads,
>>>> allow them to stop processing by timing out, but never tear them down
>>>> somehow? Obviously we'd eventually exhaust the system thread resource
>>>> limits, but for a quick test it might be enough?
>>>
>>> Sure, we could certainly do that. Something like the below should do
>>> that, it goes through the normal teardown on timeout, but doesn't
>>> actually call do_exit() until the wq is being torn down anyway. That
>>> should ensure that we create workers as we need them, but when they time
>>> out, they won't actually exit until we are tearing down anyway on ring
>>> exit.
>>>
>>>> I'm also a bit perplexed as to how we can be stomping on user memory
>>>> like this without some kind of page fault occurring. If we can
>>>> isolate things to thread creation vs. thread teardown, I'll go
>>>> function by function and see what is going wrong.
>>>
>>> Agree, it's very odd.
>>>
>>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>>> index 522196dfb0ff..51a82daaac36 100644
>>> --- a/io_uring/io-wq.c
>>> +++ b/io_uring/io-wq.c
>>> @@ -16,6 +16,7 @@
>>> #include <linux/task_work.h>
>>> #include <linux/audit.h>
>>> #include <linux/mmu_context.h>
>>> +#include <linux/delay.h>
>>> #include <uapi/linux/io_uring.h>
>>>
>>> #include "io-wq.h"
>>> @@ -239,6 +240,9 @@ static void io_worker_exit(struct io_worker *worker)
>>>
>>> kfree_rcu(worker, rcu);
>>> io_worker_ref_put(wq);
>>> +
>>> + while (!test_bit(IO_WQ_BIT_EXIT, &wq->state))
>>> + msleep(500);
>>> do_exit(0);
>>> }
>>
>> Thanks, was working up something similar but neglected the workqueue
>> exit so got a hang. With this patch, I still see the corruption, but
>> all that's really telling me is that the core code inside do_exit() is
>> OK (including, hopefully, the arch-specific stuff). I'd really like
>> to rule out the rest of the code in io_worker_exit(), is there a way
>> to (easily) tell the workqueue to ignore a worker entirely without
>> going through all the teardown in io_worker_exit() (and specifically
>> the cancellation / release code)?
>
> You could try this one - totally untested...
Seemed to do what I wanted; however, the corruption remains. Was worth a try...
I guess I'll proceed for now under the assumption that the thread setup is somehow stomping on userspace.