All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Timothy Pearson <tpearson@raptorengineering.com>
Cc: regressions <regressions@lists.linux.dev>,
	Pavel Begunkov <asml.silence@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Mon, 13 Nov 2023 15:15:44 -0700	[thread overview]
Message-ID: <483d8017-4d15-4367-bac4-0e420cb4eb54@kernel.dk> (raw)
In-Reply-To: <1843925555.47083836.1699909110545.JavaMail.zimbra@raptorengineeringinc.com>

On 11/13/23 1:58 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Monday, November 13, 2023 1:29:38 PM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/13/23 12:02 PM, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>> <asml.silence@gmail.com>
>>>> Sent: Monday, November 13, 2023 11:39:30 AM
>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>
>>>> On 11/13/23 10:06 AM, Timothy Pearson wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>>>>>> "Pavel Begunkov"
>>>>>> <asml.silence@gmail.com>
>>>>>> Sent: Saturday, November 11, 2023 3:57:23 PM
>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>> Unfortunately I got led down a rabbit hole here.  With the tests I was running,
>>>>>> MariaDB writes the encrypted data separately from the normal un-encrypted page
>>>>>> header and checksums, and the internal encrpyted data on disk was corrupt while
>>>>>> the outer page checksum was still valid.
>>>>>>
>>>>>> I have since switched to the main.xa_prepared_binlog_off test, which shows the
>>>>>> corruption more easily in the on-disk format.  We are still apparently dealing
>>>>>> with a write path issue, which makes more sense given the nature of the
>>>>>> corruption observed on production systems.
>>>>>
>>>>> Quick status update -- after considerable effort applied I've managed
>>>>> to narrow down what is going wrong, but still need to locate the root
>>>>> cause.  The bug is incredibly timing-dependent, therefore it is
>>>>> difficult to instrument the code paths I need without causing it to
>>>>> disappear.
>>>>>
>>>>> What we're dealing with is a wild write to RAM of some sort, provoked
>>>>> by the exact timing of some of the encryption tests in mariadb.  I've
>>>>> caught the wild write a few times now, it is not in the standard
>>>>> io_uring write path but instead appears to be triggered (somehow) by
>>>>> the io worker punting process.
>>>>>
>>>>> When the bug is hit, and if all other conditions are exactly correct,
>>>>> *something* (still to be identified) writes 32 bytes of gibberish into
>>>>> one of the mariadb in-RAM database pages at a random offset.  This
>>>>> wild write occurs right before the page is encrypted for write to disk
>>>>> via io_uring.  I have confirmed that the post-encryption data in RAM
>>>>> is written to disk without any additional corruption, and is then read
>>>>> back out from disk into the page verification routine also without any
>>>>> additional corruption.  The page verification routine decrypts the
>>>>> data from disk, thus restoring the decrypted data that contains the
>>>>> wild write data stamped somewhere on it, where we then hit the
>>>>> corruption warning and halt the test run.
>>>>>
>>>>> Irritatingly, if I try to instrument the data flow in the application
>>>>> right before the encryption routine, the bug disappears (or, more
>>>>> precisely, is masked).  If I had to guess from these symptoms, I'd
>>>>> suspect the application io worker thread is waking up, grabbing wrong
>>>>> context from somewhere, and scribbling some kind of status data into
>>>>> memory, which rarely ends up being on top of one of the in-RAM
>>>>> database pages.  This could be an application issue or a kernel issue,
>>>>> I'm not sure yet, but given the precise timing requirements I'm less
>>>>> and less surprised this is only showing on ppc64 right now.
>>>>>
>>>>> As always, investigation continues...
>>>>
>>>> I wonder if this has to do with copy_thread() on powerpc - so not
>>>> necessarily ppc memory ordering related, but just something in the arch
>>>> specific copy section.
>>>>
>>>> I took a look back, and the initial change actually forgot ppc. Since
>>>> then, there's been an attempt to make this generic:
>>>>
>>>> commit 5bd2e97c868a8a44470950ed01846cab6328e540
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Tue Apr 12 10:18:48 2022 -0500
>>>>
>>>>    fork: Generalize PF_IO_WORKER handling
>>>>
>>>> and later a powerpc change related to that too:
>>>>
>>>> commit eed7c420aac7fde5e5915d2747c3ebbbda225835
>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>> Date:   Sat Mar 25 22:29:01 2023 +1000
>>>>
>>>>    powerpc: copy_thread differentiate kthreads and user mode threads
>>>>
>>>> Just stabbing in the dark a bit here as I won't pretend to understand
>>>> the finer details of powerpc thread creation, but maybe try with this
>>>> and see if it makes any difference.
>>>>
>>>> As you note in your reply, we could very well be corrupting some bytes
>>>> somewhere every time. We just only notice quickly when it happens to be
>>>> in that specific buffer.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>>> index 392404688cec..d4dec2fd091c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1758,7 +1758,7 @@ int copy_thread(struct task_struct *p, const struct
>>>> kernel_clone_args *args)
>>>>
>>>> 	klp_init_thread_info(p);
>>>>
>>>> -	if (unlikely(p->flags & PF_KTHREAD)) {
>>>> +	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>>> 		/* kernel thread */
>>>>
>>>> 		/* Create initial minimum stack frame. */
>>>
>>> Good idea, but didn't work unfortunately.  Any other suggestions
>>> welcome while I continue to debug...
>>
>> I ponder if it's still in there... I don't see what else could be poking
>> and causing user memory corruption.
> 
> Indeed, I'm really scratching my head on this one.  The corruption is
> definitely present, and quite sensitive to exact timing around page
> encryption start -- presumably if the wild write happens after the
> encryption routine finishes, it no longer matters for this particular
> test suite.
> 
> Trying to find sensitive areas in the kernel, I hacked it up to always
> punt at least once per write -- no change in how often the corruption
> occurred.  I also hacked it up to try to keep I/O workers around vs.
> constantly tearing then down and respawning them, with no real change
> observed in corruption frequency, possibly because even with that we
> still end up creating a new I/O worker every so often.
> 
> What did have a major effect was hacking the kernel to both punt at
> least once per write *and* to aggressively exit I/O worker threads,
> indicating that something in thread setup or teardown is stomping on
> memory.  When I say "aggressively exit I/O worker threads", I
> basically did this:

It's been my suspicion, as per previous email from today, that this is
related to worker creation on ppc. You can try this patch, which just
pre-creates workers and don't let them time out. That means that the max
number of workers for bounded work is pre-created before the ring is
used, so we'll never see any worker creation. If this works, then it's
certainly something related to worker creation.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 522196dfb0ff..e87cabe5bbb7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -673,7 +673,7 @@ static int io_wq_worker(void *data)
 			break;
 		}
 		if (!ret) {
-			last_timeout = true;
+			// last_timeout = true;
 			exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
 							wq->cpu_mask);
 		}
@@ -947,8 +947,8 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	do_create = !io_wq_activate_free_worker(wq, acct);
 	rcu_read_unlock();
 
-	if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
-	    !atomic_read(&acct->nr_running))) {
+	if (0 && (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
+	    !atomic_read(&acct->nr_running)))) {
 		bool did_create;
 
 		did_create = io_wq_create_worker(wq, acct);
@@ -1138,6 +1138,33 @@ static int io_wq_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 	return 1;
 }
 
+static void pre_create_workers(struct io_wq *wq)
+{
+	struct io_wq_acct *acct = &wq->acct[IO_WQ_ACCT_BOUND];
+	int i, ret, to_create = acct->max_workers;
+
+	raw_spin_lock(&wq->lock);
+	acct->nr_workers = to_create;
+	atomic_add(to_create, &acct->nr_running);
+	atomic_add(to_create, &wq->worker_refs);
+	raw_spin_unlock(&wq->lock);
+
+	for (i = 0; i < acct->max_workers; i++) {
+		ret = create_io_worker(wq, IO_WQ_ACCT_BOUND);
+		if (WARN_ON_ONCE(!ret))
+			break;
+	}
+
+	if (i != to_create) {
+		to_create -= i;
+		raw_spin_lock(&wq->lock);
+		acct->nr_workers -= to_create;
+		atomic_sub(to_create, &acct->nr_running);
+		atomic_sub(to_create, &wq->worker_refs);
+		raw_spin_unlock(&wq->lock);
+	}
+}
+
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
 	int ret, i;
@@ -1187,6 +1214,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	if (ret)
 		goto err;
 
+	pre_create_workers(wq);
 	return wq;
 err:
 	io_wq_put_hash(data->hash);

-- 
Jens Axboe


  parent reply	other threads:[~2023-11-13 22:15 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-07 16:34 Regression in io_uring, leading to data corruption Timothy Pearson
2023-11-07 16:49 ` Jens Axboe
2023-11-07 16:57   ` Timothy Pearson
2023-11-07 17:14     ` Jens Axboe
2023-11-07 21:22 ` Jens Axboe
2023-11-07 21:39   ` Timothy Pearson
2023-11-07 21:46     ` Jens Axboe
2023-11-07 22:07       ` Timothy Pearson
2023-11-07 22:16         ` Jens Axboe
2023-11-07 22:29           ` Timothy Pearson
2023-11-07 22:44             ` Jens Axboe
2023-11-07 23:12               ` Timothy Pearson
2023-11-07 23:16                 ` Jens Axboe
2023-11-07 23:34                   ` Timothy Pearson
2023-11-07 23:52                     ` Jens Axboe
2023-11-08  0:02                       ` Timothy Pearson
2023-11-08  0:09                         ` Jens Axboe
2023-11-08  3:27                           ` Timothy Pearson
2023-11-08  3:30                             ` Timothy Pearson
2023-11-08  4:00                           ` Timothy Pearson
2023-11-08 15:10                             ` Jens Axboe
2023-11-08 15:14                               ` Jens Axboe
2023-11-08 17:10                                 ` Timothy Pearson
2023-11-08 17:26                                   ` Jens Axboe
2023-11-08 17:40                                     ` Timothy Pearson
2023-11-08 17:49                                       ` Jens Axboe
2023-11-08 17:57                                         ` Jens Axboe
2023-11-08 18:36                                           ` Timothy Pearson
2023-11-08 18:51                                             ` Timothy Pearson
2023-11-08 19:08                                               ` Jens Axboe
2023-11-08 19:06                                             ` Jens Axboe
2023-11-08 22:05                                               ` Jens Axboe
2023-11-08 22:15                                                 ` Timothy Pearson
2023-11-08 22:18                                                   ` Jens Axboe
2023-11-08 22:28                                                     ` Timothy Pearson
2023-11-08 23:58                                                     ` Jens Axboe
2023-11-09 15:12                                                       ` Jens Axboe
2023-11-09 17:00                                                         ` Timothy Pearson
2023-11-09 17:17                                                           ` Jens Axboe
2023-11-09 17:24                                                             ` Timothy Pearson
2023-11-09 17:30                                                               ` Jens Axboe
2023-11-09 17:36                                                                 ` Timothy Pearson
2023-11-09 17:38                                                                   ` Jens Axboe
2023-11-09 17:42                                                                     ` Timothy Pearson
2023-11-09 17:45                                                                       ` Jens Axboe
2023-11-09 18:20                                                                         ` tpearson
2023-11-10  3:51                                                                           ` Jens Axboe
2023-11-10  4:35                                                                             ` Timothy Pearson
2023-11-10  6:48                                                                               ` Timothy Pearson
2023-11-10 14:52                                                                                 ` Jens Axboe
2023-11-11 18:42                                                                                   ` Timothy Pearson
2023-11-11 18:58                                                                                     ` Jens Axboe
2023-11-11 19:04                                                                                       ` Timothy Pearson
2023-11-11 19:11                                                                                         ` Jens Axboe
2023-11-11 19:15                                                                                           ` Timothy Pearson
2023-11-11 19:23                                                                                             ` Jens Axboe
2023-11-11 21:57                                                                                     ` Timothy Pearson
2023-11-13 17:06                                                                                       ` Timothy Pearson
2023-11-13 17:39                                                                                         ` Jens Axboe
2023-11-13 19:02                                                                                           ` Timothy Pearson
2023-11-13 19:29                                                                                             ` Jens Axboe
2023-11-13 20:58                                                                                               ` Timothy Pearson
2023-11-13 21:22                                                                                                 ` Timothy Pearson
2023-11-13 22:15                                                                                                 ` Jens Axboe [this message]
2023-11-13 23:19                                                                                                   ` Timothy Pearson
2023-11-13 23:48                                                                                                     ` Jens Axboe
2023-11-14  0:04                                                                                                       ` Timothy Pearson
2023-11-14  0:13                                                                                                         ` Jens Axboe
2023-11-14  0:52                                                                                                           ` Timothy Pearson
2023-11-14  5:06                                                                                                             ` Timothy Pearson
2023-11-14 13:17                                                                                                               ` Jens Axboe
2023-11-14 16:59                                                                                                                 ` Timothy Pearson
2023-11-14 17:04                                                                                                                   ` Jens Axboe
2023-11-14 17:14                                                                                                                     ` Timothy Pearson
2023-11-14 17:17                                                                                                                       ` Jens Axboe
2023-11-14 17:21                                                                                                                         ` Timothy Pearson
2023-11-14 17:57                                                                                                                           ` Timothy Pearson
2023-11-14 18:02                                                                                                                             ` Jens Axboe
2023-11-14 18:12                                                                                                                               ` Timothy Pearson
2023-11-14 18:26                                                                                                                                 ` Jens Axboe
2023-11-15 11:03                                                                                                                                   ` Timothy Pearson
2023-11-15 16:46                                                                                                                                     ` Jens Axboe
2023-11-15 17:03                                                                                                                                       ` Timothy Pearson
2023-11-15 18:30                                                                                                                                         ` Jens Axboe
2023-11-15 18:35                                                                                                                                           ` Timothy Pearson
2023-11-15 18:37                                                                                                                                             ` Jens Axboe
2023-11-15 18:40                                                                                                                                               ` Timothy Pearson
2023-11-15 19:00                                                                                                                                           ` Jens Axboe
2023-11-16  3:28                                                                                                                                             ` Timothy Pearson
2023-11-16  3:46                                                                                                                                               ` Jens Axboe
2023-11-16  3:54                                                                                                                                                 ` Timothy Pearson
2023-11-19  0:16                                                                                                                                                   ` Timothy Pearson
2023-11-13 20:47                                                                                         ` Jens Axboe
2023-11-13 21:08                                                                                           ` Timothy Pearson
2023-11-10 14:48                                                                               ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=483d8017-4d15-4367-bac4-0e420cb4eb54@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=asml.silence@gmail.com \
    --cc=regressions@lists.linux.dev \
    --cc=tpearson@raptorengineering.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.