From: Jens Axboe <axboe@kernel.dk>
To: Timothy Pearson <tpearson@raptorengineering.com>
Cc: regressions <regressions@lists.linux.dev>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	npiggin <npiggin@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Wed, 15 Nov 2023 12:00:53 -0700	[thread overview]
Message-ID: <bba85bd7-6f09-4d78-b33b-caa6a6dcb29c@kernel.dk> (raw)
In-Reply-To: <4e171c09-bb9a-422f-b92c-230bb1238b22@kernel.dk>

On 11/15/23 11:30 AM, Jens Axboe wrote:
> On 11/15/23 10:03 AM, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jens Axboe" <axboe@kernel.dk>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
>>> Sent: Wednesday, November 15, 2023 10:46:58 AM
>>> Subject: Re: Regression in io_uring, leading to data corruption
>>
>>> On 11/15/23 4:03 AM, Timothy Pearson wrote:
>>>> I haven't had much success in getting the IPI path to work properly,
>>>> but by leaving task_work_add() in TWA_SIGNAL_NO_IPI mode I was at
>>>> least able to narrow down one of the areas that's going wrong.  Bear
>>>> in mind that as soon as I reenable IPI the corruption returns with a
>>>> vengeance, so this is not the correct fix yet by any means -- I am
>>>> currently soliciting feedback on what else might be going wrong at
>>>> this point since I've already spent a couple of weeks on this and am
>>>> not sure how much more time I can spend before we just have to shut
>>>> io_uring down on ppc64 for the foreseeable future.
>>>>
>>>> Whatever the root cause actually is, something is *very* sensitive to
>>>> timing in both the worker thread creation path and the io_queue_sqe()
>>>> / io_queue_async() paths.  I can make the corruption disappear by
>>>> adding a udelay(1000) before io_queue_async() in the io_queue_sqe()
>>>> function, however no amount of memory barriers in the io_queue_async()
>>>> path (including in the kbuf recycling code) will fully resolve the
>>>> problem.
>>>>
>>>> Jens, would a small delay like that in io_queue_sqe() reduce the
>>>> number of workers being created overall?  I know with some of the
>>>> other delay locations worker allocation was changing, from what I see
>>>> this one wouldn't seem to have much effect, but I'm still looking for
>>>> a sanity check.  If we're needing to wait for a millisecond for some
>>>> other thread to complete before moving on that might be valuable
>>>> information -- would also potentially tie in to the IPI path still
>>>> malfunctioning as the worker would immediately start executing.
>>>
>>> If io_queue_sqe() ultimately ends up punting to io-wq for this request,
>>> then yes, a 1ms delay in there would ultimately lead to a 1ms delay
>>> before we either pass the work to an existing worker or create a new one.
>>>
>>>> On a related note, how is inter-thread safety of the io_kiocb buffer
>>>> list guaranteed, especially on weak memory model systems?  As I
>>>> understand it, different workers running on different cores could
>>>> potentially be interacting with the same kiocb request and the same
>>>> buffer list, and that does dovetail with the fact that punting to a
>>>> different I/O worker (usually on another core) seems to provoke the
>>>> problem.  I tried adding memory barriers to some of the basic recycle
>>>> functions without too much success -- it seemed to help somewhat, but
>>>> nowhere near complete resolution, and the buffers are used in a number
>>>> of other places I didn't even try to poke at.  I wanted to get some
>>>> feedback on this concept before going down yet another rabbit hole...
>>>
>>> This relies on the fact that we grab the wq lock before inserting this
>>> work, and the unlocking will be a barrier. It's important to note that
>>> this isn't any different from before io-wq was using native
>>> workers; the only difference is that it used to be kthreads before, and
>>> now it's threads native to the application. The kthreads did a bunch of
>>> work to assume the necessary identity to do the read or write operation
>>> (which is ultimately why that approach went away, as it was just
>>> inherently unsafe), whereas the native threads do not as they already
>>> have what they need.
>>>
>>> I had a patch that just punted to a kthread and did the necessary
>>> kthread_use_mm(), perform op, kthread_unuse_mm() and it works fine at
>>> that point. Within the existing code...
>>
>> Would you happen to have that patch still?  It would provide a
>> possible starting point for figuring out the exact difference.  If not
>> I guess I could hack something similar up.
> 
> Let me see if I can find it, and make sure it applies on the current
> tree. I'll send you one in a bit.

Wrote a new one. This one has two different ways it can work:

1) By default, it still uses the native io workers, but rather than
adding the work to a list of pending items, it creates a new worker for
each work item. This means every write that would've gone to io-wq will
now just fork a native worker and perform the write in a blocking
fashion.

2) The fallback path for the above is that we punt it to a kthread,
which does the mm dance. This is similar to what we did before the
native workers. The fallback path is only hit if the worker creation
fails, but you can make it happen every time by just uncommenting that
return 1 in io_rewrite_io_thread().

The interesting thing about approach 1 is that while it still uses the
native workers, it does not need to process task_work, and hence needs
no signaling with TWA_SIGNAL or TWA_SIGNAL_NO_IPI to get the work done.
It simply forks off a new worker every time, which does the work, then
exits.

First try it as-is and see if that reproduces the issue. If it does,
then try uncommenting that return 1 mentioned in #2 above.


diff --git a/io_uring/rw.c b/io_uring/rw.c
index 64390d4e20c1..77e408bdb169 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -968,6 +968,8 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static int io_rewrite_queue(struct io_kiocb *req);
+
 int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -1071,7 +1073,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return ret ? ret : -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 done:
 		ret = kiocb_done(req, ret2, issue_flags);
@@ -1082,7 +1086,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (!ret) {
 			if (kiocb->ki_flags & IOCB_WRITE)
 				io_req_end_write(req);
-			return -EAGAIN;
+			if (io_rewrite_queue(req))
+				return -EAGAIN;
+			return IOU_ISSUE_SKIP_COMPLETE;
 		}
 		return ret;
 	}
@@ -1092,6 +1098,79 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+struct koffload {
+	struct work_struct work;
+	struct io_kiocb *req;
+	struct mm_struct *mm;
+};
+
+static void io_rewrite(struct work_struct *work)
+{
+	struct koffload *k = container_of(work, struct koffload, work);
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	kthread_use_mm(k->mm);
+	ret = io_write(k->req, issue_flags);
+	kthread_unuse_mm(k->mm);
+	mmput(k->mm);
+
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(k->req, issue_flags);
+	kfree(k);
+}
+
+static int io_write_io_thread(void *data)
+{
+	struct io_kiocb *req = data;
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+	int ret;
+
+	ret = io_write(req, issue_flags);
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_req_complete_post(req, issue_flags);
+
+	do_exit(0);
+}
+
+static int io_rewrite_io_thread(struct io_kiocb *req)
+{
+	struct task_struct *tsk;
+
+	/*
+	 * Uncomment this one to ALWAYS punt to a kthread
+	 */
+	// return 1;
+
+	tsk = create_io_thread(io_write_io_thread, req, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wake_up_new_task(tsk);
+		return 0;
+	}
+
+	printk("%s: err=%ld\n", __func__, PTR_ERR(tsk));
+	return 1;
+}
+
+static int io_rewrite_queue(struct io_kiocb *req)
+{
+	struct koffload *k;
+
+	if (!io_rewrite_io_thread(req))
+		return 0;
+
+	k = kmalloc(sizeof(*k), GFP_NOIO);
+	if (!k)
+		return 1;
+
+	INIT_WORK(&k->work, io_rewrite);
+	k->req = req;
+	mmget(current->mm);
+	k->mm = current->mm;
+	queue_work(system_wq, &k->work);
+	return 0;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;

-- 
Jens Axboe

