From: Jens Axboe <axboe@kernel.dk>
To: Timothy Pearson <tpearson@raptorengineering.com>
Cc: regressions <regressions@lists.linux.dev>,
	Pavel Begunkov <asml.silence@gmail.com>
Subject: Re: Regression in io_uring, leading to data corruption
Date: Wed, 8 Nov 2023 12:06:53 -0700	[thread overview]
Message-ID: <2225dc79-37ec-4239-b13a-ac444bfa6a6b@kernel.dk> (raw)
In-Reply-To: <410961969.45826785.1699468561883.JavaMail.zimbra@raptorengineeringinc.com>

On 11/8/23 11:36 AM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Jens Axboe" <axboe@kernel.dk>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov" <asml.silence@gmail.com>
>> Sent: Wednesday, November 8, 2023 11:57:59 AM
>> Subject: Re: Regression in io_uring, leading to data corruption
> 
>> On 11/8/23 10:49 AM, Jens Axboe wrote:
>>> On 11/8/23 10:40 AM, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>> <asml.silence@gmail.com>
>>>>> Sent: Wednesday, November 8, 2023 11:26:53 AM
>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>
>>>>> On 11/8/23 10:10 AM, Timothy Pearson wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Jens Axboe" <axboe@kernel.dk>
>>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>>>>> Cc: "regressions" <regressions@lists.linux.dev>, "Pavel Begunkov"
>>>>>>> <asml.silence@gmail.com>
>>>>>>> Sent: Wednesday, November 8, 2023 9:14:55 AM
>>>>>>> Subject: Re: Regression in io_uring, leading to data corruption
>>>>>>
>>>>>>> On 11/8/23 8:10 AM, Jens Axboe wrote:
>>>>>>>> It could also be a task that has pending IO and is doing exec() (and
>>>>>>>> friends); this would also cancel inflight IO.
>>>>>>>
>>>>>>> If this is the case, then you could try with this one to just disable
>>>>>>> that and see if the corruption goes away:
>>>>>>
>>>>>> Unfortunately that had no effect on the corruption.  I've also traced
>>>>>> the signal generation into the get_signal() call, which is apparently
>>>>>> sending SIGKILL when the thread group is marked for termination --
>>>>>> this is in turn why the PID fields etc. are all zero.
>>>>>
>>>>> That's good news though, because I'm continually pondering why powerpc
>>>>> is different here.
>>>>>
>>>>>> Investigation continues.
>>>>>
>>>>> If it's not exec, then it has to be a signal. I'm assuming you're
>>>>> hitting this in get_signal():
>>>>>
>>>>> 		/* Has this task already been marked for death? */
>>>>> 		if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>>>>> 		     signal->group_exec_task) {
>>>>> 			clear_siginfo(&ksig->info);
>>>>> 			ksig->info.si_signo = signr = SIGKILL;
>>>>> 			sigdelset(&current->pending.signal, SIGKILL);
>>>>> 			trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>>>>> 				&sighand->action[SIGKILL - 1]);
>>>>> 			recalc_sigpending();
>>>>> 			goto fatal;
>>>>> 		}
>>>>>
>>>>> which is triggered either by exec (which we verified it is not) or by
>>>>> a group exit, so I don't see anything other than this being a signal
>>>>> sent to mtr/mariadb for exit.
>>>>>
>>>>> Does this trigger? Doesn't necessarily indicate a bug as it would be
>>>>> valid, but if it does trigger, perhaps io-wq has unstarted requests at
>>>>> this point and they get canceled and hence never written. If this does
>>>>> trigger, maybe try and do your sleep trick there too and see if that
>>>>> gets rid of it.
>>>>
>>>> Yes, it does indeed trigger.  Is there a way to directly check for the
>>>> unstarted requests?
>>>
>>> Let me hack up a debug patch for this, give me a minute.
>>
>> This should do it - whenever this condition hits, you should see
>> something ala:
>>
>> [   97.960877] io_wq_dump: work_items=0, cur=0, next=0
>>
>> in dmesg. work_items is the number of work items we found that haven't
>> been scheduled yet, cur is what a worker is currently processing, and
>> next is basically a way for cancel to find a work item before it gets
>> assigned. work_items and next may get canceled; cur should always
>> finish for storage IO, since signals don't interrupt it.
>>
>>
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index 522196dfb0ff..643b8e9de518 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -553,6 +553,8 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
>> 	struct io_wq *wq = worker->wq;
>> 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
>>
>> +	WARN_ON_ONCE(do_kill);
>> +
>> 	do {
>> 		struct io_wq_work *work;
>>
>> @@ -889,6 +891,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>> static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
>> {
>> 	do {
>> +		WARN_ON_ONCE(1);
>> 		work->flags |= IO_WQ_WORK_CANCEL;
>> 		wq->do_work(work);
>> 		work = wq->free_work(work);
>> @@ -934,6 +937,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
>> 	 */
>> 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
>> 	    (work->flags & IO_WQ_WORK_CANCEL)) {
>> +		WARN_ON_ONCE(1);
>> 		io_run_cancel(work, wq);
>> 		return;
>> 	}
>> @@ -1369,6 +1373,54 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
>> 	return 0;
>> }
>>
>> +struct worker_lookup {
>> +	int cur_work;
>> +	int next_work;
>> +};
>> +
>> +static bool io_wq_worker_lookup(struct io_worker *worker, void *data)
>> +{
>> +	struct worker_lookup *l = data;
>> +
>> +	raw_spin_lock(&worker->lock);
>> +	if (worker->cur_work)
>> +		l->cur_work++;
>> +	if (worker->next_work)
>> +		l->next_work++;
>> +	raw_spin_unlock(&worker->lock);
>> +	return false;
>> +}
>> +
>> +void io_wq_dump(struct io_uring_task *tctx)
>> +{
>> +	struct io_wq_work_node *node, *prev;
>> +	struct io_wq *wq = tctx->io_wq;
>> +	struct worker_lookup l = { };
>> +	int i, work_items;
>> +
>> +	if (!wq) {
>> +		printk("%s: no wq\n", __FUNCTION__);
>> +		return;
>> +	}
>> +
>> +	work_items = 0;
>> +	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
>> +		struct io_wq_acct *acct = io_get_acct(wq, i == 0);
>> +
>> +		raw_spin_lock(&acct->lock);
>> +		wq_list_for_each(node, prev, &acct->work_list)
>> +			work_items++;
>> +		raw_spin_unlock(&acct->lock);
>> +	}
>> +
>> +	rcu_read_lock();
>> +	io_wq_for_each_worker(wq, io_wq_worker_lookup, &l);
>> +	rcu_read_unlock();
>> +
>> +	printk("%s: work_items=%d, cur=%d, next=%d\n", __FUNCTION__, work_items,
>> +							l.cur_work, l.next_work);
>> +}
>> +
>> static __init int io_wq_init(void)
>> {
>> 	int ret;
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index ed254076c723..c0bd35e5429a 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1943,7 +1943,9 @@ void io_wq_submit_work(struct io_wq_work *work)
>>
>> 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
>> 	if (work->flags & IO_WQ_WORK_CANCEL) {
>> +		WARN_ON_ONCE(1);
>> fail:
>> +		WARN_ON_ONCE(1);
>> 		io_req_task_queue_fail(req, err);
>> 		return;
>> 	}
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index ee9f43bed49a..250ae820340c 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -988,6 +988,8 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> 	do_exit((error_code&0xff)<<8);
>> }
>>
>> +void io_wq_dump(struct io_uring_task *);
>> +
>> /*
>>  * Take down every thread in the group.  This is called by fatal signals
>>  * as well as by sys_exit_group (below).
>> @@ -1011,6 +1013,9 @@ do_group_exit(int exit_code)
>> 		else if (sig->group_exec_task)
>> 			exit_code = 0;
>> 		else {
>> +			if (!strncmp(current->comm, "mariadbd", 8) &&
>> +			    current->io_uring)
>> +				io_wq_dump(current->io_uring);
>> 			sig->group_exit_code = exit_code;
>> 			sig->flags = SIGNAL_GROUP_EXIT;
>> 			zap_other_threads(current);
> 
> Unfortunately it's only returning work_items=0, cur=0, next=0, so that
> was a bit of a red herring.

Well that's probably a good thing, as it also didn't make a lot of sense
:-)

> I have been giving some thought to the CPU pinning of the workers, and
> one thing that may have been overlooked is that this could potentially
> force-serialize worker operations.  Did you just have to pin the io
> workers or did the workqueue also need to be pinned for the corruption
> to disappear?

Not sure I follow; the workers ARE the workqueue. For the pinning, I
just made sure that the workers are on the same CPU. I honestly don't
remember all the details there outside of what I can read back from the
emails I sent, as it's been a while. My suspicion back then was that it
was some weird ppc cache aliasing effect, with the copy into kernel
memory happening on CPU X and the request then immediately being punted
to CPU Y for processing.

I didn't see the corruption if I forced the requests to complete inline
(i.e. always on the submitting CPU, with no punt to io-wq), and I didn't
see it if I ensured that the io-wq worker was running on the same CPU as
the submitter.
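
For reference, a rough userspace approximation of that same-CPU setup is
sketched below, using liburing's io_uring_register_iowq_aff() (this assumes
liburing 2.1+ and a kernel with IORING_REGISTER_IOWQ_AFF support). It's not
necessarily the exact hack from back then, it just illustrates pinning the
submitter and the ring's io-wq workers to one CPU:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	cpu_set_t mask;
	int ret;

	/* pin the submitting thread to CPU 0 */
	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* restrict this ring's io-wq workers to the same CPU */
	ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
	if (ret < 0)
		fprintf(stderr, "register_iowq_aff: %d\n", ret);

	/* submit reads/writes as usual; punted work now also runs on CPU 0 */

	io_uring_queue_exit(&ring);
	return 0;
}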

-- 
Jens Axboe

