linux-block.vger.kernel.org archive mirror
* [BUG][RFC] Miscalculated inflight counter in io_uring
@ 2019-10-25  9:48 Pavel Begunkov
  2019-10-25 15:32 ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Pavel Begunkov @ 2019-10-25  9:48 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-kernel



In case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:

The @inflight count returned by io_submit_sqes() is the number of
entries picked up from the sqring, including those that have already
completed or failed. Only successfully submitted (but not failed)
requests are placed into @poll_list.

io_sq_thread() then tries to poll for @inflight events, even though the
failed ones never appear in @poll_list. Thus it will think there is
always something to poll (i.e. @inflight > 0).

There are several issues with this:
1. io_sq_thread() won't ever sleep
2. io_sq_thread() may be left running and actively polling even after
the user process is destroyed
3. the same goes for the mm_struct with all the vmas of the user process

TL;DR: while @inflight > 0, io_sq_thread() won't put @cur_mm, thereby
blocking recycling of the vmas used for the rings' mappings, which hold
a refcount on io_uring's struct file. Thus io_uring_release() won't be
called, and neither will kthread_{park,stop}(). All of this applies as
long as the user process hasn't unmapped the rings.


I'm not sure what the best fix is:
1. try to put failed requests into @poll_list as well (grabbing the mutex).

2. test for the zero-inflight case by comparing sq and cq, something like:
```
if (nr_polled == 0) {
	lock(comp_lock);
	if (cached_cq_head == cached_sq_tail)
		inflight = 0;
	unlock(comp_lock);
}
```
But that adds extra spinlock locking to the fast path. And it's unsafe
to use the non-cached heads/tails, as they could be maliciously changed
by userspace.

3. Count failed requests (probably needs atomics or other synchronisation).

4. something else?


-- 
Yours sincerely,
Pavel Begunkov



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG][RFC] Miscalculated inflight counter in io_uring
  2019-10-25  9:48 [BUG][RFC] Miscalculated inflight counter in io_uring Pavel Begunkov
@ 2019-10-25 15:32 ` Jens Axboe
  2019-10-25 16:08   ` Pavel Begunkov
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2019-10-25 15:32 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, linux-kernel

On 10/25/19 3:48 AM, Pavel Begunkov wrote:
> In case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
> 
> The @inflight count returned by io_submit_sqes() is the number of
> entries picked up from the sqring, including those that have already
> completed or failed. Only successfully submitted (but not failed)
> requests are placed into @poll_list.
> 
> io_sq_thread() then tries to poll for @inflight events, even though the
> failed ones never appear in @poll_list. Thus it will think there is
> always something to poll (i.e. @inflight > 0).
> 
> There are several issues with this:
> 1. io_sq_thread() won't ever sleep
> 2. io_sq_thread() may be left running and actively polling even after
> the user process is destroyed
> 3. the same goes for the mm_struct with all the vmas of the user process
> 
> TL;DR: while @inflight > 0, io_sq_thread() won't put @cur_mm, thereby
> blocking recycling of the vmas used for the rings' mappings, which hold
> a refcount on io_uring's struct file. Thus io_uring_release() won't be
> called, and neither will kthread_{park,stop}(). All of this applies as
> long as the user process hasn't unmapped the rings.
> 
> 
> I'm not sure what the best fix is:
> 1. try to put failed requests into @poll_list as well (grabbing the mutex).
> 
> 2. test for the zero-inflight case by comparing sq and cq, something like:
> ```
> if (nr_polled == 0) {
> 	lock(comp_lock);
> 	if (cached_cq_head == cached_sq_tail)
> 		inflight = 0;
> 	unlock(comp_lock);
> }
> ```
> But that adds extra spinlock locking to the fast path. And it's unsafe
> to use the non-cached heads/tails, as they could be maliciously changed
> by userspace.
> 
> 3. Count failed requests (probably needs atomics or other synchronisation).
> 
> 4. something else?

Can we just look at the completion count? Ala:

prev_tail = ctx->cached_cq_tail;
inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL, mm_fault);
if (prev_tail != ctx->cached_cq_tail)
	inflight -= (ctx->cached_cq_tail - prev_tail);

or something like that.

-- 
Jens Axboe



* Re: [BUG][RFC] Miscalculated inflight counter in io_uring
  2019-10-25 15:32 ` Jens Axboe
@ 2019-10-25 16:08   ` Pavel Begunkov
  2019-10-25 16:18     ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Pavel Begunkov @ 2019-10-25 16:08 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-kernel



On 25/10/2019 18:32, Jens Axboe wrote:
> On 10/25/19 3:48 AM, Pavel Begunkov wrote:
>> In case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>> [...]
> 
> Can we just look at the completion count? Ala:
> 
> prev_tail = ctx->cached_cq_tail;
> inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL, mm_fault);
> if (prev_tail != ctx->cached_cq_tail)
> 	inflight -= (ctx->cached_cq_tail - prev_tail);
> 
> or something like that.
> 

I don't think so:
1. @cached_cq_tail is protected by @completion_lock (right?). You never
know what happens when you violate the memory model.
2. if something is successfully completed by that time, we again get
the wrong number.

Basically, the invariant is
inflight = (cached_sq_head - cached_cq_tail) + len(poll_list)
maybe you can figure something out from this.

idea 1:
How about counting failed events and subtracting them?
But as requests may fail asynchronously, this needs synchronisation,
e.g. atomic_add() on failure (the slow path)
and atomic_read() in the kthread (the fast path).


BTW, I tested the patch below earlier; it fixes the issue, but it is
racy for the same reason as in 1.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 32f6598ecae9..0353d374a0d5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2650,6 +2650,10 @@ static int io_sq_thread(void *data)
 		bool mm_fault = false;
 		unsigned int to_submit;
 
+		if (ctx->cached_sq_head == ctx->cached_cq_tail +
+			ctx->rings->sq_dropped)
+			inflight = 0;
+
 		if (inflight) {
 			unsigned nr_events = 0;
 


-- 
Yours sincerely,
Pavel Begunkov




* Re: [BUG][RFC] Miscalculated inflight counter in io_uring
  2019-10-25 16:08   ` Pavel Begunkov
@ 2019-10-25 16:18     ` Jens Axboe
  0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2019-10-25 16:18 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, linux-kernel

On 10/25/19 10:08 AM, Pavel Begunkov wrote:
> On 25/10/2019 18:32, Jens Axboe wrote:
>> On 10/25/19 3:48 AM, Pavel Begunkov wrote:
>>> In case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>>> [...]
>>
>> Can we just look at the completion count? Ala:
>>
>> prev_tail = ctx->cached_cq_tail;
>> inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL, mm_fault);
>> if (prev_tail != ctx->cached_cq_tail)
>> 	inflight -= (ctx->cached_cq_tail - prev_tail);
>>
>> or something like that.
>>
> 
> I don't think so:
> 1. @cached_cq_tail is protected by @completion_lock (right?). You never
> know what happens when you violate the memory model.
> 2. if something is successfully completed by that time, we again get
> the wrong number.
> 
> Basically, the invariant is
> inflight = (cached_sq_head - cached_cq_tail) + len(poll_list)
> maybe you can figure something out from this.
> 
> idea 1:
> How about counting failed events and subtracting them?
> But as requests may fail asynchronously, this needs synchronisation,
> e.g. atomic_add() on failure (the slow path)
> and atomic_read() in the kthread (the fast path)

How about we just go way simpler? We only really care about whether we
have any inflight requests to poll, not the exact count. So if we check
the poll_list and it's empty, we just reset inflight. That should handle
it without any extra weird accounting or racy logic.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9c128df3d675..afc463a06fdc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -874,19 +874,12 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx)
 	mutex_unlock(&ctx->uring_lock);
 }
 
-static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
-			   long min)
+static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			    long min)
 {
-	int iters, ret = 0;
-
-	/*
-	 * We disallow the app entering submit/complete with polling, but we
-	 * still need to lock the ring to prevent racing with polled issue
-	 * that got punted to a workqueue.
-	 */
-	mutex_lock(&ctx->uring_lock);
+	int iters, ret;
 
-	iters = 0;
+	ret = iters = 0;
 	do {
 		int tmin = 0;
 
@@ -922,6 +915,21 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
 		ret = 0;
 	} while (min && !*nr_events && !need_resched());
 
+	return ret;
+}
+
+static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			   long min)
+{
+	int ret;
+
+	/*
+	 * We disallow the app entering submit/complete with polling, but we
+	 * still need to lock the ring to prevent racing with polled issue
+	 * that got punted to a workqueue.
+	 */
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_iopoll_check(ctx, nr_events, min);
 	mutex_unlock(&ctx->uring_lock);
 	return ret;
 }
@@ -2658,7 +2666,12 @@ static int io_sq_thread(void *data)
 			unsigned nr_events = 0;
 
 			if (ctx->flags & IORING_SETUP_IOPOLL) {
-				io_iopoll_check(ctx, &nr_events, 0);
+				mutex_lock(&ctx->uring_lock);
+				if (!list_empty(&ctx->poll_list))
+					__io_iopoll_check(ctx, &nr_events, 0);
+				else
+					inflight = 0;
+				mutex_unlock(&ctx->uring_lock);
 			} else {
 				/*
 				 * Normal IO, just pretend everything completed.

-- 
Jens Axboe


