io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] io_uring: fix possible deadlock in io_uring_poll
@ 2021-02-02 19:52 Hao Xu
  2021-02-03  0:04 ` Pavel Begunkov
  0 siblings, 1 reply; 6+ messages in thread
From: Hao Xu @ 2021-02-02 19:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Abaci reported follow issue:

[   30.615891] ======================================================
[   30.616648] WARNING: possible circular locking dependency detected
[   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[   30.618035] ------------------------------------------------------
[   30.618914] a.out/1128 is trying to acquire lock:
[   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[   30.620505]
[   30.620505] but task is already holding lock:
[   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.622349]
[   30.622349] which lock already depends on the new lock.
[   30.622349]
[   30.623289]
[   30.623289] the existing dependency chain (in reverse order) is:
[   30.624243]
[   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[   30.625263]        lock_acquire+0x2c7/0x390
[   30.625868]        __mutex_lock+0xae/0x9f0
[   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[   30.627278]        io_uring_poll+0xcb/0xd0
[   30.627890]        ep_item_poll.isra.14+0x4e/0x90
[   30.628531]        do_epoll_ctl+0xb7e/0x1120
[   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[   30.629770]        do_syscall_64+0x2d/0x40
[   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.631187]
[   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[   30.631985]        check_prevs_add+0x226/0xb00
[   30.632584]        __lock_acquire+0x1237/0x13a0
[   30.633207]        lock_acquire+0x2c7/0x390
[   30.633740]        __mutex_lock+0xae/0x9f0
[   30.634258]        __ep_eventpoll_poll+0x9f/0x220
[   30.634879]        __io_arm_poll_handler+0xbf/0x220
[   30.635462]        io_issue_sqe+0xa6b/0x13e0
[   30.635982]        __io_queue_sqe+0x10b/0x550
[   30.636648]        io_queue_sqe+0x235/0x470
[   30.637281]        io_submit_sqes+0xcce/0xf10
[   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.638465]        do_syscall_64+0x2d/0x40
[   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.639643]
[   30.639643] other info that might help us debug this:
[   30.639643]
[   30.640618]  Possible unsafe locking scenario:
[   30.640618]
[   30.641402]        CPU0                    CPU1
[   30.641938]        ----                    ----
[   30.642664]   lock(&ctx->uring_lock);
[   30.643425]                                lock(&ep->mtx);
[   30.644498]                                lock(&ctx->uring_lock);
[   30.645668]   lock(&ep->mtx);
[   30.646321]
[   30.646321]  *** DEADLOCK ***
[   30.646321]
[   30.647642] 1 lock held by a.out/1128:
[   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.649954]
[   30.649954] stack backtrace:
[   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.652290] Call Trace:
[   30.652688]  dump_stack+0xac/0xe3
[   30.653164]  check_noncircular+0x11e/0x130
[   30.653747]  ? check_prevs_add+0x226/0xb00
[   30.654303]  check_prevs_add+0x226/0xb00
[   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[   30.655564]  __lock_acquire+0x1237/0x13a0
[   30.656262]  lock_acquire+0x2c7/0x390
[   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.658014]  __mutex_lock+0xae/0x9f0
[   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.659112]  ? mark_held_locks+0x5a/0x80
[   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[   30.660885]  ? trace_hardirqs_on+0x46/0x110
[   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.662696]  __ep_eventpoll_poll+0x9f/0x220
[   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[   30.663875]  __io_arm_poll_handler+0xbf/0x220
[   30.664463]  io_issue_sqe+0xa6b/0x13e0
[   30.664984]  ? __lock_acquire+0x782/0x13a0
[   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.666170]  ? __io_queue_sqe+0x10b/0x550
[   30.666725]  __io_queue_sqe+0x10b/0x550
[   30.667252]  ? __fget_files+0x131/0x260
[   30.667791]  ? io_req_prep+0xd8/0x1090
[   30.668316]  ? io_queue_sqe+0x235/0x470
[   30.668868]  io_queue_sqe+0x235/0x470
[   30.669398]  io_submit_sqes+0xcce/0xf10
[   30.669931]  ? xa_load+0xe4/0x1c0
[   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[   30.672380]  do_syscall_64+0x2d/0x40
[   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.673503] RIP: 0033:0x7fd89c813239
[   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000

This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list when we fail to get the uring
lock. This leads to less accuracy, but is still ok.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---

Here I use mutex_trylock() to fix this issue, but this causes loss of
accuracy. I think doing cqring overflow flush in a task work maybe a
better solution. I'm think of this. Any thoughts?

 fs/io_uring.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 38c6cbe1ab38..866e45d42ac7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8718,7 +8718,36 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	smp_rmb();
 	if (!io_sqring_full(ctx))
 		mask |= EPOLLOUT | EPOLLWRNORM;
-	io_cqring_overflow_flush(ctx, false, NULL, NULL);
+
+	if (test_bit(0, &ctx->cq_check_overflow)) {
+		bool should_flush = true;
+		/* iopoll syncs against uring_lock, not completion_lock */
+		if (ctx->flags & IORING_SETUP_IOPOLL) {
+			/*
+			 * avoid ABBA deadlock.
+			 * there could be contention like below:
+			 *      CPU0                    CPU1
+			 *      ----                    ----
+			 * lock(&ctx->uring_lock);
+			 *                              lock(&ep->mtx);
+			 *                              lock(&ctx->uring_lock);
+			 * lock(&ep->mtx);
+			 *
+			 * this might happen if we do epoll_wait on a uring fd while
+			 * read/write the former epoll fd in a sqe in the former uring
+			 * instance.
+			 * We don't flush cqring overflow list when we fail to get the
+			 * uring lock. This leads to less accuracy, but is still ok.
+			 */
+			should_flush = mutex_trylock(&ctx->uring_lock);
+		}
+		if (should_flush) {
+			__io_cqring_overflow_flush(ctx, false, NULL, NULL);
+			if (ctx->flags & IORING_SETUP_IOPOLL)
+				mutex_unlock(&ctx->uring_lock);
+		}
+	}
+
 	if (io_cqring_events(ctx))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] io_uring: fix possible deadlock in io_uring_poll
  2021-02-02 19:52 [PATCH] io_uring: fix possible deadlock in io_uring_poll Hao Xu
@ 2021-02-03  0:04 ` Pavel Begunkov
  2021-02-03  1:48   ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Pavel Begunkov @ 2021-02-03  0:04 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 02/02/2021 19:52, Hao Xu wrote:
> This might happen if we do epoll_wait on a uring fd while reading/writing
> the former epoll fd in a sqe in the former uring instance.
> So let's don't flush cqring overflow list when we fail to get the uring
> lock. This leads to less accuracy, but is still ok.

if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
        mask |= EPOLLIN | EPOLLRDNORM;

Instead of flushing. It'd make sense if we define poll as "there might
be something, go do your peek/wait with overflow checks". Jens, is that
documented anywhere?

> 
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
> 
> Here I use mutex_trylock() to fix this issue, but this causes loss of
> accuracy. I think doing cqring overflow flush in a task work maybe a
> better solution. I'm think of this. Any thoughts?
> 
>  fs/io_uring.c | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 38c6cbe1ab38..866e45d42ac7 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8718,7 +8718,36 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
>  	smp_rmb();
>  	if (!io_sqring_full(ctx))
>  		mask |= EPOLLOUT | EPOLLWRNORM;
> -	io_cqring_overflow_flush(ctx, false, NULL, NULL);
> +
> +	if (test_bit(0, &ctx->cq_check_overflow)) {
> +		bool should_flush = true;
> +		/* iopoll syncs against uring_lock, not completion_lock */
> +		if (ctx->flags & IORING_SETUP_IOPOLL) {
> +			/*
> +			 * avoid ABBA deadlock.
> +			 * there could be contention like below:
> +			 *      CPU0                    CPU1
> +			 *      ----                    ----
> +			 * lock(&ctx->uring_lock);
> +			 *                              lock(&ep->mtx);
> +			 *                              lock(&ctx->uring_lock);
> +			 * lock(&ep->mtx);
> +			 *
> +			 * this might happen if we do epoll_wait on a uring fd while
> +			 * read/write the former epoll fd in a sqe in the former uring
> +			 * instance.
> +			 * We don't flush cqring overflow list when we fail to get the
> +			 * uring lock. This leads to less accuracy, but is still ok.
> +			 */
> +			should_flush = mutex_trylock(&ctx->uring_lock);
> +		}
> +		if (should_flush) {
> +			__io_cqring_overflow_flush(ctx, false, NULL, NULL);
> +			if (ctx->flags & IORING_SETUP_IOPOLL)
> +				mutex_unlock(&ctx->uring_lock);
> +		}
> +	}
> +
>  	if (io_cqring_events(ctx))
>  		mask |= EPOLLIN | EPOLLRDNORM;
>  
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] io_uring: fix possible deadlock in io_uring_poll
  2021-02-03  0:04 ` Pavel Begunkov
@ 2021-02-03  1:48   ` Jens Axboe
  2021-02-03 16:48     ` Pavel Begunkov
  2021-02-04 16:48     ` Pavel Begunkov
  0 siblings, 2 replies; 6+ messages in thread
From: Jens Axboe @ 2021-02-03  1:48 UTC (permalink / raw)
  To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi

On 2/2/21 5:04 PM, Pavel Begunkov wrote:
> On 02/02/2021 19:52, Hao Xu wrote:
>> This might happen if we do epoll_wait on a uring fd while reading/writing
>> the former epoll fd in a sqe in the former uring instance.
>> So let's don't flush cqring overflow list when we fail to get the uring
>> lock. This leads to less accuracy, but is still ok.
> 
> if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
>         mask |= EPOLLIN | EPOLLRDNORM;
> 
> Instead of flushing. It'd make sense if we define poll as "there might
> be something, go do your peek/wait with overflow checks". Jens, is that
> documented anywhere?

Nope - I actually think that the approach chosen here is pretty good,
it'll force the app to actually check and hence do what it needs to do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] io_uring: fix possible deadlock in io_uring_poll
  2021-02-03  1:48   ` Jens Axboe
@ 2021-02-03 16:48     ` Pavel Begunkov
  2021-02-04 16:48     ` Pavel Begunkov
  1 sibling, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2021-02-03 16:48 UTC (permalink / raw)
  To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi

On 03/02/2021 01:48, Jens Axboe wrote:
> On 2/2/21 5:04 PM, Pavel Begunkov wrote:
>> On 02/02/2021 19:52, Hao Xu wrote:
>>> This might happen if we do epoll_wait on a uring fd while reading/writing
>>> the former epoll fd in a sqe in the former uring instance.
>>> So let's don't flush cqring overflow list when we fail to get the uring
>>> lock. This leads to less accuracy, but is still ok.
>>
>> if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
>>         mask |= EPOLLIN | EPOLLRDNORM;
>>
>> Instead of flushing. It'd make sense if we define poll as "there might
>> be something, go do your peek/wait with overflow checks". Jens, is that
>> documented anywhere?
> 
> Nope - I actually think that the approach chosen here is pretty good,
> it'll force the app to actually check and hence do what it needs to do.

Do you mean that good is this 2-liner? I pretty like it... unless there
is a userspace app that would be broken with the change...

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] io_uring: fix possible deadlock in io_uring_poll
  2021-02-03  1:48   ` Jens Axboe
  2021-02-03 16:48     ` Pavel Begunkov
@ 2021-02-04 16:48     ` Pavel Begunkov
  2021-02-04 16:54       ` Hao Xu
  1 sibling, 1 reply; 6+ messages in thread
From: Pavel Begunkov @ 2021-02-04 16:48 UTC (permalink / raw)
  To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi

On 03/02/2021 01:48, Jens Axboe wrote:
> On 2/2/21 5:04 PM, Pavel Begunkov wrote:
>> On 02/02/2021 19:52, Hao Xu wrote:
>>> This might happen if we do epoll_wait on a uring fd while reading/writing
>>> the former epoll fd in a sqe in the former uring instance.
>>> So let's don't flush cqring overflow list when we fail to get the uring
>>> lock. This leads to less accuracy, but is still ok.
>>
>> if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
>>         mask |= EPOLLIN | EPOLLRDNORM;
>>
>> Instead of flushing. It'd make sense if we define poll as "there might
>> be something, go do your peek/wait with overflow checks". Jens, is that
>> documented anywhere?
> 
> Nope - I actually think that the approach chosen here is pretty good,
> it'll force the app to actually check and hence do what it needs to do.

Ok, seems we agree on that.

Hao, can you send an updated patch?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] io_uring: fix possible deadlock in io_uring_poll
  2021-02-04 16:48     ` Pavel Begunkov
@ 2021-02-04 16:54       ` Hao Xu
  0 siblings, 0 replies; 6+ messages in thread
From: Hao Xu @ 2021-02-04 16:54 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/5 上午12:48, Pavel Begunkov 写道:
> On 03/02/2021 01:48, Jens Axboe wrote:
>> On 2/2/21 5:04 PM, Pavel Begunkov wrote:
>>> On 02/02/2021 19:52, Hao Xu wrote:
>>>> This might happen if we do epoll_wait on a uring fd while reading/writing
>>>> the former epoll fd in a sqe in the former uring instance.
>>>> So let's don't flush cqring overflow list when we fail to get the uring
>>>> lock. This leads to less accuracy, but is still ok.
>>>
>>> if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
>>>          mask |= EPOLLIN | EPOLLRDNORM;
>>>
>>> Instead of flushing. It'd make sense if we define poll as "there might
>>> be something, go do your peek/wait with overflow checks". Jens, is that
>>> documented anywhere?
>>
>> Nope - I actually think that the approach chosen here is pretty good,
>> it'll force the app to actually check and hence do what it needs to do.
> 
> Ok, seems we agree on that.
> 
> Hao, can you send an updated patch?
> 
Sure, will send v2 soon.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-02-04 16:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-02 19:52 [PATCH] io_uring: fix possible deadlock in io_uring_poll Hao Xu
2021-02-03  0:04 ` Pavel Begunkov
2021-02-03  1:48   ` Jens Axboe
2021-02-03 16:48     ` Pavel Begunkov
2021-02-04 16:48     ` Pavel Begunkov
2021-02-04 16:54       ` Hao Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).