All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] fix deadlock in __io_req_task_submit()
@ 2021-02-03 14:57 Hao Xu
  2021-02-03 14:57 ` [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister() Hao Xu
  2021-02-03 14:57 ` [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work* Hao Xu
  0 siblings, 2 replies; 20+ messages in thread
From: Hao Xu @ 2021-02-03 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

This patchset is to fix a deadlock issue in __io_req_task_submit(), the
first patch is a prep one and the second patch does the work.

Hao Xu (2):
  io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  io_uring: don't hold uring_lock when calling io_run_task_work*

 fs/io_uring.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  2021-02-03 14:57 [PATCH 0/2] fix deadlock in __io_req_task_submit() Hao Xu
@ 2021-02-03 14:57 ` Hao Xu
  2021-02-03 16:33   ` Pavel Begunkov
  2021-02-03 14:57 ` [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work* Hao Xu
  1 sibling, 1 reply; 20+ messages in thread
From: Hao Xu @ 2021-02-03 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

io_sqe_files_unregister is currently called from several places:
    - syscall io_uring_register (with uring_lock)
    - io_ring_ctx_wait_and_kill() (without uring_lock)

There is a AA type deadlock in io_sqe_files_unregister(), thus we need
to know if we hold uring_lock in io_sqe_files_unregister() to fix the
issue.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 38c6cbe1ab38..efb6d02fea6f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7339,7 +7339,7 @@ static void io_sqe_files_set_node(struct fixed_file_data *file_data,
 	percpu_ref_get(&file_data->refs);
 }
 
-static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
 {
 	struct fixed_file_data *data = ctx->file_data;
 	struct fixed_file_ref_node *backup_node, *ref_node = NULL;
@@ -7872,13 +7872,13 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	ret = io_sqe_files_scm(ctx);
 	if (ret) {
-		io_sqe_files_unregister(ctx);
+		io_sqe_files_unregister(ctx, true);
 		return ret;
 	}
 
 	ref_node = alloc_fixed_file_ref_node(ctx);
 	if (!ref_node) {
-		io_sqe_files_unregister(ctx);
+		io_sqe_files_unregister(ctx, true);
 		return -ENOMEM;
 	}
 
@@ -8682,7 +8682,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		css_put(ctx->sqo_blkcg_css);
 #endif
 
-	io_sqe_files_unregister(ctx);
+	io_sqe_files_unregister(ctx, false);
 	io_eventfd_unregister(ctx);
 	io_destroy_buffers(ctx);
 	idr_destroy(&ctx->personality_idr);
@@ -10065,7 +10065,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		ret = -EINVAL;
 		if (arg || nr_args)
 			break;
-		ret = io_sqe_files_unregister(ctx);
+		ret = io_sqe_files_unregister(ctx, true);
 		break;
 	case IORING_REGISTER_FILES_UPDATE:
 		ret = io_sqe_files_update(ctx, arg, nr_args);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 14:57 [PATCH 0/2] fix deadlock in __io_req_task_submit() Hao Xu
  2021-02-03 14:57 ` [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister() Hao Xu
@ 2021-02-03 14:57 ` Hao Xu
  2021-02-03 16:35   ` Pavel Begunkov
  2021-02-04 11:33   ` Pavel Begunkov
  1 sibling, 2 replies; 20+ messages in thread
From: Hao Xu @ 2021-02-03 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Abaci reported the below issue:
[  141.400455] hrtimer: interrupt took 205853 ns
[  189.869316] process 'usr/local/ilogtail/ilogtail_0.16.26' started with executable stack
[  250.188042]
[  250.188327] ============================================
[  250.189015] WARNING: possible recursive locking detected
[  250.189732] 5.11.0-rc4 #1 Not tainted
[  250.190267] --------------------------------------------
[  250.190917] a.out/7363 is trying to acquire lock:
[  250.191506] ffff888114dbcbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __io_req_task_submit+0x29/0xa0
[  250.192599]
[  250.192599] but task is already holding lock:
[  250.193309] ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
[  250.194426]
[  250.194426] other info that might help us debug this:
[  250.195238]  Possible unsafe locking scenario:
[  250.195238]
[  250.196019]        CPU0
[  250.196411]        ----
[  250.196803]   lock(&ctx->uring_lock);
[  250.197420]   lock(&ctx->uring_lock);
[  250.197966]
[  250.197966]  *** DEADLOCK ***
[  250.197966]
[  250.198837]  May be due to missing lock nesting notation
[  250.198837]
[  250.199780] 1 lock held by a.out/7363:
[  250.200373]  #0: ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
[  250.201645]
[  250.201645] stack backtrace:
[  250.202298] CPU: 0 PID: 7363 Comm: a.out Not tainted 5.11.0-rc4 #1
[  250.203144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  250.203887] Call Trace:
[  250.204302]  dump_stack+0xac/0xe3
[  250.204804]  __lock_acquire+0xab6/0x13a0
[  250.205392]  lock_acquire+0x2c3/0x390
[  250.205928]  ? __io_req_task_submit+0x29/0xa0
[  250.206541]  __mutex_lock+0xae/0x9f0
[  250.207071]  ? __io_req_task_submit+0x29/0xa0
[  250.207745]  ? 0xffffffffa0006083
[  250.208248]  ? __io_req_task_submit+0x29/0xa0
[  250.208845]  ? __io_req_task_submit+0x29/0xa0
[  250.209452]  ? __io_req_task_submit+0x5/0xa0
[  250.210083]  __io_req_task_submit+0x29/0xa0
[  250.210687]  io_async_task_func+0x23d/0x4c0
[  250.211278]  task_work_run+0x89/0xd0
[  250.211884]  io_run_task_work_sig+0x50/0xc0
[  250.212464]  io_sqe_files_unregister+0xb2/0x1f0
[  250.213109]  __io_uring_register+0x115a/0x1750
[  250.213718]  ? __x64_sys_io_uring_register+0xad/0x210
[  250.214395]  ? __fget_files+0x15a/0x260
[  250.214956]  __x64_sys_io_uring_register+0xbe/0x210
[  250.215620]  ? trace_hardirqs_on+0x46/0x110
[  250.216205]  do_syscall_64+0x2d/0x40
[  250.216731]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  250.217455] RIP: 0033:0x7f0fa17e5239
[  250.218034] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[  250.220343] RSP: 002b:00007f0fa1eeac48 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
[  250.221360] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fa17e5239
[  250.222272] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000008
[  250.223185] RBP: 00007f0fa1eeae20 R08: 0000000000000000 R09: 0000000000000000
[  250.224091] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  250.224999] R13: 0000000000021000 R14: 0000000000000000 R15: 00007f0fa1eeb700

This is caused by calling io_run_task_work_sig() to do work under
uring_lock while the caller io_sqe_files_unregister() already held
uring_lock.
we need to check if uring_lock is held by us when doing unlock around
io_run_task_work_sig() since there are code paths down to that place
without uring_lock held.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index efb6d02fea6f..b093977713ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
 
 	/* wait for all refs nodes to complete */
 	flush_delayed_work(&ctx->file_put_work);
+	if (locked)
+		mutex_unlock(&ctx->uring_lock);
 	do {
 		ret = wait_for_completion_interruptible(&data->done);
 		if (!ret)
 			break;
 		ret = io_run_task_work_sig();
-		if (ret < 0) {
-			percpu_ref_resurrect(&data->refs);
-			reinit_completion(&data->done);
-			io_sqe_files_set_node(data, backup_node);
-			return ret;
-		}
+		if (ret < 0)
+			break;
 	} while (1);
+	if (locked)
+		mutex_lock(&ctx->uring_lock);
+
+	if (ret < 0) {
+		percpu_ref_resurrect(&data->refs);
+		reinit_completion(&data->done);
+		io_sqe_files_set_node(data, backup_node);
+		return ret;
+	}
 
 	__io_sqe_files_unregister(ctx);
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  2021-02-03 14:57 ` [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister() Hao Xu
@ 2021-02-03 16:33   ` Pavel Begunkov
  2021-02-04  3:34     ` Hao Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-03 16:33 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 03/02/2021 14:57, Hao Xu wrote:
> io_sqe_files_unregister is currently called from several places:
>     - syscall io_uring_register (with uring_lock)
>     - io_ring_ctx_wait_and_kill() (without uring_lock)
> 
> There is a AA type deadlock in io_sqe_files_unregister(), thus we need
> to know if we hold uring_lock in io_sqe_files_unregister() to fix the
> issue.

It's ugly, just take the lock and kill the patch. There can't be any
contention during io_ring_ctx_free anyway. 

> 
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io_uring.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 38c6cbe1ab38..efb6d02fea6f 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7339,7 +7339,7 @@ static void io_sqe_files_set_node(struct fixed_file_data *file_data,
>  	percpu_ref_get(&file_data->refs);
>  }
>  
> -static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>  {
>  	struct fixed_file_data *data = ctx->file_data;
>  	struct fixed_file_ref_node *backup_node, *ref_node = NULL;
> @@ -7872,13 +7872,13 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>  
>  	ret = io_sqe_files_scm(ctx);
>  	if (ret) {
> -		io_sqe_files_unregister(ctx);
> +		io_sqe_files_unregister(ctx, true);
>  		return ret;
>  	}
>  
>  	ref_node = alloc_fixed_file_ref_node(ctx);
>  	if (!ref_node) {
> -		io_sqe_files_unregister(ctx);
> +		io_sqe_files_unregister(ctx, true);
>  		return -ENOMEM;
>  	}
>  
> @@ -8682,7 +8682,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>  		css_put(ctx->sqo_blkcg_css);
>  #endif
>  
> -	io_sqe_files_unregister(ctx);
> +	io_sqe_files_unregister(ctx, false);
>  	io_eventfd_unregister(ctx);
>  	io_destroy_buffers(ctx);
>  	idr_destroy(&ctx->personality_idr);
> @@ -10065,7 +10065,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>  		ret = -EINVAL;
>  		if (arg || nr_args)
>  			break;
> -		ret = io_sqe_files_unregister(ctx);
> +		ret = io_sqe_files_unregister(ctx, true);
>  		break;
>  	case IORING_REGISTER_FILES_UPDATE:
>  		ret = io_sqe_files_update(ctx, arg, nr_args);
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 14:57 ` [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work* Hao Xu
@ 2021-02-03 16:35   ` Pavel Begunkov
  2021-02-03 16:45     ` Pavel Begunkov
  2021-02-04 11:33   ` Pavel Begunkov
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-03 16:35 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 03/02/2021 14:57, Hao Xu wrote:
> This is caused by calling io_run_task_work_sig() to do work under
> uring_lock while the caller io_sqe_files_unregister() already held
> uring_lock.
> we need to check if uring_lock is held by us when doing unlock around
> io_run_task_work_sig() since there are code paths down to that place
> without uring_lock held.

1. we don't want to allow parallel io_sqe_files_unregister()s
happening, it's synchronised by uring_lock atm. Otherwise it's
buggy.

2. probably same with unregister and submit.

> 
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io_uring.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index efb6d02fea6f..b093977713ee 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>  
>  	/* wait for all refs nodes to complete */
>  	flush_delayed_work(&ctx->file_put_work);
> +	if (locked)
> +		mutex_unlock(&ctx->uring_lock);
>  	do {
>  		ret = wait_for_completion_interruptible(&data->done);
>  		if (!ret)
>  			break;
>  		ret = io_run_task_work_sig();
> -		if (ret < 0) {
> -			percpu_ref_resurrect(&data->refs);
> -			reinit_completion(&data->done);
> -			io_sqe_files_set_node(data, backup_node);
> -			return ret;
> -		}
> +		if (ret < 0)
> +			break;
>  	} while (1);
> +	if (locked)
> +		mutex_lock(&ctx->uring_lock);
> +
> +	if (ret < 0) {
> +		percpu_ref_resurrect(&data->refs);
> +		reinit_completion(&data->done);
> +		io_sqe_files_set_node(data, backup_node);
> +		return ret;
> +	}
>  
>  	__io_sqe_files_unregister(ctx);
>  	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 16:35   ` Pavel Begunkov
@ 2021-02-03 16:45     ` Pavel Begunkov
  2021-02-04  3:25       ` Hao Xu
  2021-02-05 10:03       ` Hao Xu
  0 siblings, 2 replies; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-03 16:45 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 03/02/2021 16:35, Pavel Begunkov wrote:
> On 03/02/2021 14:57, Hao Xu wrote:
>> This is caused by calling io_run_task_work_sig() to do work under
>> uring_lock while the caller io_sqe_files_unregister() already held
>> uring_lock.
>> we need to check if uring_lock is held by us when doing unlock around
>> io_run_task_work_sig() since there are code paths down to that place
>> without uring_lock held.
> 
> 1. we don't want to allow parallel io_sqe_files_unregister()s
> happening, it's synchronised by uring_lock atm. Otherwise it's
> buggy.

This one should be simple, alike to

if (percpu_refs_is_dying())
	return error; // fail *files_unregister();

> 
> 2. probably same with unregister and submit.
> 
>>
>> Reported-by: Abaci <abaci@linux.alibaba.com>
>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>  fs/io_uring.c | 19 +++++++++++++------
>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index efb6d02fea6f..b093977713ee 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>  
>>  	/* wait for all refs nodes to complete */
>>  	flush_delayed_work(&ctx->file_put_work);
>> +	if (locked)
>> +		mutex_unlock(&ctx->uring_lock);
>>  	do {
>>  		ret = wait_for_completion_interruptible(&data->done);
>>  		if (!ret)
>>  			break;
>>  		ret = io_run_task_work_sig();
>> -		if (ret < 0) {
>> -			percpu_ref_resurrect(&data->refs);
>> -			reinit_completion(&data->done);
>> -			io_sqe_files_set_node(data, backup_node);
>> -			return ret;
>> -		}
>> +		if (ret < 0)
>> +			break;
>>  	} while (1);
>> +	if (locked)
>> +		mutex_lock(&ctx->uring_lock);
>> +
>> +	if (ret < 0) {
>> +		percpu_ref_resurrect(&data->refs);
>> +		reinit_completion(&data->done);
>> +		io_sqe_files_set_node(data, backup_node);
>> +		return ret;
>> +	}
>>  
>>  	__io_sqe_files_unregister(ctx);
>>  	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 16:45     ` Pavel Begunkov
@ 2021-02-04  3:25       ` Hao Xu
  2021-02-04 11:17         ` Pavel Begunkov
  2021-02-05 10:03       ` Hao Xu
  1 sibling, 1 reply; 20+ messages in thread
From: Hao Xu @ 2021-02-04  3:25 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/4 上午12:45, Pavel Begunkov 写道:
> On 03/02/2021 16:35, Pavel Begunkov wrote:
>> On 03/02/2021 14:57, Hao Xu wrote:
>>> This is caused by calling io_run_task_work_sig() to do work under
>>> uring_lock while the caller io_sqe_files_unregister() already held
>>> uring_lock.
>>> we need to check if uring_lock is held by us when doing unlock around
>>> io_run_task_work_sig() since there are code paths down to that place
>>> without uring_lock held.
>>
>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>> happening, it's synchronised by uring_lock atm. Otherwise it's
>> buggy.
Here "since there are code paths down to that place without uring_lock 
held" I mean code path of io_ring_ctx_free().
> 
> This one should be simple, alike to
> 
> if (percpu_refs_is_dying())
> 	return error; // fail *files_unregister();
> 
>>
>> 2. probably same with unregister and submit.
>>
>>>
>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>>   fs/io_uring.c | 19 +++++++++++++------
>>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index efb6d02fea6f..b093977713ee 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>   
>>>   	/* wait for all refs nodes to complete */
>>>   	flush_delayed_work(&ctx->file_put_work);
>>> +	if (locked)
>>> +		mutex_unlock(&ctx->uring_lock);
>>>   	do {
>>>   		ret = wait_for_completion_interruptible(&data->done);
>>>   		if (!ret)
>>>   			break;
>>>   		ret = io_run_task_work_sig();
>>> -		if (ret < 0) {
>>> -			percpu_ref_resurrect(&data->refs);
>>> -			reinit_completion(&data->done);
>>> -			io_sqe_files_set_node(data, backup_node);
>>> -			return ret;
>>> -		}
>>> +		if (ret < 0)
>>> +			break;
>>>   	} while (1);
>>> +	if (locked)
>>> +		mutex_lock(&ctx->uring_lock);
>>> +
>>> +	if (ret < 0) {
>>> +		percpu_ref_resurrect(&data->refs);
>>> +		reinit_completion(&data->done);
>>> +		io_sqe_files_set_node(data, backup_node);
>>> +		return ret;
>>> +	}
>>>   
>>>   	__io_sqe_files_unregister(ctx);
>>>   	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>>
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  2021-02-03 16:33   ` Pavel Begunkov
@ 2021-02-04  3:34     ` Hao Xu
  2021-02-04 11:11       ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Xu @ 2021-02-04  3:34 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/4 上午12:33, Pavel Begunkov 写道:
> On 03/02/2021 14:57, Hao Xu wrote:
>> io_sqe_files_unregister is currently called from several places:
>>      - syscall io_uring_register (with uring_lock)
>>      - io_ring_ctx_wait_and_kill() (without uring_lock)
>>
>> There is a AA type deadlock in io_sqe_files_unregister(), thus we need
>> to know if we hold uring_lock in io_sqe_files_unregister() to fix the
>> issue.
> 
> It's ugly, just take the lock and kill the patch. There can't be any
> contention during io_ring_ctx_free anyway.
Hi Pavel, I don't get it, do you mean this patch isn't needed, and we 
can just unlock(&uring_lock) before io_run_task_work_sig() and 
lock(&uring_lock) after it? I knew there won't be contention during 
io_ring_ctx_free that's why there is no uring_lock in it.
I tried to just do unlock(&uring_lock) before io_run_task_sig() without 
if(locked) check, it reports something like "there are unpaired mutex 
lock/unlock" since we cannot just unlock if it's from io_ring_ctx_free.
> 
>>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>   fs/io_uring.c | 10 +++++-----
>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 38c6cbe1ab38..efb6d02fea6f 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -7339,7 +7339,7 @@ static void io_sqe_files_set_node(struct fixed_file_data *file_data,
>>   	percpu_ref_get(&file_data->refs);
>>   }
>>   
>> -static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>   {
>>   	struct fixed_file_data *data = ctx->file_data;
>>   	struct fixed_file_ref_node *backup_node, *ref_node = NULL;
>> @@ -7872,13 +7872,13 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>>   
>>   	ret = io_sqe_files_scm(ctx);
>>   	if (ret) {
>> -		io_sqe_files_unregister(ctx);
>> +		io_sqe_files_unregister(ctx, true);
>>   		return ret;
>>   	}
>>   
>>   	ref_node = alloc_fixed_file_ref_node(ctx);
>>   	if (!ref_node) {
>> -		io_sqe_files_unregister(ctx);
>> +		io_sqe_files_unregister(ctx, true);
>>   		return -ENOMEM;
>>   	}
>>   
>> @@ -8682,7 +8682,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>>   		css_put(ctx->sqo_blkcg_css);
>>   #endif
>>   
>> -	io_sqe_files_unregister(ctx);
>> +	io_sqe_files_unregister(ctx, false);
>>   	io_eventfd_unregister(ctx);
>>   	io_destroy_buffers(ctx);
>>   	idr_destroy(&ctx->personality_idr);
>> @@ -10065,7 +10065,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>>   		ret = -EINVAL;
>>   		if (arg || nr_args)
>>   			break;
>> -		ret = io_sqe_files_unregister(ctx);
>> +		ret = io_sqe_files_unregister(ctx, true);
>>   		break;
>>   	case IORING_REGISTER_FILES_UPDATE:
>>   		ret = io_sqe_files_update(ctx, arg, nr_args);
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  2021-02-04  3:34     ` Hao Xu
@ 2021-02-04 11:11       ` Pavel Begunkov
  2021-02-04 14:49         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-04 11:11 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 04/02/2021 03:34, Hao Xu wrote:
> 在 2021/2/4 上午12:33, Pavel Begunkov 写道:
>> On 03/02/2021 14:57, Hao Xu wrote:
>>> io_sqe_files_unregister is currently called from several places:
>>>      - syscall io_uring_register (with uring_lock)
>>>      - io_ring_ctx_wait_and_kill() (without uring_lock)
>>>
>>> There is a AA type deadlock in io_sqe_files_unregister(), thus we need
>>> to know if we hold uring_lock in io_sqe_files_unregister() to fix the
>>> issue.
>>
>> It's ugly, just take the lock and kill the patch. There can't be any
>> contention during io_ring_ctx_free anyway.
> Hi Pavel, I don't get it, do you mean this patch isn't needed, and we can just unlock(&uring_lock) before io_run_task_work_sig() and lock(&uring_lock) after it? I knew there won't be contention during io_ring_ctx_free that's why there is no uring_lock in it.
> I tried to just do unlock(&uring_lock) before io_run_task_sig() without if(locked) check, it reports something like "there are unpaired mutex lock/unlock" since we cannot just unlock if it's from io_ring_ctx_free.


The ugly part is @locked. I know that there is already similar stuff
around, but I may go long why and how much I don't like it.

io_ring_ctx_free()
{
    ...
    lock(uring_lock);
    files_unregister();
    unlock(uring_lock);
    ...
}

With this you'll always have the mutex locked in unregister, so
can drop it unconditionally (if that will ever be needed). It's
also cleaner from the synchronisation perspective.

>>
>>>
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>>   fs/io_uring.c | 10 +++++-----
>>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 38c6cbe1ab38..efb6d02fea6f 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -7339,7 +7339,7 @@ static void io_sqe_files_set_node(struct fixed_file_data *file_data,
>>>       percpu_ref_get(&file_data->refs);
>>>   }
>>>   -static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>   {
>>>       struct fixed_file_data *data = ctx->file_data;
>>>       struct fixed_file_ref_node *backup_node, *ref_node = NULL;
>>> @@ -7872,13 +7872,13 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>>>         ret = io_sqe_files_scm(ctx);
>>>       if (ret) {
>>> -        io_sqe_files_unregister(ctx);
>>> +        io_sqe_files_unregister(ctx, true);
>>>           return ret;
>>>       }
>>>         ref_node = alloc_fixed_file_ref_node(ctx);
>>>       if (!ref_node) {
>>> -        io_sqe_files_unregister(ctx);
>>> +        io_sqe_files_unregister(ctx, true);
>>>           return -ENOMEM;
>>>       }
>>>   @@ -8682,7 +8682,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>>>           css_put(ctx->sqo_blkcg_css);
>>>   #endif
>>>   -    io_sqe_files_unregister(ctx);
>>> +    io_sqe_files_unregister(ctx, false);
>>>       io_eventfd_unregister(ctx);
>>>       io_destroy_buffers(ctx);
>>>       idr_destroy(&ctx->personality_idr);
>>> @@ -10065,7 +10065,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>>>           ret = -EINVAL;
>>>           if (arg || nr_args)
>>>               break;
>>> -        ret = io_sqe_files_unregister(ctx);
>>> +        ret = io_sqe_files_unregister(ctx, true);
>>>           break;
>>>       case IORING_REGISTER_FILES_UPDATE:
>>>           ret = io_sqe_files_update(ctx, arg, nr_args);
>>>
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-04  3:25       ` Hao Xu
@ 2021-02-04 11:17         ` Pavel Begunkov
  2021-02-04 15:26           ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-04 11:17 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 04/02/2021 03:25, Hao Xu wrote:
> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>> uring_lock.
>>>> we need to check if uring_lock is held by us when doing unlock around
>>>> io_run_task_work_sig() since there are code paths down to that place
>>>> without uring_lock held.
>>>
>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>> buggy.
> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().

I guess it's to the 1/2, but let me outline the problem again:
if you have two tasks userspace threads sharing a ring, then they
can both and in parallel call syscall:files_unregeister. That's
a potential double percpu_ref_kill(&data->refs), or even worse.

Same for 2, but racing for the table and refs.

>>
>> This one should be simple, alike to
>>
>> if (percpu_refs_is_dying())
>>     return error; // fail *files_unregister();
>>
>>>
>>> 2. probably same with unregister and submit.
>>>
>>>>
>>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>> ---
>>>>   fs/io_uring.c | 19 +++++++++++++------
>>>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index efb6d02fea6f..b093977713ee 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>>         /* wait for all refs nodes to complete */
>>>>       flush_delayed_work(&ctx->file_put_work);
>>>> +    if (locked)
>>>> +        mutex_unlock(&ctx->uring_lock);
>>>>       do {
>>>>           ret = wait_for_completion_interruptible(&data->done);
>>>>           if (!ret)
>>>>               break;
>>>>           ret = io_run_task_work_sig();
>>>> -        if (ret < 0) {
>>>> -            percpu_ref_resurrect(&data->refs);
>>>> -            reinit_completion(&data->done);
>>>> -            io_sqe_files_set_node(data, backup_node);
>>>> -            return ret;
>>>> -        }
>>>> +        if (ret < 0)
>>>> +            break;
>>>>       } while (1);
>>>> +    if (locked)
>>>> +        mutex_lock(&ctx->uring_lock);
>>>> +
>>>> +    if (ret < 0) {
>>>> +        percpu_ref_resurrect(&data->refs);
>>>> +        reinit_completion(&data->done);
>>>> +        io_sqe_files_set_node(data, backup_node);
>>>> +        return ret;
>>>> +    }
>>>>         __io_sqe_files_unregister(ctx);
>>>>       nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>>>
>>>
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 14:57 ` [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work* Hao Xu
  2021-02-03 16:35   ` Pavel Begunkov
@ 2021-02-04 11:33   ` Pavel Begunkov
  1 sibling, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-04 11:33 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 03/02/2021 14:57, Hao Xu wrote:
> This is caused by calling io_run_task_work_sig() to do work under
> uring_lock while the caller io_sqe_files_unregister() already held
> uring_lock.
> we need to check if uring_lock is held by us when doing unlock around
> io_run_task_work_sig() since there are code paths down to that place
> without uring_lock held.

btw, better to prepare it for-5.12, seems it won't apply

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index efb6d02fea6f..b093977713ee 100644
[...]
> -		if (ret < 0) {
> -			percpu_ref_resurrect(&data->refs);
> -			reinit_completion(&data->done);
> -			io_sqe_files_set_node(data, backup_node);
> -			return ret;
> -		}
> +		if (ret < 0)
> +			break;
>  	} while (1);

while (ret >= 0) ?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister()
  2021-02-04 11:11       ` Pavel Begunkov
@ 2021-02-04 14:49         ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2021-02-04 14:49 UTC (permalink / raw)
  To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi

On 2/4/21 4:11 AM, Pavel Begunkov wrote:
> On 04/02/2021 03:34, Hao Xu wrote:
>> 在 2021/2/4 上午12:33, Pavel Begunkov 写道:
>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>> io_sqe_files_unregister is currently called from several places:
>>>>      - syscall io_uring_register (with uring_lock)
>>>>      - io_ring_ctx_wait_and_kill() (without uring_lock)
>>>>
>>>> There is a AA type deadlock in io_sqe_files_unregister(), thus we need
>>>> to know if we hold uring_lock in io_sqe_files_unregister() to fix the
>>>> issue.
>>>
>>> It's ugly, just take the lock and kill the patch. There can't be any
>>> contention during io_ring_ctx_free anyway.
>> Hi Pavel, I don't get it, do you mean this patch isn't needed, and we can just unlock(&uring_lock) before io_run_task_work_sig() and lock(&uring_lock) after it? I knew there won't be contention during io_ring_ctx_free that's why there is no uring_lock in it.
>> I tried to just do unlock(&uring_lock) before io_run_task_sig() without if(locked) check, it reports something like "there are unpaired mutex lock/unlock" since we cannot just unlock if it's from io_ring_ctx_free.
> 
> 
> The ugly part is @locked. I know that there is already similar stuff
> around, but I may go long why and how much I don't like it.
> 
> io_ring_ctx_free()
> {
>     ...
>     lock(uring_lock);
>     files_unregister();
>     unlock(uring_lock);
>     ...
> }
> 
> With this you'll always have the mutex locked in unregister, so
> can drop it unconditionally (if that will ever be needed). It's
> also cleaner from the synchronisation perspective.

Yes that's a much better approach - passing around a 'locked' flag
is not very pretty, and the fewer we have of those the better.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-04 11:17         ` Pavel Begunkov
@ 2021-02-04 15:26           ` Pavel Begunkov
  2021-02-05  9:57             ` Hao Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-04 15:26 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi



On 04/02/2021 11:17, Pavel Begunkov wrote:
> On 04/02/2021 03:25, Hao Xu wrote:
>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>> uring_lock.
>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>> without uring_lock held.
>>>>
>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>> buggy.
>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
> 
> I guess it's to the 1/2, but let me outline the problem again:
> if you have two tasks userspace threads sharing a ring, then they
> can both and in parallel call syscall:files_unregeister. That's
> a potential double percpu_ref_kill(&data->refs), or even worse.
> 
> Same for 2, but racing for the table and refs.

There is a couple of thoughts for this:

1. I don't like waiting without holding the lock in general, because
someone can submit more reqs in-between and so indefinitely postponing
the files_unregister.
2. I wouldn't want to add checks for that in submission path.

So, a solution I think about is to wait under the lock, If we need to
run task_works -- briefly drop the lock, run task_works and then do
all unregister all over again. Keep an eye on refs, e.g. probably
need to resurrect it.

Because we current task is busy nobody submits new requests on
its behalf, and so there can't be infinite number of in-task_work
reqs, and eventually it will just go wait/sleep forever (if not
signalled) under the mutex, so we can a kind of upper bound on
time.

> 
>>>
>>> This one should be simple, alike to
>>>
>>> if (percpu_refs_is_dying())
>>>     return error; // fail *files_unregister();
>>>
>>>>
>>>> 2. probably same with unregister and submit.
>>>>
>>>>>
>>>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>>>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>> ---
>>>>>   fs/io_uring.c | 19 +++++++++++++------
>>>>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index efb6d02fea6f..b093977713ee 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>>>         /* wait for all refs nodes to complete */
>>>>>       flush_delayed_work(&ctx->file_put_work);
>>>>> +    if (locked)
>>>>> +        mutex_unlock(&ctx->uring_lock);
>>>>>       do {
>>>>>           ret = wait_for_completion_interruptible(&data->done);
>>>>>           if (!ret)
>>>>>               break;
>>>>>           ret = io_run_task_work_sig();
>>>>> -        if (ret < 0) {
>>>>> -            percpu_ref_resurrect(&data->refs);
>>>>> -            reinit_completion(&data->done);
>>>>> -            io_sqe_files_set_node(data, backup_node);
>>>>> -            return ret;
>>>>> -        }
>>>>> +        if (ret < 0)
>>>>> +            break;
>>>>>       } while (1);
>>>>> +    if (locked)
>>>>> +        mutex_lock(&ctx->uring_lock);
>>>>> +
>>>>> +    if (ret < 0) {
>>>>> +        percpu_ref_resurrect(&data->refs);
>>>>> +        reinit_completion(&data->done);
>>>>> +        io_sqe_files_set_node(data, backup_node);
>>>>> +        return ret;
>>>>> +    }
>>>>>         __io_sqe_files_unregister(ctx);
>>>>>       nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>>>>
>>>>
>>>
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-04 15:26           ` Pavel Begunkov
@ 2021-02-05  9:57             ` Hao Xu
  2021-02-05 10:18               ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Xu @ 2021-02-05  9:57 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/4 下午11:26, Pavel Begunkov 写道:
> 
> 
> On 04/02/2021 11:17, Pavel Begunkov wrote:
>> On 04/02/2021 03:25, Hao Xu wrote:
>>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>>> uring_lock.
>>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>>> without uring_lock held.
>>>>>
>>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>>> buggy.
>>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
>>
>> I guess it's to the 1/2, but let me outline the problem again:
>> if you have two tasks userspace threads sharing a ring, then they
>> can both and in parallel call syscall:files_unregeister. That's
>> a potential double percpu_ref_kill(&data->refs), or even worse.
>>
>> Same for 2, but racing for the table and refs.
> 
> There is a couple of thoughts for this:
> 
> 1. I don't like waiting without holding the lock in general, because
> someone can submit more reqs in-between and so indefinitely postponing
> the files_unregister.
Thanks, Pavel.
I thought this issue before, until I saw this in __io_uring_register:

   if (io_register_op_must_quiesce(opcode)) {
           percpu_ref_kill(&ctx->refs);

           /*
           ¦* Drop uring mutex before waiting for references to exit. If
           ¦* another thread is currently inside io_uring_enter() it might
           ¦* need to grab the uring_lock to make progress. If we hold it
           ¦* here across the drain wait, then we can deadlock. It's safe
           ¦* to drop the mutex here, since no new references will come in
           ¦* after we've killed the percpu ref.
           ¦*/
           mutex_unlock(&ctx->uring_lock);
           do {
                   ret = wait_for_completion_interruptible(&ctx->ref_comp);
                   if (!ret)
                           break;
                   ret = io_run_task_work_sig();
                   if (ret < 0)
                           break;
           } while (1);

           mutex_lock(&ctx->uring_lock);

           if (ret) {
                   percpu_ref_resurrect(&ctx->refs);
                   goto out_quiesce;
           }
   }

So now I guess the postponement issue also exits in the above code since
there could be another thread submiting reqs to the shared ctx(or we can 
say uring fd).

> 2. I wouldn't want to add checks for that in submission path.
> 
> So, a solution I think about is to wait under the lock, If we need to
> run task_works -- briefly drop the lock, run task_works and then do
> all unregister all over again. Keep an eye on refs, e.g. probably
> need to resurrect it.
> 
> Because we current task is busy nobody submits new requests on
> its behalf, and so there can't be infinite number of in-task_work
> reqs, and eventually it will just go wait/sleep forever (if not
> signalled) under the mutex, so we can a kind of upper bound on
> time.
> 
Do you mean sleeping with timeout rather than just sleeping? I think 
this works, I'll work on this and think about the detail.
But before addressing this issue, Should I first send a patch to just 
fix the deadlock issue?

Thanks,
Hao
>>
>>>>
>>>> This one should be simple, alike to
>>>>
>>>> if (percpu_refs_is_dying())
>>>>      return error; // fail *files_unregister();
>>>>
>>>>>
>>>>> 2. probably same with unregister and submit.
>>>>>
>>>>>>
>>>>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>>>>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>> ---
>>>>>>    fs/io_uring.c | 19 +++++++++++++------
>>>>>>    1 file changed, 13 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index efb6d02fea6f..b093977713ee 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>>>>          /* wait for all refs nodes to complete */
>>>>>>        flush_delayed_work(&ctx->file_put_work);
>>>>>> +    if (locked)
>>>>>> +        mutex_unlock(&ctx->uring_lock);
>>>>>>        do {
>>>>>>            ret = wait_for_completion_interruptible(&data->done);
>>>>>>            if (!ret)
>>>>>>                break;
>>>>>>            ret = io_run_task_work_sig();
>>>>>> -        if (ret < 0) {
>>>>>> -            percpu_ref_resurrect(&data->refs);
>>>>>> -            reinit_completion(&data->done);
>>>>>> -            io_sqe_files_set_node(data, backup_node);
>>>>>> -            return ret;
>>>>>> -        }
>>>>>> +        if (ret < 0)
>>>>>> +            break;
>>>>>>        } while (1);
>>>>>> +    if (locked)
>>>>>> +        mutex_lock(&ctx->uring_lock);
>>>>>> +
>>>>>> +    if (ret < 0) {
>>>>>> +        percpu_ref_resurrect(&data->refs);
>>>>>> +        reinit_completion(&data->done);
>>>>>> +        io_sqe_files_set_node(data, backup_node);
>>>>>> +        return ret;
>>>>>> +    }
>>>>>>          __io_sqe_files_unregister(ctx);
>>>>>>        nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-03 16:45     ` Pavel Begunkov
  2021-02-04  3:25       ` Hao Xu
@ 2021-02-05 10:03       ` Hao Xu
  1 sibling, 0 replies; 20+ messages in thread
From: Hao Xu @ 2021-02-05 10:03 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/4 上午12:45, Pavel Begunkov 写道:
> On 03/02/2021 16:35, Pavel Begunkov wrote:
>> On 03/02/2021 14:57, Hao Xu wrote:
>>> This is caused by calling io_run_task_work_sig() to do work under
>>> uring_lock while the caller io_sqe_files_unregister() already held
>>> uring_lock.
>>> we need to check if uring_lock is held by us when doing unlock around
>>> io_run_task_work_sig() since there are code paths down to that place
>>> without uring_lock held.
>>
>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>> happening, it's synchronised by uring_lock atm. Otherwise it's
>> buggy.
> 
> This one should be simple, alike to
> 
> if (percpu_refs_is_dying())
> 	return error; // fail *files_unregister();
> 
Agree.
>>
>> 2. probably same with unregister and submit.
>>
>>>
>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>> Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>>   fs/io_uring.c | 19 +++++++++++++------
>>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index efb6d02fea6f..b093977713ee 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
>>>   
>>>   	/* wait for all refs nodes to complete */
>>>   	flush_delayed_work(&ctx->file_put_work);
>>> +	if (locked)
>>> +		mutex_unlock(&ctx->uring_lock);
>>>   	do {
>>>   		ret = wait_for_completion_interruptible(&data->done);
>>>   		if (!ret)
>>>   			break;
>>>   		ret = io_run_task_work_sig();
>>> -		if (ret < 0) {
>>> -			percpu_ref_resurrect(&data->refs);
>>> -			reinit_completion(&data->done);
>>> -			io_sqe_files_set_node(data, backup_node);
>>> -			return ret;
>>> -		}
>>> +		if (ret < 0)
>>> +			break;
>>>   	} while (1);
>>> +	if (locked)
>>> +		mutex_lock(&ctx->uring_lock);
>>> +
>>> +	if (ret < 0) {
>>> +		percpu_ref_resurrect(&data->refs);
>>> +		reinit_completion(&data->done);
>>> +		io_sqe_files_set_node(data, backup_node);
>>> +		return ret;
>>> +	}
>>>   
>>>   	__io_sqe_files_unregister(ctx);
>>>   	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);
>>>
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-05  9:57             ` Hao Xu
@ 2021-02-05 10:18               ` Pavel Begunkov
  2021-02-06 11:34                 ` Hao Xu
                                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-05 10:18 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 05/02/2021 09:57, Hao Xu wrote:
> 在 2021/2/4 下午11:26, Pavel Begunkov 写道:
>> On 04/02/2021 11:17, Pavel Begunkov wrote:
>>> On 04/02/2021 03:25, Hao Xu wrote:
>>>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>>>> uring_lock.
>>>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>>>> without uring_lock held.
>>>>>>
>>>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>>>> buggy.
>>>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
>>>
>>> I guess it's to the 1/2, but let me outline the problem again:
>>> if you have two tasks userspace threads sharing a ring, then they
>>> can both and in parallel call syscall:files_unregeister. That's
>>> a potential double percpu_ref_kill(&data->refs), or even worse.
>>>
>>> Same for 2, but racing for the table and refs.
>>
>> There is a couple of thoughts for this:
>>
>> 1. I don't like waiting without holding the lock in general, because
>> someone can submit more reqs in-between and so indefinitely postponing
>> the files_unregister.
> Thanks, Pavel.
> I thought this issue before, until I saw this in __io_uring_register:
> 
>   if (io_register_op_must_quiesce(opcode)) {
>           percpu_ref_kill(&ctx->refs);

It is different because of this kill, it will prevent submissions.

> 
>           /*
>           ¦* Drop uring mutex before waiting for references to exit. If
>           ¦* another thread is currently inside io_uring_enter() it might
>           ¦* need to grab the uring_lock to make progress. If we hold it
>           ¦* here across the drain wait, then we can deadlock. It's safe
>           ¦* to drop the mutex here, since no new references will come in
>           ¦* after we've killed the percpu ref.
>           ¦*/
>           mutex_unlock(&ctx->uring_lock);
>           do {
>                   ret = wait_for_completion_interruptible(&ctx->ref_comp);
>                   if (!ret)
>                           break;
>                   ret = io_run_task_work_sig();
>                   if (ret < 0)
>                           break;
>           } while (1);
> 
>           mutex_lock(&ctx->uring_lock);
> 
>           if (ret) {
>                   percpu_ref_resurrect(&ctx->refs);
>                   goto out_quiesce;
>           }
>   }
> 
> So now I guess the postponement issue also exits in the above code since
> there could be another thread submiting reqs to the shared ctx(or we can say uring fd).
> 
>> 2. I wouldn't want to add checks for that in submission path.
>>
>> So, a solution I think about is to wait under the lock, If we need to
>> run task_works -- briefly drop the lock, run task_works and then do
>> all unregister all over again. Keep an eye on refs, e.g. probably
>> need to resurrect it.
>>
>> Because we current task is busy nobody submits new requests on
>> its behalf, and so there can't be infinite number of in-task_work
>> reqs, and eventually it will just go wait/sleep forever (if not
>> signalled) under the mutex, so we can a kind of upper bound on
>> time.
>>
> Do you mean sleeping with timeout rather than just sleeping? I think this works, I'll work on this and think about the detail.

Without timeout -- it will be awaken when new task_works are coming in,
but Jens knows better.

> But before addressing this issue, Should I first send a patch to just fix the deadlock issue?

Do you mean the deadlock 2/2 was trying to fix? Or some else? The thread
is all about fixing it, but doing it right. Not sure there is a need for
faster but incomplete solution, if that's what you meant.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-05 10:18               ` Pavel Begunkov
@ 2021-02-06 11:34                 ` Hao Xu
  2021-02-07 17:16                   ` Pavel Begunkov
  2021-02-06 16:21                 ` Hao Xu
  2021-02-11 13:30                 ` Hao Xu
  2 siblings, 1 reply; 20+ messages in thread
From: Hao Xu @ 2021-02-06 11:34 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/5 下午6:18, Pavel Begunkov 写道:
> On 05/02/2021 09:57, Hao Xu wrote:
>> 在 2021/2/4 下午11:26, Pavel Begunkov 写道:
>>> On 04/02/2021 11:17, Pavel Begunkov wrote:
>>>> On 04/02/2021 03:25, Hao Xu wrote:
>>>>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>>>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>>>>> uring_lock.
>>>>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>>>>> without uring_lock held.
>>>>>>>
>>>>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>>>>> buggy.
>>>>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
>>>>
>>>> I guess it's to the 1/2, but let me outline the problem again:
>>>> if you have two tasks userspace threads sharing a ring, then they
>>>> can both and in parallel call syscall:files_unregeister. That's
>>>> a potential double percpu_ref_kill(&data->refs), or even worse.
>>>>
>>>> Same for 2, but racing for the table and refs.
>>>
>>> There is a couple of thoughts for this:
>>>
>>> 1. I don't like waiting without holding the lock in general, because
>>> someone can submit more reqs in-between and so indefinitely postponing
>>> the files_unregister.
>> Thanks, Pavel.
>> I thought this issue before, until I saw this in __io_uring_register:
>>
>>    if (io_register_op_must_quiesce(opcode)) {
>>            percpu_ref_kill(&ctx->refs);
> 
> It is different because of this kill, it will prevent submissions.
> 
I saw percpu_ref_is_dying(&ctx->refs) check in sq thread but not
in io_uring_enter(), so I guess there could be another thread doing
io_uring_enter() and submiting sqes.
>>
>>            /*
>>            ¦* Drop uring mutex before waiting for references to exit. If
>>            ¦* another thread is currently inside io_uring_enter() it might
>>            ¦* need to grab the uring_lock to make progress. If we hold it
>>            ¦* here across the drain wait, then we can deadlock. It's safe
>>            ¦* to drop the mutex here, since no new references will come in
>>            ¦* after we've killed the percpu ref.
>>            ¦*/
>>            mutex_unlock(&ctx->uring_lock);
>>            do {
>>                    ret = wait_for_completion_interruptible(&ctx->ref_comp);
>>                    if (!ret)
>>                            break;
>>                    ret = io_run_task_work_sig();
>>                    if (ret < 0)
>>                            break;
>>            } while (1);
>>
>>            mutex_lock(&ctx->uring_lock);
>>
>>            if (ret) {
>>                    percpu_ref_resurrect(&ctx->refs);
>>                    goto out_quiesce;
>>            }
>>    }
>>
>> So now I guess the postponement issue also exits in the above code since
>> there could be another thread submiting reqs to the shared ctx(or we can say uring fd).
>>
>>> 2. I wouldn't want to add checks for that in submission path.
>>>
>>> So, a solution I think about is to wait under the lock, If we need to
>>> run task_works -- briefly drop the lock, run task_works and then do
>>> all unregister all over again. Keep an eye on refs, e.g. probably
>>> need to resurrect it.
>>>
>>> Because we current task is busy nobody submits new requests on
>>> its behalf, and so there can't be infinite number of in-task_work
>>> reqs, and eventually it will just go wait/sleep forever (if not
>>> signalled) under the mutex, so we can a kind of upper bound on
>>> time.
Sorry Pavel, I don't quiet understand "so we can a kind of upper bound 
on time". :(
>>>
>> Do you mean sleeping with timeout rather than just sleeping? I think this works, I'll work on this and think about the detail.
> 
> Without timeout -- it will be awaken when new task_works are coming in,
> but Jens knows better.
So we can just put unlock and lock around io_run_task_work_sig()
to address the issue 2?
> 
>> But before addressing this issue, Should I first send a patch to just fix the deadlock issue?
> 
> Do you mean the deadlock 2/2 was trying to fix? Or some else? The thread
> is all about fixing it, but doing it right. Not sure there is a need for
> faster but incomplete solution, if that's what you meant.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-05 10:18               ` Pavel Begunkov
  2021-02-06 11:34                 ` Hao Xu
@ 2021-02-06 16:21                 ` Hao Xu
  2021-02-11 13:30                 ` Hao Xu
  2 siblings, 0 replies; 20+ messages in thread
From: Hao Xu @ 2021-02-06 16:21 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/5 下午6:18, Pavel Begunkov 写道:
> On 05/02/2021 09:57, Hao Xu wrote:
>> 在 2021/2/4 下午11:26, Pavel Begunkov 写道:
>>> On 04/02/2021 11:17, Pavel Begunkov wrote:
>>>> On 04/02/2021 03:25, Hao Xu wrote:
>>>>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>>>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>>>>> uring_lock.
>>>>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>>>>> without uring_lock held.
>>>>>>>
>>>>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>>>>> buggy.
>>>>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
>>>>
>>>> I guess it's to the 1/2, but let me outline the problem again:
>>>> if you have two tasks userspace threads sharing a ring, then they
>>>> can both and in parallel call syscall:files_unregeister. That's
>>>> a potential double percpu_ref_kill(&data->refs), or even worse.
>>>>
>>>> Same for 2, but racing for the table and refs.
>>>
>>> There is a couple of thoughts for this:
>>>
>>> 1. I don't like waiting without holding the lock in general, because
>>> someone can submit more reqs in-between and so indefinitely postponing
>>> the files_unregister.
>> Thanks, Pavel.
>> I thought this issue before, until I saw this in __io_uring_register:
>>
>>    if (io_register_op_must_quiesce(opcode)) {
>>            percpu_ref_kill(&ctx->refs);
> 
> It is different because of this kill, it will prevent submissions.
> 
>>
>>            /*
>>            ¦* Drop uring mutex before waiting for references to exit. If
>>            ¦* another thread is currently inside io_uring_enter() it might
>>            ¦* need to grab the uring_lock to make progress. If we hold it
>>            ¦* here across the drain wait, then we can deadlock. It's safe
>>            ¦* to drop the mutex here, since no new references will come in
>>            ¦* after we've killed the percpu ref.
>>            ¦*/
>>            mutex_unlock(&ctx->uring_lock);
>>            do {
>>                    ret = wait_for_completion_interruptible(&ctx->ref_comp);
>>                    if (!ret)
>>                            break;
>>                    ret = io_run_task_work_sig();
>>                    if (ret < 0)
>>                            break;
>>            } while (1);
>>
>>            mutex_lock(&ctx->uring_lock);
>>
>>            if (ret) {
>>                    percpu_ref_resurrect(&ctx->refs);
>>                    goto out_quiesce;
>>            }
>>    }
>>
>> So now I guess the postponement issue also exits in the above code since
>> there could be another thread submiting reqs to the shared ctx(or we can say uring fd).
>>
>>> 2. I wouldn't want to add checks for that in submission path.
>>>
>>> So, a solution I think about is to wait under the lock, If we need to
>>> run task_works -- briefly drop the lock, run task_works and then do
>>> all unregister all over again. Keep an eye on refs, e.g. probably
>>> need to resurrect it.
>>>
>>> Because we current task is busy nobody submits new requests on
>>> its behalf, and so there can't be infinite number of in-task_work
>>> reqs, and eventually it will just go wait/sleep forever (if not
>>> signalled) under the mutex, so we can a kind of upper bound on
>>> time.
>>>
>> Do you mean sleeping with timeout rather than just sleeping? I think this works, I'll work on this and think about the detail.
> 
> Without timeout -- it will be awaken when new task_works are coming in,
> but Jens knows better.
> 
Ah, gotcha, yes, I think it's because inserting a task_work raises
a signal.
>> But before addressing this issue, Should I first send a patch to just fix the deadlock issue?
> 
> Do you mean the deadlock 2/2 was trying to fix? Or some else? The thread
> is all about fixing it, but doing it right. Not sure there is a need for
> faster but incomplete solution, if that's what you meant.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-06 11:34                 ` Hao Xu
@ 2021-02-07 17:16                   ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2021-02-07 17:16 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 06/02/2021 11:34, Hao Xu wrote:
> I saw percpu_ref_is_dying(&ctx->refs) check in sq thread but not
> in io_uring_enter(), so I guess there could be another thread doing
> io_uring_enter() and submiting sqes.

Ok, looks tryget does it in a different fashion than I recalled.

> So we can just put unlock and lock around io_run_task_work_sig()
> to address the issue 2?

io_set_resource_node()
	-> percpu_ref_get()

If refs are left dying, it won't be safe for in-parallel enters.
tryget or resurrect

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*
  2021-02-05 10:18               ` Pavel Begunkov
  2021-02-06 11:34                 ` Hao Xu
  2021-02-06 16:21                 ` Hao Xu
@ 2021-02-11 13:30                 ` Hao Xu
  2 siblings, 0 replies; 20+ messages in thread
From: Hao Xu @ 2021-02-11 13:30 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/2/5 下午6:18, Pavel Begunkov 写道:
> On 05/02/2021 09:57, Hao Xu wrote:
>> 在 2021/2/4 下午11:26, Pavel Begunkov 写道:
>>> On 04/02/2021 11:17, Pavel Begunkov wrote:
>>>> On 04/02/2021 03:25, Hao Xu wrote:
>>>>> 在 2021/2/4 上午12:45, Pavel Begunkov 写道:
>>>>>> On 03/02/2021 16:35, Pavel Begunkov wrote:
>>>>>>> On 03/02/2021 14:57, Hao Xu wrote:
>>>>>>>> This is caused by calling io_run_task_work_sig() to do work under
>>>>>>>> uring_lock while the caller io_sqe_files_unregister() already held
>>>>>>>> uring_lock.
>>>>>>>> we need to check if uring_lock is held by us when doing unlock around
>>>>>>>> io_run_task_work_sig() since there are code paths down to that place
>>>>>>>> without uring_lock held.
>>>>>>>
>>>>>>> 1. we don't want to allow parallel io_sqe_files_unregister()s
>>>>>>> happening, it's synchronised by uring_lock atm. Otherwise it's
>>>>>>> buggy.
>>>>> Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().
>>>>
>>>> I guess it's to the 1/2, but let me outline the problem again:
>>>> if you have two tasks userspace threads sharing a ring, then they
>>>> can both and in parallel call syscall:files_unregeister. That's
>>>> a potential double percpu_ref_kill(&data->refs), or even worse.
>>>>
>>>> Same for 2, but racing for the table and refs.
>>>
>>> There is a couple of thoughts for this:
>>>
>>> 1. I don't like waiting without holding the lock in general, because
>>> someone can submit more reqs in-between and so indefinitely postponing
>>> the files_unregister.
>> Thanks, Pavel.
>> I thought this issue before, until I saw this in __io_uring_register:
>>
>>    if (io_register_op_must_quiesce(opcode)) {
>>            percpu_ref_kill(&ctx->refs);
> 
> It is different because of this kill, it will prevent submissions.
> 
>>
>>            /*
>>            ¦* Drop uring mutex before waiting for references to exit. If
>>            ¦* another thread is currently inside io_uring_enter() it might
>>            ¦* need to grab the uring_lock to make progress. If we hold it
>>            ¦* here across the drain wait, then we can deadlock. It's safe
>>            ¦* to drop the mutex here, since no new references will come in
>>            ¦* after we've killed the percpu ref.
>>            ¦*/
>>            mutex_unlock(&ctx->uring_lock);
>>            do {
>>                    ret = wait_for_completion_interruptible(&ctx->ref_comp);
>>                    if (!ret)
>>                            break;
>>                    ret = io_run_task_work_sig();
>>                    if (ret < 0)
>>                            break;
>>            } while (1);
>>
>>            mutex_lock(&ctx->uring_lock);
>>
>>            if (ret) {
>>                    percpu_ref_resurrect(&ctx->refs);
>>                    goto out_quiesce;
>>            }
>>    }
>>
>> So now I guess the postponement issue also exits in the above code since
>> there could be another thread submiting reqs to the shared ctx(or we can say uring fd).
>>
>>> 2. I wouldn't want to add checks for that in submission path.
>>>
>>> So, a solution I think about is to wait under the lock, If we need to
>>> run task_works -- briefly drop the lock, run task_works and then do
>>> all unregister all over again. Keep an eye on refs, e.g. probably
>>> need to resurrect it.
>>>
>>> Because we current task is busy nobody submits new requests on
>>> its behalf, and so there can't be infinite number of in-task_work
>>> reqs, and eventually it will just go wait/sleep forever (if not
>>> signalled) under the mutex, so we can a kind of upper bound on
>>> time.
>>>
>> Do you mean sleeping with timeout rather than just sleeping? I think this works, I'll work on this and think about the detail.
> 
> Without timeout -- it will be awaken when new task_works are coming in,
> but Jens knows better.
Isn't this risky, since there could possible be no more task_works
coming in, which causes it sleeps forever?
> 
>> But before addressing this issue, Should I first send a patch to just fix the deadlock issue?
> 
> Do you mean the deadlock 2/2 was trying to fix? Or some else? The thread
> is all about fixing it, but doing it right. Not sure there is a need for
> faster but incomplete solution, if that's what you meant.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-02-11 13:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-03 14:57 [PATCH 0/2] fix deadlock in __io_req_task_submit() Hao Xu
2021-02-03 14:57 ` [PATCH 1/2] io_uring: add uring_lock as an argument to io_sqe_files_unregister() Hao Xu
2021-02-03 16:33   ` Pavel Begunkov
2021-02-04  3:34     ` Hao Xu
2021-02-04 11:11       ` Pavel Begunkov
2021-02-04 14:49         ` Jens Axboe
2021-02-03 14:57 ` [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work* Hao Xu
2021-02-03 16:35   ` Pavel Begunkov
2021-02-03 16:45     ` Pavel Begunkov
2021-02-04  3:25       ` Hao Xu
2021-02-04 11:17         ` Pavel Begunkov
2021-02-04 15:26           ` Pavel Begunkov
2021-02-05  9:57             ` Hao Xu
2021-02-05 10:18               ` Pavel Begunkov
2021-02-06 11:34                 ` Hao Xu
2021-02-07 17:16                   ` Pavel Begunkov
2021-02-06 16:21                 ` Hao Xu
2021-02-11 13:30                 ` Hao Xu
2021-02-05 10:03       ` Hao Xu
2021-02-04 11:33   ` Pavel Begunkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.