io-uring.vger.kernel.org archive mirror
* Race between io_wqe_worker() and io_wqe_wake_worker()
@ 2021-08-03  1:05 Nadav Amit
  2021-08-03 13:22 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2021-08-03  1:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hello Jens,

I encountered an issue, which appears to be a race between io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address this issue and whether I am missing something, since this seems to occur in a common scenario. Your feedback (or fix ;-)) would be appreciated.

I run a workload on 5.13 that issues multiple async read operations that should run concurrently. Some read operations cannot complete for an unbounded time (e.g., read from a pipe that is never written to). The problem is that occasionally another read operation that should complete gets stuck. My understanding, based on debugging and the code, is that the following race (or similar) occurs:


  cpu0					cpu1
  ----					----
					io_wqe_worker()
					 schedule_timeout()
					 // timed out
  io_wqe_enqueue()
   io_wqe_wake_worker()
    // work_flags & IO_WQ_WORK_CONCURRENT
    io_wqe_activate_free_worker()
					 io_worker_exit()


Basically, io_wqe_wake_worker() can find a worker, but this worker is about to exit and is not going to process further work. Once the worker exits, the concurrency level decreases and async work might be blocked behind another work item. I had a look at 5.14, but did not see anything that might address this issue.
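
To make this concrete, here is a minimal liburing-based sketch of the kind of
workload I mean. It is illustrative only (it is not my actual test case, the
file path and queue size are arbitrary, and error handling is omitted); in
practice the pattern has to be repeated in a loop to hit the timing window.
One read targets an empty pipe and can never complete, the other targets a
regular file and should complete promptly; both are forced to io-wq via
IOSQE_ASYNC, so with the race present the file read can end up stuck behind
the pipe read:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	char buf1[4096], buf2[4096];
	int pipefd[2], filefd;

	io_uring_queue_init(8, &ring, 0);
	pipe(pipefd);
	filefd = open("/etc/hostname", O_RDONLY);	/* any readable file */

	/* read that never completes: empty pipe, nobody ever writes to it */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, pipefd[0], buf1, sizeof(buf1), 0);
	sqe->flags |= IOSQE_ASYNC;

	/* read that should complete promptly */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, filefd, buf2, sizeof(buf2), 0);
	sqe->flags |= IOSQE_ASYNC;

	io_uring_submit(&ring);

	/* at least the file read should complete; this hangs when it gets stuck */
	io_uring_wait_cqe(&ring, &cqe);
	printf("got completion, res=%d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}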

Am I missing something?

If not, all my ideas for a solution are either complicated (track required concurrency-level) or relaxed (spawn another worker on io_worker_exit if work_list of unbounded work is not empty).

As said, feedback would be appreciated.

Thanks,
Nadav




* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03  1:05 Race between io_wqe_worker() and io_wqe_wake_worker() Nadav Amit
@ 2021-08-03 13:22 ` Jens Axboe
  2021-08-03 14:37   ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 13:22 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring

On 8/2/21 7:05 PM, Nadav Amit wrote:
> Hello Jens,
> 
> I encountered an issue, which appears to be a race between
> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
> this issue and whether I am missing something, since this seems to
> occur in a common scenario. Your feedback (or fix ;-)) would be
> appreciated.
> 
> I run a workload on 5.13 that issues multiple async read operations
> that should run concurrently. Some read operations cannot complete
> for an unbounded time (e.g., read from a pipe that is never written to).
> The problem is that occasionally another read operation that should
> complete gets stuck. My understanding, based on debugging and the code,
> is that the following race (or similar) occurs:
> 
> 
>   cpu0					cpu1
>   ----					----
> 					io_wqe_worker()
> 					 schedule_timeout()
> 					 // timed out
>   io_wqe_enqueue()
>    io_wqe_wake_worker()
>     // work_flags & IO_WQ_WORK_CONCURRENT
>     io_wqe_activate_free_worker()
> 					 io_worker_exit()
> 
> 
> Basically, io_wqe_wake_worker() can find a worker, but this worker is
> about to exit and is not going to process further work. Once the
> worker exits, the concurrency level decreases and async work might be
> blocked behind another work item. I had a look at 5.14, but did not see
> anything that might address this issue.
> 
> Am I missing something?
> 
> If not, all my ideas for a solution are either complicated (track
> required concurrency-level) or relaxed (spawn another worker on
> io_worker_exit if work_list of unbounded work is not empty).
> 
> As said, feedback would be appreciated.

You are right that there's definitely a race here between checking the
freelist and finding a worker, but that worker is already exiting. Let
me mull over this a bit, I'll post something for you to try later today.

-- 
Jens Axboe



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 13:22 ` Jens Axboe
@ 2021-08-03 14:37   ` Jens Axboe
  2021-08-03 17:25     ` Hao Xu
  2021-08-03 18:04     ` Nadav Amit
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 14:37 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring

On 8/3/21 7:22 AM, Jens Axboe wrote:
> On 8/2/21 7:05 PM, Nadav Amit wrote:
>> Hello Jens,
>>
>> I encountered an issue, which appears to be a race between
>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>> this issue and whether I am missing something, since this seems to
>> occur in a common scenario. Your feedback (or fix ;-)) would be
>> appreciated.
>>
>> I run a workload on 5.13 that issues multiple async read operations
>> that should run concurrently. Some read operations cannot complete
>> for an unbounded time (e.g., read from a pipe that is never written to).
>> The problem is that occasionally another read operation that should
>> complete gets stuck. My understanding, based on debugging and the code,
>> is that the following race (or similar) occurs:
>>
>>
>>   cpu0					cpu1
>>   ----					----
>> 					io_wqe_worker()
>> 					 schedule_timeout()
>> 					 // timed out
>>   io_wqe_enqueue()
>>    io_wqe_wake_worker()
>>     // work_flags & IO_WQ_WORK_CONCURRENT
>>     io_wqe_activate_free_worker()
>> 					 io_worker_exit()
>>
>>
>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>> about to exit and is not going to process further work. Once the
>> worker exits, the concurrency level decreases and async work might be
>> blocked behind another work item. I had a look at 5.14, but did not see
>> anything that might address this issue.
>>
>> Am I missing something?
>>
>> If not, all my ideas for a solution are either complicated (track
>> required concurrency-level) or relaxed (spawn another worker on
>> io_worker_exit if work_list of unbounded work is not empty).
>>
>> As said, feedback would be appreciated.
> 
> You are right that there's definitely a race here between checking the
> freelist and finding a worker, but that worker is already exiting. Let
> me mull over this a bit, I'll post something for you to try later today.

Can you try something like this? Just consider it a first tester, need
to spend a bit more time on it to ensure we fully close the gap.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index cf086b01c6c6..e2da2042ee9e 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -42,6 +42,7 @@ struct io_worker {
 	refcount_t ref;
 	unsigned flags;
 	struct hlist_nulls_node nulls_node;
+	unsigned long exiting;
 	struct list_head all_list;
 	struct task_struct *task;
 	struct io_wqe *wqe;
@@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	struct hlist_nulls_node *n;
 	struct io_worker *worker;
 
-	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
-	if (is_a_nulls(n))
-		return false;
-
-	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
-	if (io_worker_get(worker)) {
-		wake_up_process(worker->task);
+	/*
+	 * Iterate free_list and see if we can find an idle worker to
+	 * activate. If a given worker is on the free_list but in the process
+	 * of exiting, keep trying.
+	 */
+	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
+		if (!io_worker_get(worker))
+			continue;
+		if (!test_bit(0, &worker->exiting)) {
+			wake_up_process(worker->task);
+			io_worker_release(worker);
+			return true;
+		}
 		io_worker_release(worker);
-		return true;
 	}
 
 	return false;
@@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
 		if (ret)
 			continue;
 		/* timed out, exit unless we're the fixed worker */
-		if (!(worker->flags & IO_WORKER_F_FIXED))
+		if (!(worker->flags & IO_WORKER_F_FIXED)) {
+			/*
+			 * Someone elevated our refs, which could be trying
+			 * to re-activate for work. Loop one more time for
+			 * that case.
+			 */
+			if (refcount_read(&worker->ref) != 1)
+				continue;
+			set_bit(0, &worker->exiting);
 			break;
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {

-- 
Jens Axboe



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 14:37   ` Jens Axboe
@ 2021-08-03 17:25     ` Hao Xu
  2021-08-03 18:04     ` Nadav Amit
  1 sibling, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-03 17:25 UTC (permalink / raw)
  To: Jens Axboe, Nadav Amit; +Cc: io-uring

On 8/3/21 10:37 PM, Jens Axboe wrote:
> On 8/3/21 7:22 AM, Jens Axboe wrote:
>> On 8/2/21 7:05 PM, Nadav Amit wrote:
>>> Hello Jens,
>>>
>>> I encountered an issue, which appears to be a race between
>>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>>> this issue and whether I am missing something, since this seems to
>>> occur in a common scenario. Your feedback (or fix ;-)) would be
>>> appreciated.
>>>
>>> I run a workload on 5.13 that issues multiple async read operations
>>> that should run concurrently. Some read operations cannot complete
>>> for an unbounded time (e.g., read from a pipe that is never written to).
>>> The problem is that occasionally another read operation that should
>>> complete gets stuck. My understanding, based on debugging and the code,
>>> is that the following race (or similar) occurs:
>>>
>>>
>>>    cpu0					cpu1
>>>    ----					----
>>> 					io_wqe_worker()
>>> 					 schedule_timeout()
>>> 					 // timed out
>>>    io_wqe_enqueue()
>>>     io_wqe_wake_worker()
>>>      // work_flags & IO_WQ_WORK_CONCURRENT
>>>      io_wqe_activate_free_worker()
>>> 					 io_worker_exit()
>>>
>>>
>>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>>> about to exit and is not going to process further work. Once the
>>> worker exits, the concurrency level decreases and async work might be
>>> blocked behind another work item. I had a look at 5.14, but did not see
>>> anything that might address this issue.
>>>
>>> Am I missing something?
>>>
>>> If not, all my ideas for a solution are either complicated (track
>>> required concurrency-level) or relaxed (spawn another worker on
>>> io_worker_exit if work_list of unbounded work is not empty).
>>>
>>> As said, feedback would be appreciated.
>>
>> You are right that there's definitely a race here between checking the
>> freelist and finding a worker, but that worker is already exiting. Let
>> me mull over this a bit, I'll post something for you to try later today.
> 
> Can you try something like this? Just consider it a first tester, need
> to spend a bit more time on it to ensure we fully close the gap.
> 
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index cf086b01c6c6..e2da2042ee9e 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -42,6 +42,7 @@ struct io_worker {
>   	refcount_t ref;
>   	unsigned flags;
>   	struct hlist_nulls_node nulls_node;
> +	unsigned long exiting;
>   	struct list_head all_list;
>   	struct task_struct *task;
>   	struct io_wqe *wqe;
> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
>   	struct hlist_nulls_node *n;
>   	struct io_worker *worker;
>   
> -	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
> -	if (is_a_nulls(n))
> -		return false;
> -
> -	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
> -	if (io_worker_get(worker)) {
> -		wake_up_process(worker->task);
> +	/*
> +	 * Iterate free_list and see if we can find an idle worker to
> +	 * activate. If a given worker is on the free_list but in the process
> +	 * of exiting, keep trying.
> +	 */
> +	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
> +		if (!io_worker_get(worker))
> +			continue;
> +		if (!test_bit(0, &worker->exiting)) {
> +			wake_up_process(worker->task);
> +			io_worker_release(worker);
> +			return true;
> +		}
>   		io_worker_release(worker);
> -		return true;
>   	}
>   
>   	return false;
> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
>   		if (ret)
>   			continue;
>   		/* timed out, exit unless we're the fixed worker */
> -		if (!(worker->flags & IO_WORKER_F_FIXED))
> +		if (!(worker->flags & IO_WORKER_F_FIXED)) {
> +			/*
> +			 * Someone elevated our refs, which could be trying
> +			 * to re-activate for work. Loop one more time for
> +			 * that case.
> +			 */
> +			if (refcount_read(&worker->ref) != 1)
> +				continue;
> +			set_bit(0, &worker->exiting);
>   			break;
> +		}
>   	}
>   
>   	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
> 

The refcount check may not be enough; we may need another bit, worker->in_use,
and:
     io_wqe_activate_free_worker                io_wqe_worker

      set_bit(worker->in_use)               set_bit(worker->exiting)
      !test_bit(worker->exiting)            test_bit(worker->in_use)
      wake_up(worker)                       goto handle work


* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 14:37   ` Jens Axboe
  2021-08-03 17:25     ` Hao Xu
@ 2021-08-03 18:04     ` Nadav Amit
  2021-08-03 18:14       ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2021-08-03 18:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Hao Xu



> On Aug 3, 2021, at 7:37 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 8/3/21 7:22 AM, Jens Axboe wrote:
>> On 8/2/21 7:05 PM, Nadav Amit wrote:
>>> Hello Jens,
>>> 
>>> I encountered an issue, which appears to be a race between
>>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>>> this issue and whether I am missing something, since this seems to
>>> occur in a common scenario. Your feedback (or fix ;-)) would be
>>> appreciated.
>>> 
>>> I run a workload on 5.13 that issues multiple async read operations
>>> that should run concurrently. Some read operations cannot complete
>>> for an unbounded time (e.g., read from a pipe that is never written to).
>>> The problem is that occasionally another read operation that should
>>> complete gets stuck. My understanding, based on debugging and the code,
>>> is that the following race (or similar) occurs:
>>> 
>>> 
>>>  cpu0					cpu1
>>>  ----					----
>>> 					io_wqe_worker()
>>> 					 schedule_timeout()
>>> 					 // timed out
>>>  io_wqe_enqueue()
>>>   io_wqe_wake_worker()
>>>    // work_flags & IO_WQ_WORK_CONCURRENT
>>>    io_wqe_activate_free_worker()
>>> 					 io_worker_exit()
>>> 
>>> 
>>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>>> about to exit and is not going to process further work. Once the
>>> worker exits, the concurrency level decreases and async work might be
>>> blocked behind another work item. I had a look at 5.14, but did not see
>>> anything that might address this issue.
>>> 
>>> Am I missing something?
>>> 
>>> If not, all my ideas for a solution are either complicated (track
>>> required concurrency-level) or relaxed (spawn another worker on
>>> io_worker_exit if work_list of unbounded work is not empty).
>>> 
>>> As said, feedback would be appreciated.
>> 
>> You are right that there's definitely a race here between checking the
>> freelist and finding a worker, but that worker is already exiting. Let
>> me mull over this a bit, I'll post something for you to try later today.
> 
> Can you try something like this? Just consider it a first tester, need
> to spend a bit more time on it to ensure we fully close the gap.

Thanks for the quick response.

I tried your version. It works better, but my workload still gets stuck
occasionally (less frequently though). It is pretty obvious that the
version you sent still has a race, so I didn’t put the effort into
debugging it.

I should note that I have an ugly hack that does make my test pass. I
include it, although it is obviously not the right solution.



diff --git a/fs/io-wq.c b/fs/io-wq.c
index b3e8624a37d0..a8792809e416 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -165,6 +165,17 @@ static void io_worker_ref_put(struct io_wq *wq)
                complete(&wq->worker_done);
 }
 
+static void io_queue_worker_create(struct io_wqe *wqe, struct io_wqe_acct *acct);
+
+static inline bool io_wqe_run_queue(struct io_wqe *wqe)
+       __must_hold(wqe->lock)
+{
+       if (!wq_list_empty(&wqe->work_list) &&
+           !(wqe->flags & IO_WQE_FLAG_STALLED))
+               return true;
+       return false;
+}
+
 static void io_worker_exit(struct io_worker *worker)
 {
        struct io_wqe *wqe = worker->wqe;
@@ -192,17 +203,17 @@ static void io_worker_exit(struct io_worker *worker)
        raw_spin_unlock_irq(&wqe->lock);
 
        kfree_rcu(worker, rcu);
+       raw_spin_lock_irq(&wqe->lock);
+
+       if (!(flags & IO_WORKER_F_BOUND) && io_wqe_run_queue(wqe)) {
+               atomic_inc(&acct->nr_running);
+               atomic_inc(&wqe->wq->worker_refs);
+               io_queue_worker_create(wqe, acct);
+       }
        io_worker_ref_put(wqe->wq);
-       do_exit(0);
-}
 
-static inline bool io_wqe_run_queue(struct io_wqe *wqe)
-       __must_hold(wqe->lock)
-{
-       if (!wq_list_empty(&wqe->work_list) &&
-           !(wqe->flags & IO_WQE_FLAG_STALLED))
-               return true;
-       return false;
+       raw_spin_unlock_irq(&wqe->lock);
+       do_exit(0);
 }
 
 /*



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 18:04     ` Nadav Amit
@ 2021-08-03 18:14       ` Jens Axboe
  2021-08-03 19:20         ` Nadav Amit
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 18:14 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring, Hao Xu

On 8/3/21 12:04 PM, Nadav Amit wrote:
> 
> 
>> On Aug 3, 2021, at 7:37 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 8/3/21 7:22 AM, Jens Axboe wrote:
>>> On 8/2/21 7:05 PM, Nadav Amit wrote:
>>>> Hello Jens,
>>>>
>>>> I encountered an issue, which appears to be a race between
>>>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>>>> this issue and whether I am missing something, since this seems to
>>>> occur in a common scenario. Your feedback (or fix ;-)) would be
>>>> appreciated.
>>>>
>>>> I run a workload on 5.13 that issues multiple async read operations
>>>> that should run concurrently. Some read operations cannot complete
>>>> for an unbounded time (e.g., read from a pipe that is never written to).
>>>> The problem is that occasionally another read operation that should
>>>> complete gets stuck. My understanding, based on debugging and the code,
>>>> is that the following race (or similar) occurs:
>>>>
>>>>
>>>>  cpu0					cpu1
>>>>  ----					----
>>>> 					io_wqe_worker()
>>>> 					 schedule_timeout()
>>>> 					 // timed out
>>>>  io_wqe_enqueue()
>>>>   io_wqe_wake_worker()
>>>>    // work_flags & IO_WQ_WORK_CONCURRENT
>>>>    io_wqe_activate_free_worker()
>>>> 					 io_worker_exit()
>>>>
>>>>
>>>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>>>> about to exit and is not going to process further work. Once the
>>>> worker exits, the concurrency level decreases and async work might be
>>>> blocked behind another work item. I had a look at 5.14, but did not see
>>>> anything that might address this issue.
>>>>
>>>> Am I missing something?
>>>>
>>>> If not, all my ideas for a solution are either complicated (track
>>>> required concurrency-level) or relaxed (spawn another worker on
>>>> io_worker_exit if work_list of unbounded work is not empty).
>>>>
>>>> As said, feedback would be appreciated.
>>>
>>> You are right that there's definitely a race here between checking the
>>> freelist and finding a worker, but that worker is already exiting. Let
>>> me mull over this a bit, I'll post something for you to try later today.
>>
>> Can you try something like this? Just consider it a first tester, need
>> to spend a bit more time on it to ensure we fully close the gap.
> 
> Thanks for the quick response.
> 
> I tried your version. It works better, but my workload still gets stuck
> occasionally (less frequently though). It is pretty obvious that the
> version you sent still has a race, so I didn’t put the effort into
> debugging it.

All good, thanks for testing! Is it a test case you can share? Would
help with confidence in the final solution.

> I should note that I have an ugly hack that does make my test pass. I
> include it, although it is obviously not the right solution.

Thanks, I'll take a look.

-- 
Jens Axboe



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 18:14       ` Jens Axboe
@ 2021-08-03 19:20         ` Nadav Amit
  2021-08-03 19:24           ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2021-08-03 19:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Hao Xu



> On Aug 3, 2021, at 11:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 8/3/21 12:04 PM, Nadav Amit wrote:
>> 
>> 
>> Thanks for the quick response.
>> 
>> I tried your version. It works better, but my workload still gets stuck
>> occasionally (less frequently though). It is pretty obvious that the
>> version you sent still has a race, so I didn’t put the effort into
>> debugging it.
> 
> All good, thanks for testing! Is it a test case you can share? Would
> help with confidence in the final solution.

Unfortunately no, since it is an entire WIP project that I am working
on (with undetermined license at this point). But I will be happy to
test any solution that you provide.



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 19:20         ` Nadav Amit
@ 2021-08-03 19:24           ` Jens Axboe
  2021-08-03 19:53             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 19:24 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring, Hao Xu

On 8/3/21 1:20 PM, Nadav Amit wrote:
> 
> 
>> On Aug 3, 2021, at 11:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 8/3/21 12:04 PM, Nadav Amit wrote:
>>>
>>>
>>> Thanks for the quick response.
>>>
>>> I tried your version. It works better, but my workload still gets stuck
>>> occasionally (less frequently though). It is pretty obvious that the
>>> version you sent still has a race, so I didn’t put the effort into
>>> debugging it.
>>
>> All good, thanks for testing! Is it a test case you can share? Would
>> help with confidence in the final solution.
> 
> Unfortunately no, since it is an entire WIP project that I am working
> on (with undetermined license at this point). But I will be happy to
> test any solution that you provide.

OK no worries, I'll see if I can tighten this up. I don't particularly
hate your solution, it would just be nice to avoid creating a new worker
if we can just keep running the current one.

I'll toss something your way in a bit...

-- 
Jens Axboe



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 19:24           ` Jens Axboe
@ 2021-08-03 19:53             ` Jens Axboe
  2021-08-03 21:16               ` Nadav Amit
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 19:53 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring, Hao Xu

On 8/3/21 1:24 PM, Jens Axboe wrote:
> On 8/3/21 1:20 PM, Nadav Amit wrote:
>>
>>
>>> On Aug 3, 2021, at 11:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 8/3/21 12:04 PM, Nadav Amit wrote:
>>>>
>>>>
>>>> Thanks for the quick response.
>>>>
>>>> I tried your version. It works better, but my workload still gets stuck
>>>> occasionally (less frequently though). It is pretty obvious that the
>>>> version you sent still has a race, so I didn’t put the effort into
>>>> debugging it.
>>>
>>> All good, thanks for testing! Is it a test case you can share? Would
>>> help with confidence in the final solution.
>>
>> Unfortunately no, since it is an entire WIP project that I am working
>> on (with undetermined license at this point). But I will be happy to
>> test any solution that you provide.
> 
> OK no worries, I'll see if I can tighten this up. I don't particularly
> hate your solution, it would just be nice to avoid creating a new worker
> if we can just keep running the current one.
> 
> I'll toss something your way in a bit...

How about this? I think this largely stems from the fact that we only
do a partial running decrement on exit. Left the previous checks in
place as well, as it will reduce the number of times that we do need
to hit that case.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index cf086b01c6c6..f072995d382b 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -35,12 +35,17 @@ enum {
 	IO_WQE_FLAG_STALLED	= 1,	/* stalled on hash */
 };
 
+enum {
+	IO_WORKER_EXITING	= 0,	/* worker is exiting */
+};
+
 /*
  * One for each thread in a wqe pool
  */
 struct io_worker {
 	refcount_t ref;
 	unsigned flags;
+	unsigned long state;
 	struct hlist_nulls_node nulls_node;
 	struct list_head all_list;
 	struct task_struct *task;
@@ -130,6 +135,7 @@ struct io_cb_cancel_data {
 };
 
 static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
+static void io_wqe_dec_running(struct io_worker *worker);
 
 static bool io_worker_get(struct io_worker *worker)
 {
@@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
 {
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
-	unsigned flags;
 
 	if (refcount_dec_and_test(&worker->ref))
 		complete(&worker->ref_done);
 	wait_for_completion(&worker->ref_done);
 
-	preempt_disable();
-	current->flags &= ~PF_IO_WORKER;
-	flags = worker->flags;
-	worker->flags = 0;
-	if (flags & IO_WORKER_F_RUNNING)
-		atomic_dec(&acct->nr_running);
-	worker->flags = 0;
-	preempt_enable();
-
 	raw_spin_lock_irq(&wqe->lock);
-	if (flags & IO_WORKER_F_FREE)
+	if (worker->flags & IO_WORKER_F_FREE)
 		hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
 	acct->nr_workers--;
+	preempt_disable();
+	io_wqe_dec_running(worker);
+	worker->flags = 0;
+	current->flags &= ~PF_IO_WORKER;
+	preempt_enable();
 	raw_spin_unlock_irq(&wqe->lock);
 
 	kfree_rcu(worker, rcu);
@@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	struct hlist_nulls_node *n;
 	struct io_worker *worker;
 
-	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
-	if (is_a_nulls(n))
-		return false;
-
-	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
-	if (io_worker_get(worker)) {
-		wake_up_process(worker->task);
+	/*
+	 * Iterate free_list and see if we can find an idle worker to
+	 * activate. If a given worker is on the free_list but in the process
+	 * of exiting, keep trying.
+	 */
+	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
+		if (!io_worker_get(worker))
+			continue;
+		if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
+			wake_up_process(worker->task);
+			io_worker_release(worker);
+			return true;
+		}
 		io_worker_release(worker);
-		return true;
 	}
 
 	return false;
@@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
 		if (ret)
 			continue;
 		/* timed out, exit unless we're the fixed worker */
-		if (!(worker->flags & IO_WORKER_F_FIXED))
+		if (!(worker->flags & IO_WORKER_F_FIXED)) {
+			/*
+			 * Someone elevated our refs, which could be trying
+			 * to re-activate for work. Loop one more time for
+			 * that case.
+			 */
+			if (refcount_read(&worker->ref) != 1)
+				continue;
+			set_bit(IO_WORKER_EXITING, &worker->state);
 			break;
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {

-- 
Jens Axboe



* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 19:53             ` Jens Axboe
@ 2021-08-03 21:16               ` Nadav Amit
  2021-08-03 21:25                 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2021-08-03 21:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Hao Xu



> On Aug 3, 2021, at 12:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> How about this? I think this largely stems from the fact that we only
> do a partial running decrement on exit. Left the previous checks in
> place as well, as it will reduce the number of times that we do need
> to hit that case.

It did not apply cleanly on my 5.13, but after I cleaned it, it still
got stuck (more frequently than when I used your previous solution).

I do not see how the problem is related to the partial running decrement.
Thinking about it, I think the problem might even happen if
multiple calls to io_wqe_activate_free_worker() wake up the same worker,
not realizing that they race (since __io_worker_busy() has still not
been called by io_worker_handle_work()).
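
For illustration, that scenario would look something like this (same
notation as the diagram in my first message):

  cpu0					cpu1
  ----					----
  io_wqe_enqueue(work A)
   io_wqe_activate_free_worker()
    // finds worker W on free_list
    // and wakes it
					io_wqe_enqueue(work B)
					 io_wqe_activate_free_worker()
					  // W has not called
					  // __io_worker_busy() yet, so it
					  // is still on free_list and gets
					  // "activated" a second time
  // only W actually runs; A and B end up serialized behind each other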

Anyhow, I think there are a few problems in the patch you sent. Once I
addressed a couple of problems, my test passes, but I am not sure you
actually want the final result, and I am not sure it is robust/correct.

See my comments below for the changes I added and other questions I
have (you can answer only if you have time).

> 
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index cf086b01c6c6..f072995d382b 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -35,12 +35,17 @@ enum {
> 	IO_WQE_FLAG_STALLED	= 1,	/* stalled on hash */
> };
> 
> +enum {
> +	IO_WORKER_EXITING	= 0,	/* worker is exiting */
> +};
> +
> /*
>  * One for each thread in a wqe pool
>  */
> struct io_worker {
> 	refcount_t ref;
> 	unsigned flags;
> +	unsigned long state;
> 	struct hlist_nulls_node nulls_node;
> 	struct list_head all_list;
> 	struct task_struct *task;
> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
> };
> 
> static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
> +static void io_wqe_dec_running(struct io_worker *worker);
> 
> static bool io_worker_get(struct io_worker *worker)
> {
> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
> {
> 	struct io_wqe *wqe = worker->wqe;
> 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
> -	unsigned flags;
> 
> 	if (refcount_dec_and_test(&worker->ref))
> 		complete(&worker->ref_done);
> 	wait_for_completion(&worker->ref_done);
> 
> -	preempt_disable();
> -	current->flags &= ~PF_IO_WORKER;
> -	flags = worker->flags;
> -	worker->flags = 0;
> -	if (flags & IO_WORKER_F_RUNNING)
> -		atomic_dec(&acct->nr_running);
> -	worker->flags = 0;
> -	preempt_enable();
> -
> 	raw_spin_lock_irq(&wqe->lock);
> -	if (flags & IO_WORKER_F_FREE)
> +	if (worker->flags & IO_WORKER_F_FREE)
> 		hlist_nulls_del_rcu(&worker->nulls_node);
> 	list_del_rcu(&worker->all_list);
> 	acct->nr_workers--;
> +	preempt_disable();
> +	io_wqe_dec_running(worker);

IIUC, in the scenario I encountered, acct->nr_running might be non-zero,
but still a new worker would be needed. So the check in io_wqe_dec_running()
is insufficient to spawn a new worker at this point, no?
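
For reference, the check in question is roughly the following. This is
paraphrased from my reading of the 5.13 io-wq code, not quoted verbatim:

static void io_wqe_dec_running(struct io_worker *worker)
{
	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
	struct io_wqe *wqe = worker->wqe;

	if (!(worker->flags & IO_WORKER_F_RUNNING))
		return;

	/* only the last running worker considers creating a new one */
	if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
		atomic_inc(&acct->nr_running);
		atomic_inc(&wqe->wq->worker_refs);
		io_queue_worker_create(wqe, acct);
	}
}

That is, a new worker is only considered once nr_running drops to zero,
which is what I am getting at above.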

> +	worker->flags = 0;
> +	current->flags &= ~PF_IO_WORKER;
> +	preempt_enable();
> 	raw_spin_unlock_irq(&wqe->lock);
> 
> 	kfree_rcu(worker, rcu);
> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
> 	struct hlist_nulls_node *n;
> 	struct io_worker *worker;
> 
> -	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
> -	if (is_a_nulls(n))
> -		return false;
> -
> -	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
> -	if (io_worker_get(worker)) {
> -		wake_up_process(worker->task);
> +	/*
> +	 * Iterate free_list and see if we can find an idle worker to
> +	 * activate. If a given worker is on the free_list but in the process
> +	 * of exiting, keep trying.
> +	 */
> +	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
> +		if (!io_worker_get(worker))
> +			continue;

Presumably you want to rely on the order between io_worker_get(), i.e.
the refcount_inc_not_zero() and the test_bit(). I guess no memory-barrier
is needed here (since refcount_inc_not_zero() returns a value) but
documentation would help. Anyhow, I do not see how it helps.

> +		if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
> +			wake_up_process(worker->task);

So this might be the main problem. The worker might be in between waking
and setting IO_WORKER_EXITING. One option (that I tried and works, at
least in limited testing), is to look whether the process was actually
woken according to the return value of wake_up_process() and not to
use workers that were not actually woken.

So I changed it to:
                        if (wake_up_process(worker->task)) {
                                io_worker_release(worker);
                                return true;
                        }


> +			io_worker_release(worker);

The refcount is decreased, so the refcount_read in io_wqe_worker()
would not see the elevated refcount. No?

> +			return true;
> +		}
> 		io_worker_release(worker);
> -		return true;
> 	}
> 
> 	return false;
> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
> 		if (ret)
> 			continue;
> 		/* timed out, exit unless we're the fixed worker */
> -		if (!(worker->flags & IO_WORKER_F_FIXED))
> +		if (!(worker->flags & IO_WORKER_F_FIXED)) {
> +			/*
> +			 * Someone elevated our refs, which could be trying
> +			 * to re-activate for work. Loop one more time for
> +			 * that case.
> +			 */
> +			if (refcount_read(&worker->ref) != 1)
> +				continue;

I am not sure what it serves, as the refcount is decreased in
io_wqe_activate_free_worker() right after wake_up_process().

Anyhow, presumably you need smp_mb__before_atomic() here, no? I added
one. Yet, without the check on the wake_up_process() return value this
still seems broken.

> +			set_bit(IO_WORKER_EXITING, &worker->state);
> 			break;
> +		}
> 	}
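
To be explicit, the placement I have in mind is something like the below.
This is only a sketch; whether smp_mb__before_atomic() is the right (or a
sufficient) barrier here is exactly the open question:

		/* timed out, exit unless we're the fixed worker */
		if (!(worker->flags & IO_WORKER_F_FIXED)) {
			if (refcount_read(&worker->ref) != 1)
				continue;
			/* order the ref check against publishing the exit */
			smp_mb__before_atomic();
			set_bit(IO_WORKER_EXITING, &worker->state);
			break;
		}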




* Re: Race between io_wqe_worker() and io_wqe_wake_worker()
  2021-08-03 21:16               ` Nadav Amit
@ 2021-08-03 21:25                 ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2021-08-03 21:25 UTC (permalink / raw)
  To: Nadav Amit; +Cc: io-uring, Hao Xu

On 8/3/21 3:16 PM, Nadav Amit wrote:
> 
> 
>> On Aug 3, 2021, at 12:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> How about this? I think this largely stems from the fact that we only
>> do a partial running decrement on exit. Left the previous checks in
>> place as well, as it will reduce the number of times that we do need
>> to hit that case.
> 
> It did not apply cleanly on my 5.13, but after I cleaned it, it still
> got stuck (more frequently than when I used your previous solution).
> 
> I do not see how the problem is related to the partial running decrement.
> Thinking about it, I think the problem might even happen if
> multiple calls to io_wqe_activate_free_worker() wake up the same worker,
> not realizing that they race (since __io_worker_busy() has still not
> been called by io_worker_handle_work()).

That's actually by design for io-wq in general, we assume that the work
won't block, and in that case we only want to activate the one worker.

> Anyhow, I think there are a few problems in the patch you sent. Once I
> addressed a couple of problems, my test passes, but I am not sure you
> actually want the final result, and I am not sure it is robust/correct.
> 
> See my comments below for the changes I added and other questions I
> have (you can answer only if you have time).
> 
>>
>>
>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>> index cf086b01c6c6..f072995d382b 100644
>> --- a/fs/io-wq.c
>> +++ b/fs/io-wq.c
>> @@ -35,12 +35,17 @@ enum {
>> 	IO_WQE_FLAG_STALLED	= 1,	/* stalled on hash */
>> };
>>
>> +enum {
>> +	IO_WORKER_EXITING	= 0,	/* worker is exiting */
>> +};
>> +
>> /*
>>  * One for each thread in a wqe pool
>>  */
>> struct io_worker {
>> 	refcount_t ref;
>> 	unsigned flags;
>> +	unsigned long state;
>> 	struct hlist_nulls_node nulls_node;
>> 	struct list_head all_list;
>> 	struct task_struct *task;
>> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
>> };
>>
>> static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
>> +static void io_wqe_dec_running(struct io_worker *worker);
>>
>> static bool io_worker_get(struct io_worker *worker)
>> {
>> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
>> {
>> 	struct io_wqe *wqe = worker->wqe;
>> 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
>> -	unsigned flags;
>>
>> 	if (refcount_dec_and_test(&worker->ref))
>> 		complete(&worker->ref_done);
>> 	wait_for_completion(&worker->ref_done);
>>
>> -	preempt_disable();
>> -	current->flags &= ~PF_IO_WORKER;
>> -	flags = worker->flags;
>> -	worker->flags = 0;
>> -	if (flags & IO_WORKER_F_RUNNING)
>> -		atomic_dec(&acct->nr_running);
>> -	worker->flags = 0;
>> -	preempt_enable();
>> -
>> 	raw_spin_lock_irq(&wqe->lock);
>> -	if (flags & IO_WORKER_F_FREE)
>> +	if (worker->flags & IO_WORKER_F_FREE)
>> 		hlist_nulls_del_rcu(&worker->nulls_node);
>> 	list_del_rcu(&worker->all_list);
>> 	acct->nr_workers--;
>> +	preempt_disable();
>> +	io_wqe_dec_running(worker);
> 
> IIUC, in the scenario I encountered, acct->nr_running might be non-zero,
> but still a new worker would be needed. So the check in io_wqe_dec_running()
> is insufficient to spawn a new worker at this point, no?

If nr_running != 0, then we have active workers. They will either
complete the work they have without blocking, or if they block, then
we'll create a new one. So it really should be enough, I'm a bit
puzzled...

>> +	worker->flags = 0;
>> +	current->flags &= ~PF_IO_WORKER;
>> +	preempt_enable();
>> 	raw_spin_unlock_irq(&wqe->lock);
>>
>> 	kfree_rcu(worker, rcu);
>> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
>> 	struct hlist_nulls_node *n;
>> 	struct io_worker *worker;
>>
>> -	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
>> -	if (is_a_nulls(n))
>> -		return false;
>> -
>> -	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
>> -	if (io_worker_get(worker)) {
>> -		wake_up_process(worker->task);
>> +	/*
>> +	 * Iterate free_list and see if we can find an idle worker to
>> +	 * activate. If a given worker is on the free_list but in the process
>> +	 * of exiting, keep trying.
>> +	 */
>> +	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
>> +		if (!io_worker_get(worker))
>> +			continue;
> 
> Presumably you want to rely on the order between io_worker_get(), i.e.
> the refcount_inc_not_zero() and the test_bit(). I guess no memory-barrier
> is needed here (since refcount_inc_not_zero() returns a value) but
> documentation would help. Anyhow, I do not see how it helps.

Right, no extra barriers needed.

>> +		if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
>> +			wake_up_process(worker->task);
> 
> So this might be the main problem. The worker might be in between waking
> and setting IO_WORKER_EXITING. One option (that I tried and works, at
> least in limited testing), is to look whether the process was actually
> woken according to the return value of wake_up_process() and not to
> use workers that were not actually woken.
> 
> So I changed it to:
>                         if (wake_up_process(worker->task)) {
>                                 io_worker_release(worker);
>                                 return true;
>                         }
> 
> 
>> +			io_worker_release(worker);
> 
> The refcount is decreased, so the refcount_read in io_wqe_worker()
> would not see the elevated refcount. No?

That's probably not a bad idea, though not quite sure that'd always be
safe. I'm going to need to look deeper, because we really should not
have a lot of concurrent activity here in terms of multiple issuers
looking up free workers and activating them.

Can you share a bit about what the workload looks like? That might help
create a reproducer, which would be handy going forward as well.

>> +			return true;
>> +		}
>> 		io_worker_release(worker);
>> -		return true;
>> 	}
>>
>> 	return false;
>> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
>> 		if (ret)
>> 			continue;
>> 		/* timed out, exit unless we're the fixed worker */
>> -		if (!(worker->flags & IO_WORKER_F_FIXED))
>> +		if (!(worker->flags & IO_WORKER_F_FIXED)) {
>> +			/*
>> +			 * Someone elevated our refs, which could be trying
>> +			 * to re-activate for work. Loop one more time for
>> +			 * that case.
>> +			 */
>> +			if (refcount_read(&worker->ref) != 1)
>> +				continue;
> 
> I am not sure what it serves, as the refcount is decreased in
> io_wqe_activate_free_worker() right after wake_up_process().

It should just go away I think, it'd be better to cut the patch down to
the functional part.

> Anyhow, presumably you need smp_mb__before_atomic() here, no? I added
> one. Yet, without the check on the wake_up_process() return value this
> still seems broken.

Yes, it would need that.

-- 
Jens Axboe



Thread overview: 11+ messages
2021-08-03  1:05 Race between io_wqe_worker() and io_wqe_wake_worker() Nadav Amit
2021-08-03 13:22 ` Jens Axboe
2021-08-03 14:37   ` Jens Axboe
2021-08-03 17:25     ` Hao Xu
2021-08-03 18:04     ` Nadav Amit
2021-08-03 18:14       ` Jens Axboe
2021-08-03 19:20         ` Nadav Amit
2021-08-03 19:24           ` Jens Axboe
2021-08-03 19:53             ` Jens Axboe
2021-08-03 21:16               ` Nadav Amit
2021-08-03 21:25                 ` Jens Axboe
