linux-kernel.vger.kernel.org archive mirror
* [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
@ 2018-08-23 13:08 Jianchao Wang
  2018-08-23 15:37 ` Jens Axboe
  2018-08-23 15:42 ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Jianchao Wang @ 2018-08-23 13:08 UTC
  To: axboe; +Cc: linux-block, linux-kernel

Commit 2887e41 ("blk-wbt: Avoid lock contention and thundering herd
issue in wbt_wait") introduces two cases that can miss a wakeup:
 - __wbt_done only wakes up one waiter at a time. There could be
   multiple waiters while (limit - inflight) > 1 at that moment.

 - When a waiter is woken up, it is still on the wait queue and sets
   itself back to TASK_UNINTERRUPTIBLE immediately, so it can be
   woken up one more time. If another __wbt_done comes in and wakes
   it again, that wakeup is wasted on the previous waiter.

To fix both cases without introducing too much lock contention, we
introduce our own wake-up function, wbt_wake_function, in __wbt_wait
and use wake_up_all in __wbt_done. wbt_wake_function first tries to
get wbt budget; if that succeeds, it wakes up the process, otherwise
it returns -1 to interrupt the wake-up loop.
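
For reference, the loop this interrupts lives in __wake_up_common()
in kernel/sched/wait.c; a simplified sketch of its shape (locking and
bookmark handling elided, not the verbatim kernel code):

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		if (ret < 0)	/* e.g. wbt_wake_function returning -1 */
			break;	/* stop scanning the remaining waiters */
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;	/* woke enough exclusive waiters */
	}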

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
Cc: Anchal Agarwal <anchalag@amazon.com>
Cc: Frank van der Linden <fllinden@amazon.com>
---
 block/blk-wbt.c | 78 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 24 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c9358f1..2667590 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+			wake_up_all(&rqw->wait);
 	}
 }
 
@@ -481,6 +481,40 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+		int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = curr->private;
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	wake_up_process(data->curr);
+
+	list_del_init(&curr->entry);
+	return 1;
+}
+
+static inline void wbt_init_wait(struct wait_queue_entry *wait,
+		struct wbt_wait_data *data)
+{
+	INIT_LIST_HEAD(&wait->entry);
+	wait->flags = 0;
+	wait->func = wbt_wake_function;
+	wait->private = data;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +525,27 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
-	bool has_sleeper;
-
-	has_sleeper = wq_has_sleeper(&rqw->wait);
-	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+	struct wait_queue_entry wait;
+	struct wbt_wait_data data = {
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
+
+	if (!wq_has_sleeper(&rqw->wait) &&
+			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-			break;
-
-		if (lock) {
-			spin_unlock_irq(lock);
-			io_schedule();
-			spin_lock_irq(lock);
-		} else
-			io_schedule();
-		has_sleeper = false;
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+	wbt_init_wait(&wait, &data);
+	prepare_to_wait_exclusive(&rqw->wait, &wait,
+			TASK_UNINTERRUPTIBLE);
+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else
+		io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.7.4



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 13:08 [PATCH] blk-wbt: get back the missed wakeup from __wbt_done Jianchao Wang
@ 2018-08-23 15:37 ` Jens Axboe
  2018-08-23 16:24   ` van der Linden, Frank
  2018-08-23 15:42 ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-23 15:37 UTC
  To: Jianchao Wang; +Cc: linux-block, linux-kernel, anchalag, van der Linden, Frank

On 8/23/18 7:08 AM, Jianchao Wang wrote:
> Commit 2887e41 ("blk-wbt: Avoid lock contention and thundering herd
> issue in wbt_wait") introduces two cases that can miss a wakeup:
>  - __wbt_done only wakes up one waiter at a time. There could be
>    multiple waiters while (limit - inflight) > 1 at that moment.
> 
>  - When a waiter is woken up, it is still on the wait queue and sets
>    itself back to TASK_UNINTERRUPTIBLE immediately, so it can be
>    woken up one more time. If another __wbt_done comes in and wakes
>    it again, that wakeup is wasted on the previous waiter.
> 
> To fix both cases without introducing too much lock contention, we
> introduce our own wake-up function, wbt_wake_function, in __wbt_wait
> and use wake_up_all in __wbt_done. wbt_wake_function first tries to
> get wbt budget; if that succeeds, it wakes up the process, otherwise
> it returns -1 to interrupt the wake-up loop.

I really like this approach, since it'll naturally wake up as many
waiters as we can, instead of either doing single wakeups or waking
everyone.

BTW, you added Anchal and Frank to the CC in the patch, but they
are not actually CC'ed. Doing that now - can you guys give this
a whirl? It should still solve the thundering herd issue, but be
closer to the original logic in terms of wakeups. Actually better,
since we remain on the list and remain ordered.

Leaving the patch below for you guys.

> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
> Cc: Anchal Agarwal <anchalag@amazon.com>
> Cc: Frank van der Linden <fllinden@amazon.com>
> ---
>  block/blk-wbt.c | 78 +++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 54 insertions(+), 24 deletions(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index c9358f1..2667590 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
>  		int diff = limit - inflight;
>  
>  		if (!inflight || diff >= rwb->wb_background / 2)
> -			wake_up(&rqw->wait);
> +			wake_up_all(&rqw->wait);
>  	}
>  }
>  
> @@ -481,6 +481,40 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
>  	return limit;
>  }
>  
> +struct wbt_wait_data {
> +	struct task_struct *curr;
> +	struct rq_wb *rwb;
> +	struct rq_wait *rqw;
> +	unsigned long rw;
> +};
> +
> +static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
> +		int wake_flags, void *key)
> +{
> +	struct wbt_wait_data *data = curr->private;
> +
> +	/*
> +	 * If we fail to get a budget, return -1 to interrupt the wake up
> +	 * loop in __wake_up_common.
> +	 */
> +	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
> +		return -1;
> +
> +	wake_up_process(data->curr);
> +
> +	list_del_init(&curr->entry);
> +	return 1;
> +}
> +
> +static inline void wbt_init_wait(struct wait_queue_entry *wait,
> +		struct wbt_wait_data *data)
> +{
> +	INIT_LIST_HEAD(&wait->entry);
> +	wait->flags = 0;
> +	wait->func = wbt_wake_function;
> +	wait->private = data;
> +}
> +
>  /*
>   * Block if we will exceed our limit, or if we are currently waiting for
>   * the timer to kick off queuing again.
> @@ -491,31 +525,27 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  	__acquires(lock)
>  {
>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
> -	DECLARE_WAITQUEUE(wait, current);
> -	bool has_sleeper;
> -
> -	has_sleeper = wq_has_sleeper(&rqw->wait);
> -	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> +	struct wait_queue_entry wait;
> +	struct wbt_wait_data data = {
> +		.curr = current,
> +		.rwb = rwb,
> +		.rqw = rqw,
> +		.rw = rw,
> +	};
> +
> +	if (!wq_has_sleeper(&rqw->wait) &&
> +			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>  		return;
>  
> -	add_wait_queue_exclusive(&rqw->wait, &wait);
> -	do {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> -
> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> -			break;
> -
> -		if (lock) {
> -			spin_unlock_irq(lock);
> -			io_schedule();
> -			spin_lock_irq(lock);
> -		} else
> -			io_schedule();
> -		has_sleeper = false;
> -	} while (1);
> -
> -	__set_current_state(TASK_RUNNING);
> -	remove_wait_queue(&rqw->wait, &wait);
> +	wbt_init_wait(&wait, &data);
> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
> +			TASK_UNINTERRUPTIBLE);
> +	if (lock) {
> +		spin_unlock_irq(lock);
> +		io_schedule();
> +		spin_lock_irq(lock);
> +	} else
> +		io_schedule();
>  }
>  
>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
> 


-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 13:08 [PATCH] blk-wbt: get back the missed wakeup from __wbt_done Jianchao Wang
  2018-08-23 15:37 ` Jens Axboe
@ 2018-08-23 15:42 ` Jens Axboe
  2018-08-24  2:06   ` jianchao.wang
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-23 15:42 UTC
  To: Jianchao Wang; +Cc: linux-block, linux-kernel

On 8/23/18 7:08 AM, Jianchao Wang wrote:
> @@ -491,31 +525,27 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  	__acquires(lock)
>  {
>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
> -	DECLARE_WAITQUEUE(wait, current);
> -	bool has_sleeper;
> -
> -	has_sleeper = wq_has_sleeper(&rqw->wait);
> -	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> +	struct wait_queue_entry wait;
> +	struct wbt_wait_data data = {
> +		.curr = current,
> +		.rwb = rwb,
> +		.rqw = rqw,
> +		.rw = rw,
> +	};
> +
> +	if (!wq_has_sleeper(&rqw->wait) &&
> +			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>  		return;
>  
> -	add_wait_queue_exclusive(&rqw->wait, &wait);
> -	do {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> -
> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> -			break;
> -
> -		if (lock) {
> -			spin_unlock_irq(lock);
> -			io_schedule();
> -			spin_lock_irq(lock);
> -		} else
> -			io_schedule();
> -		has_sleeper = false;
> -	} while (1);
> -
> -	__set_current_state(TASK_RUNNING);
> -	remove_wait_queue(&rqw->wait, &wait);
> +	wbt_init_wait(&wait, &data);
> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
> +			TASK_UNINTERRUPTIBLE);
> +	if (lock) {
> +		spin_unlock_irq(lock);
> +		io_schedule();
> +		spin_lock_irq(lock);
> +	} else
> +		io_schedule();

Aren't we still missing a get-token attempt after adding to the
waitqueue? For the case where someone frees the token after your initial
check, but before you add yourself to the waitqueue.
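
Roughly this interleaving, to be concrete (hypothetical; A is the
would-be sleeper, B a completion):

	/*
	 * A: __wbt_wait(): rq_wait_inc_below() fails, limit reached
	 *                              B: __wbt_done(): atomic_dec_return()
	 *                              B: wq_has_sleeper() -> false, no wakeup
	 * A: prepare_to_wait_exclusive()
	 * A: io_schedule()  <- sleeps even though a token is now free
	 */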

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 15:37 ` Jens Axboe
@ 2018-08-23 16:24   ` van der Linden, Frank
       [not found]     ` <20180823210144.GB5624@kaos-source-ops-60001.pdx1.amazon.com>
  0 siblings, 1 reply; 22+ messages in thread
From: van der Linden, Frank @ 2018-08-23 16:24 UTC
  To: Jens Axboe, Jianchao Wang; +Cc: linux-block, linux-kernel, Agarwal, Anchal

On 8/23/18 8:37 AM, Jens Axboe wrote:
> On 8/23/18 7:08 AM, Jianchao Wang wrote:
>> Commit 2887e41 ("blk-wbt: Avoid lock contention and thundering herd
>> issue in wbt_wait") introduces two cases that can miss a wakeup:
>>  - __wbt_done only wakes up one waiter at a time. There could be
>>    multiple waiters while (limit - inflight) > 1 at that moment.
>>
>>  - When a waiter is woken up, it is still on the wait queue and sets
>>    itself back to TASK_UNINTERRUPTIBLE immediately, so it can be
>>    woken up one more time. If another __wbt_done comes in and wakes
>>    it again, that wakeup is wasted on the previous waiter.
>>
>> To fix both cases without introducing too much lock contention, we
>> introduce our own wake-up function, wbt_wake_function, in __wbt_wait
>> and use wake_up_all in __wbt_done. wbt_wake_function first tries to
>> get wbt budget; if that succeeds, it wakes up the process, otherwise
>> it returns -1 to interrupt the wake-up loop.
> I really like this approach, since it'll naturally wake up as many
> waiters as we can, instead of either doing single wakeups or waking
> everyone.
>
> BTW, you added Anchal and Frank to the CC in the patch, but they
> are not actually CC'ed. Doing that now - can you guys give this
> a whirl? It should still solve the thundering herd issue, but be
> closer to the original logic in terms of wakeups. Actually better,
> since we remain on the list and remain ordered.
>
> Leaving the patch below for you guys.
>
>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>> Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
>> Cc: Anchal Agarwal <anchalag@amazon.com>
>> Cc: Frank van der Linden <fllinden@amazon.com>
>> ---
>>  block/blk-wbt.c | 78 +++++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 54 insertions(+), 24 deletions(-)
>>
>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>> index c9358f1..2667590 100644
>> --- a/block/blk-wbt.c
>> +++ b/block/blk-wbt.c
>> @@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
>>  		int diff = limit - inflight;
>>  
>>  		if (!inflight || diff >= rwb->wb_background / 2)
>> -			wake_up(&rqw->wait);
>> +			wake_up_all(&rqw->wait);
>>  	}
>>  }
>>  
>> @@ -481,6 +481,40 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
>>  	return limit;
>>  }
>>  
>> +struct wbt_wait_data {
>> +	struct task_struct *curr;
>> +	struct rq_wb *rwb;
>> +	struct rq_wait *rqw;
>> +	unsigned long rw;
>> +};
>> +
>> +static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
>> +		int wake_flags, void *key)
>> +{
>> +	struct wbt_wait_data *data = curr->private;
>> +
>> +	/*
>> +	 * If we fail to get a budget, return -1 to interrupt the wake up
>> +	 * loop in __wake_up_common.
>> +	 */
>> +	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
>> +		return -1;
>> +
>> +	wake_up_process(data->curr);
>> +
>> +	list_del_init(&curr->entry);
>> +	return 1;
>> +}
>> +
>> +static inline void wbt_init_wait(struct wait_queue_entry *wait,
>> +		struct wbt_wait_data *data)
>> +{
>> +	INIT_LIST_HEAD(&wait->entry);
>> +	wait->flags = 0;
>> +	wait->func = wbt_wake_function;
>> +	wait->private = data;
>> +}
>> +
>>  /*
>>   * Block if we will exceed our limit, or if we are currently waiting for
>>   * the timer to kick off queuing again.
>> @@ -491,31 +525,27 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>>  	__acquires(lock)
>>  {
>>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
>> -	DECLARE_WAITQUEUE(wait, current);
>> -	bool has_sleeper;
>> -
>> -	has_sleeper = wq_has_sleeper(&rqw->wait);
>> -	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>> +	struct wait_queue_entry wait;
>> +	struct wbt_wait_data data = {
>> +		.curr = current,
>> +		.rwb = rwb,
>> +		.rqw = rqw,
>> +		.rw = rw,
>> +	};
>> +
>> +	if (!wq_has_sleeper(&rqw->wait) &&
>> +			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>>  		return;
>>  
>> -	add_wait_queue_exclusive(&rqw->wait, &wait);
>> -	do {
>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>> -
>> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>> -			break;
>> -
>> -		if (lock) {
>> -			spin_unlock_irq(lock);
>> -			io_schedule();
>> -			spin_lock_irq(lock);
>> -		} else
>> -			io_schedule();
>> -		has_sleeper = false;
>> -	} while (1);
>> -
>> -	__set_current_state(TASK_RUNNING);
>> -	remove_wait_queue(&rqw->wait, &wait);
>> +	wbt_init_wait(&wait, &data);
>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
>> +			TASK_UNINTERRUPTIBLE);
>> +	if (lock) {
>> +		spin_unlock_irq(lock);
>> +		io_schedule();
>> +		spin_lock_irq(lock);
>> +	} else
>> +		io_schedule();
>>  }
>>  
>>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
>>
>
I like it - this combines some of the ideas outlined in our previous
thread into a good solution.

One comment, though: I think we need to set the task state back to
TASK_RUNNING after being woken up, right?

Frank



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
       [not found]     ` <20180823210144.GB5624@kaos-source-ops-60001.pdx1.amazon.com>
@ 2018-08-23 23:03       ` Jens Axboe
  2018-08-23 23:14         ` Jens Axboe
       [not found]         ` <20180824181223.GA9049@kaos-source-ops-60001.pdx1.amazon.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2018-08-23 23:03 UTC
  To: Anchal Agarwal, van der Linden, Frank, jianchao.w.wang
  Cc: linux-block@vger.kernel.org, linux-kernel

On 8/23/18 3:01 PM, Anchal Agarwal wrote:
> On Thu, Aug 23, 2018 at 10:24:00AM -0600, van der Linden, Frank wrote:
>> On 8/23/18 8:37 AM, Jens Axboe wrote:
>>> On 8/23/18 7:08 AM, Jianchao Wang wrote:
>>>> Commit 2887e41 ("blk-wbt: Avoid lock contention and thundering herd
>>>> issue in wbt_wait") introduces two cases that can miss a wakeup:
>>>>  - __wbt_done only wakes up one waiter at a time. There could be
>>>>    multiple waiters while (limit - inflight) > 1 at that moment.
>>>>
>>>>  - When a waiter is woken up, it is still on the wait queue and sets
>>>>    itself back to TASK_UNINTERRUPTIBLE immediately, so it can be
>>>>    woken up one more time. If another __wbt_done comes in and wakes
>>>>    it again, that wakeup is wasted on the previous waiter.
>>>>
>>>> To fix both cases without introducing too much lock contention, we
>>>> introduce our own wake-up function, wbt_wake_function, in __wbt_wait
>>>> and use wake_up_all in __wbt_done. wbt_wake_function first tries to
>>>> get wbt budget; if that succeeds, it wakes up the process, otherwise
>>>> it returns -1 to interrupt the wake-up loop.
>>> I really like this approach, since it'll naturally wake up as many
>>> waiters as we can, instead of either doing single wakeups or waking
>>> everyone.
>>>
>>> BTW, you added Anchal and Frank to the CC in the patch, but they
>>> are not actually CC'ed. Doing that now - can you guys give this
>>> a whirl? It should still solve the thundering herd issue, but be
>>> closer to the original logic in terms of wakeups. Actually better,
>>> since we remain on the list and remain ordered.
>>>
>>> Leaving the patch below for you guys.
>>>
>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>> Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
>>>> Cc: Anchal Agarwal <anchalag@amazon.com>
>>>> Cc: Frank van der Linden <fllinden@amazon.com>
>>>> ---
>>>>  block/blk-wbt.c | 78 +++++++++++++++++++++++++++++++++++++++------------------
>>>>  1 file changed, 54 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>>>> index c9358f1..2667590 100644
>>>> --- a/block/blk-wbt.c
>>>> +++ b/block/blk-wbt.c
>>>> @@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
>>>>  		int diff = limit - inflight;
>>>>  
>>>>  		if (!inflight || diff >= rwb->wb_background / 2)
>>>> -			wake_up(&rqw->wait);
>>>> +			wake_up_all(&rqw->wait);
>>>>  	}
>>>>  }
>>>>  
>>>> @@ -481,6 +481,40 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
>>>>  	return limit;
>>>>  }
>>>>  
>>>> +struct wbt_wait_data {
>>>> +	struct task_struct *curr;
>>>> +	struct rq_wb *rwb;
>>>> +	struct rq_wait *rqw;
>>>> +	unsigned long rw;
>>>> +};
>>>> +
>>>> +static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
>>>> +		int wake_flags, void *key)
>>>> +{
>>>> +	struct wbt_wait_data *data = curr->private;
>>>> +
>>>> +	/*
>>>> +	 * If we fail to get a budget, return -1 to interrupt the wake up
>>>> +	 * loop in __wake_up_common.
>>>> +	 */
>>>> +	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
>>>> +		return -1;
>>>> +
>>>> +	wake_up_process(data->curr);
>>>> +
>>>> +	list_del_init(&curr->entry);
>>>> +	return 1;
>>>> +}
>>>> +
>>>> +static inline void wbt_init_wait(struct wait_queue_entry *wait,
>>>> +		struct wbt_wait_data *data)
>>>> +{
>>>> +	INIT_LIST_HEAD(&wait->entry);
>>>> +	wait->flags = 0;
>>>> +	wait->func = wbt_wake_function;
>>>> +	wait->private = data;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Block if we will exceed our limit, or if we are currently waiting for
>>>>   * the timer to kick off queuing again.
>>>> @@ -491,31 +525,27 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>>>>  	__acquires(lock)
>>>>  {
>>>>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
>>>> -	DECLARE_WAITQUEUE(wait, current);
>>>> -	bool has_sleeper;
>>>> -
>>>> -	has_sleeper = wq_has_sleeper(&rqw->wait);
>>>> -	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>>>> +	struct wait_queue_entry wait;
>>>> +	struct wbt_wait_data data = {
>>>> +		.curr = current,
>>>> +		.rwb = rwb,
>>>> +		.rqw = rqw,
>>>> +		.rw = rw,
>>>> +	};
>>>> +
>>>> +	if (!wq_has_sleeper(&rqw->wait) &&
>>>> +			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>>>>  		return;
>>>>  
>>>> -	add_wait_queue_exclusive(&rqw->wait, &wait);
>>>> -	do {
>>>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>>>> -
>>>> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>>>> -			break;
>>>> -
>>>> -		if (lock) {
>>>> -			spin_unlock_irq(lock);
>>>> -			io_schedule();
>>>> -			spin_lock_irq(lock);
>>>> -		} else
>>>> -			io_schedule();
>>>> -		has_sleeper = false;
>>>> -	} while (1);
>>>> -
>>>> -	__set_current_state(TASK_RUNNING);
>>>> -	remove_wait_queue(&rqw->wait, &wait);
>>>> +	wbt_init_wait(&wait, &data);
>>>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
>>>> +			TASK_UNINTERRUPTIBLE);
>>>> +	if (lock) {
>>>> +		spin_unlock_irq(lock);
>>>> +		io_schedule();
>>>> +		spin_lock_irq(lock);
>>>> +	} else
>>>> +		io_schedule();
>>>>  }
>>>>  
>>>>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
>>>>
>>>
>> I like it - this combines some of the ideas outlined in our previous
>> thread into a good solution.
>>
>> One comment, though: I think we need to set the task state back to
>> TASK_RUNNING after being woken up, right?
>>
>> Frank
>>
>>
> 
> Hi Jens,
> This patch looks much cleaner for sure, as Frank pointed out too. Basically this looks similar
> to wake_up_nr, only making sure that the woken-up requests won't get reordered. This does
> solve the thundering herd issue. However, I tested the patch against my application and
> lock contention numbers rose to around 10 times what I had with your last 3 patches.
> It did also add a drop of 0.12% in total files read and of 0.02% in the rate at which they
> were read, but that is not a very significant drop. Is the lock contention worth the tradeoff?
> I also added the missing __set_current_state(TASK_RUNNING) to the patch for testing.

Can you try this variant? I don't think we need a __set_current_state() after
io_schedule(), should be fine as-is.

I'm not surprised this will raise contention a bit, since we're now waking N tasks
potentially, if N can queue. With the initial change, we'd always just wake one.
That is arguably incorrect. You say it's 10 times higher contention, how does
that compare to before your patch?

Is it possible to run something that looks like your workload?


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 84507d3e9a98..00a57aa160b8 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+			wake_up_all(&rqw->wait);
 	}
 }
 
@@ -481,6 +481,40 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+		int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = curr->private;
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	wake_up_process(data->curr);
+
+	list_del_init(&curr->entry);
+	return 1;
+}
+
+static inline void wbt_init_wait(struct wait_queue_entry *wait,
+				 struct wbt_wait_data *data)
+{
+	INIT_LIST_HEAD(&wait->entry);
+	wait->flags = 0;
+	wait->func = wbt_wake_function;
+	wait->private = data;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +525,33 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
+	struct wait_queue_entry wait;
+	struct wbt_wait_data data = {
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
 	bool has_sleeper;
 
 	has_sleeper = wq_has_sleeper(&rqw->wait);
 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+	wbt_init_wait(&wait, &data);
+	prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-			break;
+	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+		finish_wait(&rqw->wait, &wait);
+		return;
+	}
 
-		if (lock) {
-			spin_unlock_irq(lock);
-			io_schedule();
-			spin_lock_irq(lock);
-		} else
-			io_schedule();
-		has_sleeper = false;
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else
+		io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 23:03       ` Jens Axboe
@ 2018-08-23 23:14         ` Jens Axboe
  2018-08-24  5:55           ` jianchao.wang
       [not found]         ` <20180824181223.GA9049@kaos-source-ops-60001.pdx1.amazon.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-23 23:14 UTC
  To: Anchal Agarwal, van der Linden, Frank, jianchao.w.wang
  Cc: linux-block@vger.kernel.org, linux-kernel

On 8/23/18 5:03 PM, Jens Axboe wrote:
>> Hi Jens, This patch looks much cleaner for sure, as Frank pointed out
>> too. Basically this looks similar to wake_up_nr, only making sure that
>> the woken-up requests won't get reordered. This does solve the
>> thundering herd issue. However, I tested the patch against my
>> application and lock contention numbers rose to around 10 times what
>> I had with your last 3 patches. It did also add a drop of 0.12% in
>> total files read and of 0.02% in the rate at which they were read,
>> but that is not a very significant drop. Is the lock contention worth
>> the tradeoff? I also added the missing
>> __set_current_state(TASK_RUNNING) to the patch for testing.
> 
> Can you try this variant? I don't think we need a
> __set_current_state() after io_schedule(), should be fine as-is.
> 
> I'm not surprised this will raise contention a bit, since we're now
> waking N tasks potentially, if N can queue. With the initial change,
> we'd always just wake one.  That is arguably incorrect. You say it's
> 10 times higher contention, how does that compare to before your
> patch?
> 
> Is it possible to run something that looks like your workload?

Additionally, is the contention you are seeing the wait queue, or the
atomic counter? When you say lock contention, I'm inclined to think it's
the rqw->wait.lock.
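
For reference, the two candidates in this path (just a sketch; see
struct rq_wait in blk-rq-qos.h):

	atomic_dec_return(&rqw->inflight);	/* shared atomic counter */
	wake_up_all(&rqw->wait);		/* takes rqw->wait.lock */
	prepare_to_wait_exclusive(&rqw->wait, &wait,
			TASK_UNINTERRUPTIBLE);	/* sleepers take rqw->wait.lock too */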

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 15:42 ` Jens Axboe
@ 2018-08-24  2:06   ` jianchao.wang
  2018-08-24 14:40     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: jianchao.wang @ 2018-08-24  2:06 UTC
  To: Jens Axboe; +Cc: linux-block, linux-kernel

Hi Jens

On 08/23/2018 11:42 PM, Jens Axboe wrote:
>> -
>> -	__set_current_state(TASK_RUNNING);
>> -	remove_wait_queue(&rqw->wait, &wait);
>> +	wbt_init_wait(&wait, &data);
>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
>> +			TASK_UNINTERRUPTIBLE);
>> +	if (lock) {
>> +		spin_unlock_irq(lock);
>> +		io_schedule();
>> +		spin_lock_irq(lock);
>> +	} else
>> +		io_schedule();
> Aren't we still missing a get-token attempt after adding to the
> waitqueue? For the case where someone frees the token after your initial
> check, but before you add yourself to the waitqueue.

I did think about this.
However, there is a very tricky scenario here:
we try to get the wbt budget in wbt_wake_function.
Once a task has been added to the wait queue, wbt_wake_function can
already be invoked for it. If we also get the wbt budget after
prepare_to_wait_exclusive, we may get the wbt budget twice.
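
For example, with a hypothetical interleaving like this:

	/*
	 * A: __wbt_wait()               B: __wbt_done()
	 * A: prepare_to_wait_exclusive()
	 *                               B: wake_up_all()
	 *                               B:   wbt_wake_function(A)
	 *                               B:     rq_wait_inc_below() ok
	 *                               B:     A now owns one budget
	 * A: rq_wait_inc_below() ok again
	 * A: now owns a second budget
	 */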

Thanks
Jianchao                                                               


* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-23 23:14         ` Jens Axboe
@ 2018-08-24  5:55           ` jianchao.wang
  2018-08-24 16:40             ` van der Linden, Frank
  0 siblings, 1 reply; 22+ messages in thread
From: jianchao.wang @ 2018-08-24  5:55 UTC
  To: Jens Axboe, Anchal Agarwal, van der Linden, Frank
  Cc: linux-block@vger.kernel.org, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]



On 08/24/2018 07:14 AM, Jens Axboe wrote:
> On 8/23/18 5:03 PM, Jens Axboe wrote:
>>> Hi Jens, This patch looks much cleaner for sure, as Frank pointed out
>>> too. Basically this looks similar to wake_up_nr, only making sure that
>>> the woken-up requests won't get reordered. This does solve the
>>> thundering herd issue. However, I tested the patch against my
>>> application and lock contention numbers rose to around 10 times what
>>> I had with your last 3 patches. It did also add a drop of 0.12% in
>>> total files read and of 0.02% in the rate at which they were read,
>>> but that is not a very significant drop. Is the lock contention worth
>>> the tradeoff? I also added the missing
>>> __set_current_state(TASK_RUNNING) to the patch for testing.
>>
>> Can you try this variant? I don't think we need a
>> __set_current_state() after io_schedule(), should be fine as-is.
>>
>> I'm not surprised this will raise contention a bit, since we're now
>> waking N tasks potentially, if N can queue. With the initial change,
>> we'd always just wake one.  That is arguably incorrect. You say it's
>> 10 times higher contention, how does that compare to before your
>> patch?
>>
>> Is it possible to run something that looks like your workload?
> 
> Additionally, is the contention you are seeing the wait queue, or the
> atomic counter? When you say lock contention, I'm inclined to think it's
> the rqw->wait.lock.
> 

I guess the increased lock contention is due to this:
while a wakeup is in progress with the wait queue head lock held, there are
still waiters on the wait queue, and __wbt_wait will go to sleep and try to
acquire that same head lock. This is necessary to keep the order of the
rqw->wait queue.

The attached patch does the following to try to avoid the scenario above:
"
Introduce a wait queue rqw->delayed. Try to lock rqw->wait.lock first; if
that fails, add the waiter to rqw->delayed. __wbt_done will pick the waiters
on rqw->delayed up and queue them at the tail of rqw->wait before it does
the wakeup.
"

Thanks
Jianchao

[-- Attachment #2: 0001-blk-wbt-get-back-the-missed-wakeup-from-__wbt_done.patch --]
[-- Type: text/x-patch, Size: 7145 bytes --]

From c410983b85d5624a9b4ff579240818dfaab2f5db Mon Sep 17 00:00:00 2001
From: Jianchao Wang <jianchao.w.wang@oracle.com>
Date: Fri, 24 Aug 2018 02:10:21 +0800
Subject: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done

Commit 2887e41 ("blk-wbt: Avoid lock contention and thundering herd
issue in wbt_wait") introduces two cases that can miss a wakeup:
 - __wbt_done only wakes up one waiter at a time. There could be
   multiple waiters while (limit - inflight) > 1 at that moment.

 - When a waiter is woken up, it is still on the wait queue and sets
   itself back to TASK_UNINTERRUPTIBLE immediately, so it can be
   woken up one more time. If another __wbt_done comes in and wakes
   it again, that wakeup is wasted on the previous waiter.

To fix them, we introduce our own wake-up function, wbt_wake_function,
in __wbt_wait and use wake_up_all in __wbt_done. wbt_wake_function
first tries to get wbt budget; if that succeeds, it wakes up the
process, otherwise it returns -1 to interrupt the wake-up loop.

To prevent reordering on rqw->wait, __wbt_wait first checks whether
there is a sleeper on it before trying to get wbt budget; if there
is a sleeper, it goes to sleep directly. Because a wake-up operation
holding rqw->wait.lock can take a relatively long time, extra lock
contention could be introduced when __wbt_wait tries to add itself
to rqw->wait.

To avoid this, introduce a wait queue rqw->delayed. We first try to
lock rqw->wait.lock; if that fails, we add the waiter to
rqw->delayed. __wbt_done will pick the waiters on rqw->delayed up
and queue them at the tail of rqw->wait before it does the wake-up.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
Cc: Anchal Agarwal <anchalag@amazon.com>
Cc: Frank van der Linden <fllinden@amazon.com>
---
 block/blk-rq-qos.h |   2 +
 block/blk-wbt.c    | 130 ++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 105 insertions(+), 27 deletions(-)

diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 32b02ef..a6111eb 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,6 +14,7 @@ enum rq_qos_id {
 
 struct rq_wait {
 	wait_queue_head_t wait;
+	wait_queue_head_t delayed;
 	atomic_t inflight;
 };
 
@@ -70,6 +71,7 @@ static inline void rq_wait_init(struct rq_wait *rq_wait)
 {
 	atomic_set(&rq_wait->inflight, 0);
 	init_waitqueue_head(&rq_wait->wait);
+	init_waitqueue_head(&rq_wait->delayed);
 }
 
 static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c9358f1..7fa61fe 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -111,6 +111,23 @@ static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
 	return &rwb->rq_wait[WBT_RWQ_BG];
 }
 
+static void wbt_check_delayed(struct rq_wait *rqw)
+{
+	unsigned long flags;
+
+	if (!wq_has_sleeper(&rqw->delayed))
+		return;
+	/*
+	 * Avoid to require wait->lock holding delayed->lock, that will
+	 * introduce undesire contending on delayed->lock.
+	 */
+	spin_lock_irqsave(&rqw->wait.lock, flags);
+	spin_lock(&rqw->delayed.lock);
+	list_splice_tail_init(&rqw->delayed.head, &rqw->wait.head);
+	spin_unlock(&rqw->delayed.lock);
+	spin_unlock_irqrestore(&rqw->wait.lock, flags);
+}
+
 static void rwb_wake_all(struct rq_wb *rwb)
 {
 	int i;
@@ -118,8 +135,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
 	for (i = 0; i < WBT_NUM_RWQ; i++) {
 		struct rq_wait *rqw = &rwb->rq_wait[i];
 
-		if (wq_has_sleeper(&rqw->wait))
+		if (wq_has_sleeper(&rqw->wait) ||
+				waitqueue_active(&rqw->delayed)) {
+			wbt_check_delayed(rqw);
 			wake_up_all(&rqw->wait);
+		}
 	}
 }
 
@@ -162,11 +182,14 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 	if (inflight && inflight >= limit)
 		return;
 
-	if (wq_has_sleeper(&rqw->wait)) {
+	if (wq_has_sleeper(&rqw->wait) ||
+			waitqueue_active(&rqw->delayed)) {
 		int diff = limit - inflight;
 
-		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+		if (!inflight || diff >= rwb->wb_background / 2) {
+			wbt_check_delayed(rqw);
+			wake_up_all(&rqw->wait);
+		}
 	}
 }
 
@@ -481,6 +504,64 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+		int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = curr->private;
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	wake_up_process(data->curr);
+
+	list_del_init(&curr->entry);
+	return 1;
+}
+
+static inline void wbt_init_wait(struct wait_queue_entry *wait,
+		struct wbt_wait_data *data)
+{
+	INIT_LIST_HEAD(&wait->entry);
+	wait->flags = 0;
+	wait->func = wbt_wake_function;
+	wait->private = data;
+}
+
+static void wbt_add_wait(struct rq_wait *rqw, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+
+	if (spin_trylock_irqsave(&rqw->wait.lock, flags)) {
+		if (list_empty(&wq_entry->entry))
+			__add_wait_queue_entry_tail(&rqw->wait, wq_entry);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock_irqrestore(&rqw->wait.lock, flags);
+	} else {
+		/*
+		 * __wbt_done will pick the delayed waiters up and add them on
+		 * the tail of rqw->wait queue.
+		 */
+		spin_lock_irqsave(&rqw->delayed.lock, flags);
+		if (list_empty(&wq_entry->entry))
+			__add_wait_queue_entry_tail(&rqw->delayed, wq_entry);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock_irqrestore(&rqw->delayed.lock, flags);
+	}
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +572,26 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
-	bool has_sleeper;
-
-	has_sleeper = wq_has_sleeper(&rqw->wait);
-	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+	struct wait_queue_entry wait;
+	struct wbt_wait_data data = {
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
+
+	if (!wq_has_sleeper(&rqw->wait) &&
+			rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-			break;
-
-		if (lock) {
-			spin_unlock_irq(lock);
-			io_schedule();
-			spin_lock_irq(lock);
-		} else
-			io_schedule();
-		has_sleeper = false;
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+	wbt_init_wait(&wait, &data);
+	wbt_add_wait(rqw, &wait);
+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else
+		io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.7.4



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24  2:06   ` jianchao.wang
@ 2018-08-24 14:40     ` Jens Axboe
  2018-08-24 14:58       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 14:40 UTC
  To: jianchao.wang; +Cc: linux-block, linux-kernel

On 8/23/18 8:06 PM, jianchao.wang wrote:
> Hi Jens
> 
> On 08/23/2018 11:42 PM, Jens Axboe wrote:
>>> -
>>> -	__set_current_state(TASK_RUNNING);
>>> -	remove_wait_queue(&rqw->wait, &wait);
>>> +	wbt_init_wait(&wait, &data);
>>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
>>> +			TASK_UNINTERRUPTIBLE);
>>> +	if (lock) {
>>> +		spin_unlock_irq(lock);
>>> +		io_schedule();
>>> +		spin_lock_irq(lock);
>>> +	} else
>>> +		io_schedule();
>> Aren't we still missing a get-token attempt after adding to the
>> waitqueue? For the case where someone frees the token after your initial
>> check, but before you add yourself to the waitqueue.
> 
> I did think about this.
> However, there is a very tricky scenario here:
> we try to get the wbt budget in wbt_wake_function.
> Once a task has been added to the wait queue, wbt_wake_function can
> already be invoked for it. If we also get the wbt budget after
> prepare_to_wait_exclusive, we may get the wbt budget twice.

Ah yes good point. But without it, you've got another race that will
potentially put you to sleep forever.

How about something like the below? That should take care of both
situations. Totally untested.


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 84507d3e9a98..3d36bd158301 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -166,7 +166,7 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+			wake_up_all(&rqw->wait);
 	}
 }
 
@@ -481,6 +481,32 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+	bool got_token;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+			     int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = curr->private;
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	data->got_token = true;
+	wake_up_process(data->curr);
+	list_del_init(&curr->entry);
+	return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +517,46 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
+	struct wbt_wait_data data = {
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
+	struct wait_queue_entry wait = {
+		.func		= wbt_wake_function,
+		.private	= &data,
+		.entry		= LIST_HEAD_INIT(wait.entry),
+	};
 	bool has_sleeper;
 
 	has_sleeper = wq_has_sleeper(&rqw->wait);
 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+	prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-			break;
+	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+		finish_wait(&rqw->wait, &wait);
+
+		/*
+		 * We raced with wbt_wake_function() getting a token, which
+		 * means we now have two. Put ours and wake anyone else
+		 * potentially waiting for one.
+		 */
+		if (data.got_token) {
+			atomic_dec(&rqw->inflight);
+			wake_up_all(&rqw->wait);
+		}
+		return;
+	}
 
-		if (lock) {
-			spin_unlock_irq(lock);
-			io_schedule();
-			spin_lock_irq(lock);
-		} else
-			io_schedule();
-		has_sleeper = false;
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else
+		io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24 14:40     ` Jens Axboe
@ 2018-08-24 14:58       ` Jens Axboe
  2018-08-24 17:14         ` Eduardo Valentin
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 14:58 UTC
  To: jianchao.wang; +Cc: linux-block, linux-kernel

On 8/24/18 8:40 AM, Jens Axboe wrote:
> On 8/23/18 8:06 PM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 08/23/2018 11:42 PM, Jens Axboe wrote:
>>>> -
>>>> -	__set_current_state(TASK_RUNNING);
>>>> -	remove_wait_queue(&rqw->wait, &wait);
>>>> +	wbt_init_wait(&wait, &data);
>>>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
>>>> +			TASK_UNINTERRUPTIBLE);
>>>> +	if (lock) {
>>>> +		spin_unlock_irq(lock);
>>>> +		io_schedule();
>>>> +		spin_lock_irq(lock);
>>>> +	} else
>>>> +		io_schedule();
>>> Aren't we still missing a get-token attempt after adding to the
>>> waitqueue? For the case where someone frees the token after your initial
>>> check, but before you add yourself to the waitqueue.
>>
>> I did think about this.
>> However, there is a very tricky scenario here:
>> we try to get the wbt budget in wbt_wake_function.
>> Once a task has been added to the wait queue, wbt_wake_function can
>> already be invoked for it. If we also get the wbt budget after
>> prepare_to_wait_exclusive, we may get the wbt budget twice.
> 
> Ah yes good point. But without it, you've got another race that will
> potentially put you to sleep forever.
> 
> How about something like the below? That should take care of both
> situations. Totally untested.

Slightly better/cleaner one below. Still totally untested.


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 84507d3e9a98..bc13544943ff 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -123,16 +123,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
 	}
 }
 
-static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+			 enum wbt_flags wb_acct)
 {
-	struct rq_wb *rwb = RQWB(rqos);
-	struct rq_wait *rqw;
 	int inflight, limit;
 
-	if (!(wb_acct & WBT_TRACKED))
-		return;
-
-	rqw = get_rq_wait(rwb, wb_acct);
 	inflight = atomic_dec_return(&rqw->inflight);
 
 	/*
@@ -166,8 +161,21 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+			wake_up_all(&rqw->wait);
 	}
+
+}
+
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+{
+	struct rq_wb *rwb = RQWB(rqos);
+	struct rq_wait *rqw;
+
+	if (!(wb_acct & WBT_TRACKED))
+		return;
+
+	rqw = get_rq_wait(rwb, wb_acct);
+	wbt_rqw_done(rwb, rqw, wb_acct);
 }
 
 /*
@@ -481,6 +489,32 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+	bool got_token;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+			     int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = curr->private;
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	data->got_token = true;
+	wake_up_process(data->curr);
+	list_del_init(&curr->entry);
+	return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +525,44 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
+	struct wbt_wait_data data = {
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
+	struct wait_queue_entry wait = {
+		.func		= wbt_wake_function,
+		.private	= &data,
+		.entry		= LIST_HEAD_INIT(wait.entry),
+	};
 	bool has_sleeper;
 
 	has_sleeper = wq_has_sleeper(&rqw->wait);
 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+	prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-			break;
+	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+		finish_wait(&rqw->wait, &wait);
 
-		if (lock) {
-			spin_unlock_irq(lock);
-			io_schedule();
-			spin_lock_irq(lock);
-		} else
-			io_schedule();
-		has_sleeper = false;
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+		/*
+		 * We raced with wbt_wake_function() getting a token, which
+		 * means we now have two. Put ours and wake anyone else
+		 * potentially waiting for one.
+		 */
+		if (data.got_token)
+			wbt_rqw_done(rwb, rqw, wb_acct);
+		return;
+	}
+
+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else
+		io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24  5:55           ` jianchao.wang
@ 2018-08-24 16:40             ` van der Linden, Frank
  2018-08-24 16:44               ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: van der Linden, Frank @ 2018-08-24 16:40 UTC
  To: jianchao.wang, Jens Axboe, Anchal Agarwal
  Cc: linux-block@vger.kernel.org, linux-kernel

On 8/23/18 10:56 PM, jianchao.wang wrote:
>
> On 08/24/2018 07:14 AM, Jens Axboe wrote:
>> On 8/23/18 5:03 PM, Jens Axboe wrote:
>>>> Hi Jens, This patch looks much cleaner for sure, as Frank pointed out
>>>> too. Basically this looks similar to wake_up_nr, only making sure that
>>>> the woken-up requests won't get reordered. This does solve the
>>>> thundering herd issue. However, I tested the patch against my
>>>> application and lock contention numbers rose to around 10 times what
>>>> I had with your last 3 patches. It did also add a drop of 0.12% in
>>>> total files read and of 0.02% in the rate at which they were read,
>>>> but that is not a very significant drop. Is the lock contention worth
>>>> the tradeoff? I also added the missing
>>>> __set_current_state(TASK_RUNNING) to the patch for testing.
>>> Can you try this variant? I don't think we need a
>>> __set_current_state() after io_schedule(), should be fine as-is.
>>>
>>> I'm not surprised this will raise contention a bit, since we're now
>>> waking N tasks potentially, if N can queue. With the initial change,
>>> we'd always just wake one.  That is arguably incorrect. You say it's
>>> 10 times higher contention, how does that compare to before your
>>> patch?
>>>
>>> Is it possible to run something that looks like your workload?
>> Additionally, is the contention you are seeing the wait queue, or the
>> atomic counter? When you say lock contention, I'm inclined to think it's
>> the rqw->wait.lock.
>>
> I guess the increased lock contention is due to this:
> while a wakeup is in progress with the wait queue head lock held, there are
> still waiters on the wait queue, and __wbt_wait will go to sleep and try to
> acquire that same head lock. This is necessary to keep the order of the
> rqw->wait queue.
>
> The attached patch does the following to try to avoid the scenario above:
> "
> Introduce a wait queue rqw->delayed. Try to lock rqw->wait.lock first; if
> that fails, add the waiter to rqw->delayed. __wbt_done will pick the waiters
> on rqw->delayed up and queue them at the tail of rqw->wait before it does
> the wakeup.
> "
>
Hmm, I am not sure about this one. Sure, it will reduce lock contention
for the waitq lock, but it also introduces more complexity.

It's expected that there will be more contention if the waitq lock is
held longer. That's the tradeoff for waking up more throttled tasks and
making progress faster. Is this added complexity worth the gains? My
first inclination would be to say no.

If lock contention on a wait queue is an issue, then either the wait
queue mechanism itself should be improved, or the code that uses the
wait queue should be fixed. Also, the contention is still a lot lower
than it used to be.

- Frank


* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24 16:40             ` van der Linden, Frank
@ 2018-08-24 16:44               ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 16:44 UTC
  To: van der Linden, Frank, jianchao.wang, Anchal Agarwal
  Cc: linux-block@vger.kernel.org, linux-kernel

On 8/24/18 10:40 AM, van der Linden, Frank wrote:
> On 8/23/18 10:56 PM, jianchao.wang wrote:
>>
>> On 08/24/2018 07:14 AM, Jens Axboe wrote:
>>> On 8/23/18 5:03 PM, Jens Axboe wrote:
>>>>> Hi Jens, This patch looks much cleaner for sure, as Frank pointed out
>>>>> too. Basically this looks similar to wake_up_nr, only making sure that
>>>>> the woken-up requests won't get reordered. This does solve the
>>>>> thundering herd issue. However, I tested the patch against my
>>>>> application and lock contention numbers rose to around 10 times what
>>>>> I had with your last 3 patches. It did also add a drop of 0.12% in
>>>>> total files read and of 0.02% in the rate at which they were read,
>>>>> but that is not a very significant drop. Is the lock contention worth
>>>>> the tradeoff? I also added the missing
>>>>> __set_current_state(TASK_RUNNING) to the patch for testing.
>>>> Can you try this variant? I don't think we need a
>>>> __set_current_state() after io_schedule(), should be fine as-is.
>>>>
>>>> I'm not surprised this will raise contention a bit, since we're now
>>>> waking N tasks potentially, if N can queue. With the initial change,
>>>> we'd always just wake one.  That is arguably incorrect. You say it's
>>>> 10 times higher contention, how does that compare to before your
>>>> patch?
>>>>
>>>> Is it possible to run something that looks like your workload?
>>> Additionally, is the contention you are seeing the wait queue, or the
>>> atomic counter? When you say lock contention, I'm inclined to think it's
>>> the rqw->wait.lock.
>>>
>> I guess the increased lock contention is due to this:
>> while a wakeup is in progress with the wait queue head lock held, there are
>> still waiters on the wait queue, and __wbt_wait will go to sleep and try to
>> acquire that same head lock. This is necessary to keep the order of the
>> rqw->wait queue.
>>
>> The attached patch does the following to try to avoid the scenario above:
>> "
>> Introduce a wait queue rqw->delayed. Try to lock rqw->wait.lock first; if
>> that fails, add the waiter to rqw->delayed. __wbt_done will pick the waiters
>> on rqw->delayed up and queue them at the tail of rqw->wait before it does
>> the wakeup.
>> "
>>
> Hmm, I am not sure about this one. Sure, it will reduce lock contention
> for the waitq lock, but it also introduces more complexity.
> 
> It's expected that there will be more contention if the waitq lock is
> held longer. That's the tradeoff for waking up more throttled tasks and
> making progress faster. Is this added complexity worth the gains? My
> first inclination would be to say no.
> 
> If lock contention on a wait queue is an issue, then either the wait
> queue mechanism itself should be improved, or the code that uses the
> wait queue should be fixed. Also, the contention is still a lot lower
> than it used to be.

Hard to disagree with that. If you look at the blk-mq tagging code, it
spreads out the wait queues exactly because of this.
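
For illustration, here is a minimal sketch of that idea (this is not
the actual blk-mq/sbitmap code; the struct, array size and helper
name below are made up):

	#include <linux/wait.h>
	#include <linux/hash.h>
	#include <linux/log2.h>
	#include <linux/sched.h>

	#define SPREAD_NR_WAITQUEUES	8

	struct spread_rq_wait {
		atomic_t inflight;
		wait_queue_head_t waits[SPREAD_NR_WAITQUEUES];
	};

	/*
	 * Hash the sleeping task to one of several wait queue heads, so
	 * that concurrent sleepers and wakers mostly contend on different
	 * wait.lock instances instead of serializing on a single one. A
	 * waker then has to kick the heads in turn, which is where the
	 * extra complexity lives.
	 */
	static wait_queue_head_t *spread_wait_head(struct spread_rq_wait *rqw)
	{
		return &rqw->waits[hash_ptr(current, ilog2(SPREAD_NR_WAITQUEUES))];
	}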

-- 
Jens Axboe



* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24 14:58       ` Jens Axboe
@ 2018-08-24 17:14         ` Eduardo Valentin
  2018-08-24 17:17           ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Eduardo Valentin @ 2018-08-24 17:14 UTC
  To: Jens Axboe
  Cc: jianchao.wang, linux-block, linux-kernel, Anchal Agarwal,
	Frank van der Linden

Adding Frank and Anchal, as this part of the thread is not copying them.

Just a minor comment at the bottom.

On Fri, Aug 24, 2018 at 08:58:09AM -0600, Jens Axboe wrote:
> On 8/24/18 8:40 AM, Jens Axboe wrote:
> > On 8/23/18 8:06 PM, jianchao.wang wrote:
> >> Hi Jens
> >>
> >> On 08/23/2018 11:42 PM, Jens Axboe wrote:
> >>>> -
> >>>> -	__set_current_state(TASK_RUNNING);
> >>>> -	remove_wait_queue(&rqw->wait, &wait);
> >>>> +	wbt_init_wait(&wait, &data);
> >>>> +	prepare_to_wait_exclusive(&rqw->wait, &wait,
> >>>> +			TASK_UNINTERRUPTIBLE);
> >>>> +	if (lock) {
> >>>> +		spin_unlock_irq(lock);
> >>>> +		io_schedule();
> >>>> +		spin_lock_irq(lock);
> >>>> +	} else
> >>>> +		io_schedule();
> >>> Aren't we still missing a get-token attempt after adding to the
> >>> waitqueue? For the case where someone frees the token after your initial
> >>> check, but before you add yourself to the waitqueue.
> >>
> >> I used to think about this.
> >> However, there is a very tricky scenario here:
> >> We will try to get the wbt budget in wbt_wake_function.
> >> After adding a task to the wait queue, wbt_wake_function can already be
> >> invoked for this task. If we also get the wbt budget after
> >> prepare_to_wait_exclusive, we may get the wbt budget twice.
> > 
> > Ah yes good point. But without it, you've got another race that will
> > potentially put you to sleep forever.
> > 
> > How about something like the below? That should take care of both
> > situations. Totally untested.
> 
> Slightly better/cleaner one below. Still totally untested.
> 
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 84507d3e9a98..bc13544943ff 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -123,16 +123,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
>  	}
>  }
>  
> -static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
> +static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
> +			 enum wbt_flags wb_acct)
>  {
> -	struct rq_wb *rwb = RQWB(rqos);
> -	struct rq_wait *rqw;
>  	int inflight, limit;
>  
> -	if (!(wb_acct & WBT_TRACKED))
> -		return;
> -
> -	rqw = get_rq_wait(rwb, wb_acct);
>  	inflight = atomic_dec_return(&rqw->inflight);
>  
>  	/*
> @@ -166,8 +161,21 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
>  		int diff = limit - inflight;
>  
>  		if (!inflight || diff >= rwb->wb_background / 2)
> -			wake_up(&rqw->wait);
> +			wake_up_all(&rqw->wait);
>  	}
> +
> +}
> +
> +static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
> +{
> +	struct rq_wb *rwb = RQWB(rqos);
> +	struct rq_wait *rqw;
> +
> +	if (!(wb_acct & WBT_TRACKED))
> +		return;
> +
> +	rqw = get_rq_wait(rwb, wb_acct);
> +	wbt_rqw_done(rwb, rqw, wb_acct);
>  }
>  
>  /*
> @@ -481,6 +489,32 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
>  	return limit;
>  }
>  
> +struct wbt_wait_data {
> +	struct task_struct *curr;
> +	struct rq_wb *rwb;
> +	struct rq_wait *rqw;
> +	unsigned long rw;
> +	bool got_token;
> +};
> +
> +static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
> +			     int wake_flags, void *key)
> +{
> +	struct wbt_wait_data *data = curr->private;
> +
> +	/*
> +	 * If we fail to get a budget, return -1 to interrupt the wake up
> +	 * loop in __wake_up_common.
> +	 */
> +	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
> +		return -1;
> +
> +	data->got_token = true;
> +	wake_up_process(data->curr);
> +	list_del_init(&curr->entry);
> +	return 1;
> +}
> +
>  /*
>   * Block if we will exceed our limit, or if we are currently waiting for
>   * the timer to kick off queuing again.
> @@ -491,31 +525,44 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  	__acquires(lock)
>  {
>  	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
> -	DECLARE_WAITQUEUE(wait, current);
> +	struct wbt_wait_data data = {
> +		.curr = current,
> +		.rwb = rwb,
> +		.rqw = rqw,
> +		.rw = rw,
> +	};
> +	struct wait_queue_entry wait = {
> +		.func		= wbt_wake_function,
> +		.private	= &data,
> +		.entry		= LIST_HEAD_INIT(wait.entry),
> +	};
>  	bool has_sleeper;
>  
>  	has_sleeper = wq_has_sleeper(&rqw->wait);
>  	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>  		return;
>  
> -	add_wait_queue_exclusive(&rqw->wait, &wait);
> -	do {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +	prepare_to_wait_exclusive(&rqw->wait, &wait, TASK_UNINTERRUPTIBLE);
>  
> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> -			break;
> +	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
> +		finish_wait(&rqw->wait, &wait);
>  
> -		if (lock) {
> -			spin_unlock_irq(lock);
> -			io_schedule();
> -			spin_lock_irq(lock);
> -		} else
> -			io_schedule();
> -		has_sleeper = false;
> -	} while (1);
> -
> -	__set_current_state(TASK_RUNNING);
> -	remove_wait_queue(&rqw->wait, &wait);
> +		/*
> +		 * We raced with wbt_wake_function() getting a token, which
> +		 * means we now have two. Put ours and wake anyone else
> +		 * potentially waiting for one.
> +		 */
> +		if (data.got_token)
> +			wbt_rqw_done(rwb, rqw, wb_acct);
> +		return;
> +	}
> +
> +	if (lock) {
> +		spin_unlock_irq(lock);
> +		io_schedule();
> +		spin_lock_irq(lock);
> +	} else
> +		io_schedule();

Nitpick, but shouldn't this look like:

+	if (lock) {
+		spin_unlock_irq(lock);
+		io_schedule();
+		spin_lock_irq(lock);
+	} else {
+		io_schedule();
+	}


And another random thought: it would be good to have some sort of tracing for this.

>  }
>  
>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
> 
> -- 
> Jens Axboe
> 
> 

-- 
All the best,
Eduardo Valentin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24 17:14         ` Eduardo Valentin
@ 2018-08-24 17:17           ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 17:17 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: jianchao.wang, linux-block, linux-kernel, Anchal Agarwal,
	Frank van der Linden

On 8/24/18 11:14 AM, Eduardo Valentin wrote:
>> +	if (lock) {
>> +		spin_unlock_irq(lock);
>> +		io_schedule();
>> +		spin_lock_irq(lock);
>> +	} else
>> +		io_schedule();
> 
> Nitpick, but shouldn't this look like:
> 
> +	if (lock) {
> +		spin_unlock_irq(lock);
> +		io_schedule();
> +		spin_lock_irq(lock);
> +	} else {
> +		io_schedule();
> +	}

Depends on who you ask... I prefer the former.

> And another random thought: it would be good to have some sort of
> tracing for this.

wbt does have tracing, but we've never had tracing on the sleep/wakeup
parts. But that's pretty much completely orthogonal to the issue; that
should be done as a separate patch, if useful.
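
If someone does pick that up, I'd imagine something along these lines
(a hypothetical tracepoint, not in the tree, just to sketch the shape;
the existing wbt trace header would be the natural home for it):

	TRACE_EVENT(wbt_wait_sleep,
		TP_PROTO(struct rq_wait *rqw, int inflight, unsigned int limit),
		TP_ARGS(rqw, inflight, limit),
		TP_STRUCT__entry(
			__field(void *, rqw)
			__field(int, inflight)
			__field(unsigned int, limit)
		),
		TP_fast_assign(
			__entry->rqw = rqw;
			__entry->inflight = inflight;
			__entry->limit = limit;
		),
		TP_printk("rqw=%p inflight=%d limit=%u",
			  __entry->rqw, __entry->inflight, __entry->limit)
	);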

BTW, I've now tested this and it seems to work fine for me.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
       [not found]         ` <20180824181223.GA9049@kaos-source-ops-60001.pdx1.amazon.com>
@ 2018-08-24 18:50           ` Jens Axboe
       [not found]             ` <20180824203305.GA4690@kaos-source-ops-60001.pdx1.amazon.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 18:50 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: fllinden, Jianchao Wang, linux-block, linux-kernel, anchalag

On 8/24/18 12:12 PM, Anchal Agarwal wrote:
> That's totally fair. Compared to before the patch it was way too high,
> and my test case wasn't even running due to the thundering herd issues and
> queue re-ordering. Anyway, as I also mentioned before, 10 times the
> contention is not too bad since it's not really affecting the number of
> files read in my application much. Also, you are right, waking up N tasks
> seems plausible.

OK, I'm going to take that as a positive response. I'm going to propose
the last patch as the final addition in this round, since it does fix a
gap in the previous. And I do think that we need to wake as many tasks
as can make progress, otherwise we're deliberately running the device at
a lower load than we should.
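
(Schematically: if the limit is 8 and a completion drops inflight to 4,
there are 4 budgets free; waking a single exclusive waiter claims one
and leaves 3 device slots idle until the next completion comes in.)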

> My application is somewhat similar to a database workload. It also uses
> fsync internally. What it does is create files of random sizes with
> random contents and store the hashes of those files in memory. During the
> test it reads those files back from storage and checks their hashes.

How many tasks are running for your test?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
       [not found]             ` <20180824203305.GA4690@kaos-source-ops-60001.pdx1.amazon.com>
@ 2018-08-24 20:41               ` Jens Axboe
  2018-08-25 15:41                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-24 20:41 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: fllinden, Jianchao Wang, linux-block, linux-kernel

On 8/24/18 2:33 PM, Anchal Agarwal wrote:
> On Fri, Aug 24, 2018 at 12:50:44PM -0600, Jens Axboe wrote:
>> On 8/24/18 12:12 PM, Anchal Agarwal wrote:
>>> That's totally fair. Compared to before the patch it was way too high,
>>> and my test case wasn't even running due to the thundering herd issues and
>>> queue re-ordering. Anyway, as I also mentioned before, 10 times the
>>> contention is not too bad since it's not really affecting the number of
>>> files read in my application much. Also, you are right, waking up N tasks
>>> seems plausible.
>>
>> OK, I'm going to take that as a positive response. I'm going to propose
>> the last patch as the final addition in this round, since it does fix a
>> gap in the previous. And I do think that we need to wake as many tasks
>> as can make progress, otherwise we're deliberately running the device at
>> a lower load than we should.
>>
>>> My application is somewhat similar to a database workload. It also uses
>>> fsync internally. What it does is create files of random sizes with
>>> random contents and store the hashes of those files in memory. During the
>>> test it reads those files back from storage and checks their hashes.
>>
>> How many tasks are running for your test?
>>
>> -- 
>> Jens Axboe
>>
>>
> 
> So there are 128 concurrent reads/writes happening. The max number of
> files written before reads start is 16384, and each file is 512KB. Does
> that answer your question?

Yes it does, thanks. That's not a crazy amount of tasks or threads.

> BTW, I still have to test the last patch you sent, but from looking at
> the patch I assume it will work anyway!

Thanks for the vote of confidence, I'd appreciate it if you would give
it a whirl. Your workload seems nastier than what I test with, so it
would be great to have someone else test it too.
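
For anyone who wants to approximate the workload described above, here
is a minimal sketch (hypothetical and far simpler than the real
application, but the same shape: 128 workers, 128 files each for 16384
files total, 512KB per file, fsync on each write, then read back):

	/* cc -O2 -pthread repro.c -o repro */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define NTHREADS 128
	#define NFILES   128		/* per thread: 128 * 128 = 16384 */
	#define FSIZE    (512 * 1024)	/* 512KB per file, as reported */

	static void *worker(void *arg)
	{
		long id = (long)arg;
		char path[64], *buf = malloc(FSIZE);
		int i, fd;

		if (!buf)
			return NULL;
		/* write phase: create the files and fsync each one */
		for (i = 0; i < NFILES; i++) {
			snprintf(path, sizeof(path), "f.%ld.%d", id, i);
			memset(buf, (int)(i ^ id) & 0xff, FSIZE);
			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
			if (fd < 0)
				break;
			if (write(fd, buf, FSIZE) != FSIZE)
				perror("write");
			fsync(fd);	/* the workload fsyncs internally */
			close(fd);
		}
		/* read phase: read everything back (hash check omitted) */
		for (i = 0; i < NFILES; i++) {
			snprintf(path, sizeof(path), "f.%ld.%d", id, i);
			fd = open(path, O_RDONLY);
			if (fd < 0)
				break;
			if (read(fd, buf, FSIZE) != FSIZE)
				perror("read");
			close(fd);
		}
		free(buf);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		long i;

		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, (void *)i);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);
		return 0;
	}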

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-24 20:41               ` Jens Axboe
@ 2018-08-25 15:41                 ` Jens Axboe
  2018-08-27  3:52                   ` jianchao.wang
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-25 15:41 UTC (permalink / raw)
  To: Anchal Agarwal; +Cc: fllinden, Jianchao Wang, linux-block, linux-kernel

On 8/24/18 2:41 PM, Jens Axboe wrote:
> On 8/24/18 2:33 PM, Anchal Agarwal wrote:
>> On Fri, Aug 24, 2018 at 12:50:44PM -0600, Jens Axboe wrote:
>>> On 8/24/18 12:12 PM, Anchal Agarwal wrote:
>>>> That's totally fair. Compared to before the patch it was way too high,
>>>> and my test case wasn't even running due to the thundering herd issues and
>>>> queue re-ordering. Anyway, as I also mentioned before, 10 times the
>>>> contention is not too bad since it's not really affecting the number of
>>>> files read in my application much. Also, you are right, waking up N tasks
>>>> seems plausible.
>>>
>>> OK, I'm going to take that as a positive response. I'm going to propose
>>> the last patch as the final addition in this round, since it does fix a
>>> gap in the previous. And I do think that we need to wake as many tasks
>>> as can make progress, otherwise we're deliberately running the device at
>>> a lower load than we should.
>>>
>>>> My application is somewhat similar to a database workload. It also uses
>>>> fsync internally. What it does is create files of random sizes with
>>>> random contents and store the hashes of those files in memory. During the
>>>> test it reads those files back from storage and checks their hashes.
>>>
>>> How many tasks are running for your test?
>>>
>>> -- 
>>> Jens Axboe
>>>
>>>
>>
>> So there are 128 concurrent reads/writes happening. The max number of
>> files written before reads start is 16384, and each file is 512KB. Does
>> that answer your question?
> 
> Yes it does, thanks. That's not a crazy amount of tasks or threads.
> 
>> BTW, I still have to test the last patch you sent, but from looking at
>> the patch I assume it will work anyway!
> 
> Thanks for the vote of confidence, I'd appreciate if you would give it a
> whirl. Your workload seems nastier than what I test with, so would be
> great to have someone else test it too.

This is what I have come up with; it actually survives some torture
testing. We do need to have the wait as a loop, since we can't rely on
just being woken from the ->func handler we set. It also handles the
race of getting a token right after prepare_to_wait().
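
For reference, this leans on the __wake_up_common() contract, which is
roughly the following (simplified sketch, bookmark handling left out):

	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret = curr->func(curr, mode, wake_flags, key);

		if (ret < 0)
			break;	/* our -1: out of budget, stop waking */
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}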


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 84507d3e9a98..2442b2b141b8 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -123,16 +123,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
 	}
 }
 
-static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+			 enum wbt_flags wb_acct)
 {
-	struct rq_wb *rwb = RQWB(rqos);
-	struct rq_wait *rqw;
 	int inflight, limit;
 
-	if (!(wb_acct & WBT_TRACKED))
-		return;
-
-	rqw = get_rq_wait(rwb, wb_acct);
 	inflight = atomic_dec_return(&rqw->inflight);
 
 	/*
@@ -166,8 +161,21 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
 		int diff = limit - inflight;
 
 		if (!inflight || diff >= rwb->wb_background / 2)
-			wake_up(&rqw->wait);
+			wake_up_all(&rqw->wait);
 	}
+
+}
+
+static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
+{
+	struct rq_wb *rwb = RQWB(rqos);
+	struct rq_wait *rqw;
+
+	if (!(wb_acct & WBT_TRACKED))
+		return;
+
+	rqw = get_rq_wait(rwb, wb_acct);
+	wbt_rqw_done(rwb, rqw, wb_acct);
 }
 
 /*
@@ -481,6 +489,33 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	return limit;
 }
 
+struct wbt_wait_data {
+	struct wait_queue_entry wq;
+	struct task_struct *curr;
+	struct rq_wb *rwb;
+	struct rq_wait *rqw;
+	unsigned long rw;
+	unsigned long flags;
+};
+
+static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+			     int wake_flags, void *key)
+{
+	struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data, wq);
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up
+	 * loop in __wake_up_common.
+	 */
+	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+		return -1;
+
+	set_bit(0, &data->flags);
+	wake_up_process(data->curr);
+	list_del_init(&curr->entry);
+	return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,19 +526,42 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 	__acquires(lock)
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-	DECLARE_WAITQUEUE(wait, current);
+	struct wbt_wait_data data = {
+		.wq = {
+			.func	= wbt_wake_function,
+			.entry	= LIST_HEAD_INIT(data.wq.entry),
+		},
+		.curr = current,
+		.rwb = rwb,
+		.rqw = rqw,
+		.rw = rw,
+	};
 	bool has_sleeper;
 
 	has_sleeper = wq_has_sleeper(&rqw->wait);
 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
 		return;
 
-	add_wait_queue_exclusive(&rqw->wait, &wait);
+	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
 	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (test_bit(0, &data.flags))
+			break;
 
-		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+		WARN_ON_ONCE(list_empty(&data.wq.entry));
+
+		if (!has_sleeper &&
+		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+			finish_wait(&rqw->wait, &data.wq);
+
+			/*
+			 * We raced with wbt_wake_function() getting a token,
+			 * which means we now have two. Put ours and wake
+			 * anyone else potentially waiting for one.
+			 */
+			if (test_bit(0, &data.flags))
+				wbt_rqw_done(rwb, rqw, wb_acct);
 			break;
+		}
 
 		if (lock) {
 			spin_unlock_irq(lock);
@@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 			spin_lock_irq(lock);
 		} else
 			io_schedule();
+
 		has_sleeper = false;
 	} while (1);
 
-	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(&rqw->wait, &wait);
+	finish_wait(&rqw->wait, &data.wq);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-25 15:41                 ` Jens Axboe
@ 2018-08-27  3:52                   ` jianchao.wang
  2018-08-27  6:15                     ` jianchao.wang
  2018-08-27 15:37                     ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: jianchao.wang @ 2018-08-27  3:52 UTC (permalink / raw)
  To: Jens Axboe, Anchal Agarwal; +Cc: fllinden, linux-block, linux-kernel

Hi Jens

On 08/25/2018 11:41 PM, Jens Axboe wrote:
>  	do {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (test_bit(0, &data.flags))
> +			break;
>  
> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> +		WARN_ON_ONCE(list_empty(&data.wq.entry));
> +
> +		if (!has_sleeper &&
> +		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
> +			finish_wait(&rqw->wait, &data.wq);
> +
> +			/*
> +			 * We raced with wbt_wake_function() getting a token,
> +			 * which means we now have two. Put ours and wake
> +			 * anyone else potentially waiting for one.
> +			 */
> +			if (test_bit(0, &data.flags))
> +				wbt_rqw_done(rwb, rqw, wb_acct);
>  			break;

Just using a 'bool' variable should be OK.
After finish_wait, no one can race with us here.

> +		}
>  
>  		if (lock) {
>  			spin_unlock_irq(lock);
> @@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>  			spin_lock_irq(lock);
>  		} else
>  			io_schedule();
> +
>  		has_sleeper = false;
>  	} while (1);

I cannot get the point of "since we can't rely on just being woken from
the ->func handler we set".
Do you mean that someone else could wake up this task?

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-27  3:52                   ` jianchao.wang
@ 2018-08-27  6:15                     ` jianchao.wang
  2018-08-27 14:51                       ` Jens Axboe
  2018-08-27 15:37                     ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: jianchao.wang @ 2018-08-27  6:15 UTC (permalink / raw)
  To: Jens Axboe, Anchal Agarwal; +Cc: fllinden, linux-block, linux-kernel



On 08/27/2018 11:52 AM, jianchao.wang wrote:
> Hi Jens
> 
> On 08/25/2018 11:41 PM, Jens Axboe wrote:
>>  	do {
>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>> +		if (test_bit(0, &data.flags))
>> +			break;
>>  
>> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>> +		WARN_ON_ONCE(list_empty(&data.wq.entry));
>> +
>> +		if (!has_sleeper &&
>> +		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
>> +			finish_wait(&rqw->wait, &data.wq);
>> +
>> +			/*
>> +			 * We raced with wbt_wake_function() getting a token,
>> +			 * which means we now have two. Put ours and wake
>> +			 * anyone else potentially waiting for one.
>> +			 */
>> +			if (test_bit(0, &data.flags))
>> +				wbt_rqw_done(rwb, rqw, wb_acct);
>>  			break;
> 
> Just using a 'bool' variable should be OK.
> After finish_wait, no one can race with us here.
> 
>> +		}
>>  
>>  		if (lock) {
>>  			spin_unlock_irq(lock);
>> @@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>>  			spin_lock_irq(lock);
>>  		} else
>>  			io_schedule();
>> +
>>  		has_sleeper = false;
>>  	} while (1);
> 
> I cannot get the point of "since we can't rely on just being woken from
> the ->func handler we set".
> Do you mean that someone else could wake up this task?
> 

If we do need a recheck after the io_schedule, we could do it as follows:

static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
		       unsigned long rw, spinlock_t *lock)
	__releases(lock)
	__acquires(lock)
{
	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
	struct wbt_wait_data data = {
		.wq = {
			.func	= wbt_wake_function,
			.entry	= LIST_HEAD_INIT(data.wq.entry),
		},
		.curr = current,
		.rwb = rwb,
		.rqw = rqw,
		.rw = rw,
	};
	bool has_sleeper;
	bool got = false;

retry:
	has_sleeper = wq_has_sleeper(&rqw->wait);
	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
		return;

	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);

	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
		got = true;
		goto out;
	}

	if (lock) {
		spin_unlock_irq(lock);
		io_schedule();
		spin_lock_irq(lock);
	} else
		io_schedule();

out:
	finish_wait(&rqw->wait, &data.wq);

	/*
	 * We raced with wbt_wake_function() getting a token,
	 * which means we now have two. Put ours and wake
	 * anyone else potentially waiting for one.
	 */
	if (data.got && got)
		wbt_rqw_done(rwb, rqw, wb_acct);
	else if (!data.got && !got)
		goto retry;
}

Thanks
Jianchao


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-27  6:15                     ` jianchao.wang
@ 2018-08-27 14:51                       ` Jens Axboe
  2018-08-28  2:52                         ` jianchao.wang
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2018-08-27 14:51 UTC (permalink / raw)
  To: jianchao.wang, Anchal Agarwal; +Cc: fllinden, linux-block, linux-kernel

On 8/27/18 12:15 AM, jianchao.wang wrote:
> 
> 
> On 08/27/2018 11:52 AM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 08/25/2018 11:41 PM, Jens Axboe wrote:
>>>  	do {
>>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>>> +		if (test_bit(0, &data.flags))
>>> +			break;
>>>  
>>> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>>> +		WARN_ON_ONCE(list_empty(&data.wq.entry));
>>> +
>>> +		if (!has_sleeper &&
>>> +		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
>>> +			finish_wait(&rqw->wait, &data.wq);
>>> +
>>> +			/*
>>> +			 * We raced with wbt_wake_function() getting a token,
>>> +			 * which means we now have two. Put ours and wake
>>> +			 * anyone else potentially waiting for one.
>>> +			 */
>>> +			if (test_bit(0, &data.flags))
>>> +				wbt_rqw_done(rwb, rqw, wb_acct);
>>>  			break;
>>
>> Just using a 'bool' variable should be OK.
>> After finish_wait, no one can race with us here.
>>
>>> +		}
>>>  
>>>  		if (lock) {
>>>  			spin_unlock_irq(lock);
>>> @@ -511,11 +569,11 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
>>>  			spin_lock_irq(lock);
>>>  		} else
>>>  			io_schedule();
>>> +
>>>  		has_sleeper = false;
>>>  	} while (1);
>>
>> I cannot get the point of "since we can't rely on just being woken from
>> the ->func handler we set".
>> Do you mean that someone else could wake up this task?

Yeah, you don't know for a fact that the wbt wait queue is the only
guy waking us up. Any sleep like this needs a loop. It was quite
easy to reproduce for me, and as expected, you'll get list corruption
on the wait queue since we leave it on the list and the stack goes
away.
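
In other words, any bare io_schedule() needs to sit inside the canonical
wait loop, schematically:

	for (;;) {
		prepare_to_wait(&wq_head, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)		/* recheck after every wakeup */
			break;
		io_schedule();
	}
	finish_wait(&wq_head, &wait);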

> If we do need a recheck after the io_schedule, we could do it as follows:
> 
> static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
> 		       unsigned long rw, spinlock_t *lock)
> 	__releases(lock)
> 	__acquires(lock)
> {
> 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
> 	struct wbt_wait_data data = {
> 		.wq = {
> 			.func	= wbt_wake_function,
> 			.entry	= LIST_HEAD_INIT(data.wq.entry),
> 		},
> 		.curr = current,
> 		.rwb = rwb,
> 		.rqw = rqw,
> 		.rw = rw,
> 	};
> 	bool has_sleeper;
> 	bool got = false;
> 
> retry:
> 	has_sleeper = wq_has_sleeper(&rqw->wait);
> 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
> 		return;
> 
> 	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
> 
> 	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
> 		got = true;
> 		goto out;
> 	}
> 
> 	if (lock) {
> 		spin_unlock_irq(lock);
> 		io_schedule();
> 		spin_lock_irq(lock);
> 	} else
> 		io_schedule();
> 
> out:
> 	finish_wait(&rqw->wait, &data.wq);
> 
> 	/*
> 	 * We raced with wbt_wake_function() getting a token,
> 	 * which means we now have two. Put ours and wake
> 	 * anyone else potentially waiting for one.
> 	 */
> 	if (data.got && got)
> 		wbt_rqw_done(rwb, rqw, wb_acct);
> 	else if (!data.got && !got)
> 		goto retry;
> }

I think the other variant is cleaner and easier to read. This is just
a natural loop; I don't think we need to use gotos here.

FWIW, I split it into two patches, current version is here:

http://git.kernel.dk/cgit/linux-block/log/?h=for-linus

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-27  3:52                   ` jianchao.wang
  2018-08-27  6:15                     ` jianchao.wang
@ 2018-08-27 15:37                     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2018-08-27 15:37 UTC (permalink / raw)
  To: jianchao.wang, Anchal Agarwal; +Cc: fllinden, linux-block, linux-kernel

On 8/26/18 9:52 PM, jianchao.wang wrote:
> Hi Jens
> 
> On 08/25/2018 11:41 PM, Jens Axboe wrote:
>>  	do {
>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>> +		if (test_bit(0, &data.flags))
>> +			break;
>>  
>> -		if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
>> +		WARN_ON_ONCE(list_empty(&data.wq.entry));
>> +
>> +		if (!has_sleeper &&
>> +		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
>> +			finish_wait(&rqw->wait, &data.wq);
>> +
>> +			/*
>> +			 * We raced with wbt_wake_function() getting a token,
>> +			 * which means we now have two. Put ours and wake
>> +			 * anyone else potentially waiting for one.
>> +			 */
>> +			if (test_bit(0, &data.flags))
>> +				wbt_rqw_done(rwb, rqw, wb_acct);
>>  			break;
> 
> Just using a 'bool' variable should be OK.
> After finish_wait, no one can race with us here.

I was mostly concerned with the check after prepare_to_wait(), but since
the spin unlock should include a barrier, the bool should actually be
enough. I'll change that back once I test.
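
IOW, roughly back to the earlier got_token variant, something like this
(sketch):

	struct wbt_wait_data {
		...
		bool got_token;
	};

	/* in wbt_wake_function(), before wake_up_process() */
	data->got_token = true;

	/* in __wbt_wait(), after finish_wait() */
	if (data.got_token)
		wbt_rqw_done(rwb, rqw, wb_acct);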

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
  2018-08-27 14:51                       ` Jens Axboe
@ 2018-08-28  2:52                         ` jianchao.wang
  0 siblings, 0 replies; 22+ messages in thread
From: jianchao.wang @ 2018-08-28  2:52 UTC (permalink / raw)
  To: Jens Axboe, Anchal Agarwal; +Cc: fllinden, linux-block, linux-kernel

Hi Jens

On 08/27/2018 10:51 PM, Jens Axboe wrote:
>>> I cannot get the point of "since we can't rely on just being woken from
>>> the ->func handler we set".
>>> Do you mean that someone else could wake up this task?
> Yeah, you don't know for a fact that the wbt wait queue is the only
> guy waking us up. Any sleep like this needs a loop. It was quite
> easy to reproduce for me, and as expected, you'll get list corruption
> on the wait queue since we leave it on the list and the stack goes
> away.
> 
Yes, got it. Thanks for your kind response. :)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2018-08-23 13:08 [PATCH] blk-wbt: get back the missed wakeup from __wbt_done Jianchao Wang
2018-08-23 15:37 ` Jens Axboe
2018-08-23 16:24   ` van der Linden, Frank
     [not found]     ` <20180823210144.GB5624@kaos-source-ops-60001.pdx1.amazon.com>
2018-08-23 23:03       ` Jens Axboe
2018-08-23 23:14         ` Jens Axboe
2018-08-24  5:55           ` jianchao.wang
2018-08-24 16:40             ` van der Linden, Frank
2018-08-24 16:44               ` Jens Axboe
     [not found]         ` <20180824181223.GA9049@kaos-source-ops-60001.pdx1.amazon.com>
2018-08-24 18:50           ` Jens Axboe
     [not found]             ` <20180824203305.GA4690@kaos-source-ops-60001.pdx1.amazon.com>
2018-08-24 20:41               ` Jens Axboe
2018-08-25 15:41                 ` Jens Axboe
2018-08-27  3:52                   ` jianchao.wang
2018-08-27  6:15                     ` jianchao.wang
2018-08-27 14:51                       ` Jens Axboe
2018-08-28  2:52                         ` jianchao.wang
2018-08-27 15:37                     ` Jens Axboe
2018-08-23 15:42 ` Jens Axboe
2018-08-24  2:06   ` jianchao.wang
2018-08-24 14:40     ` Jens Axboe
2018-08-24 14:58       ` Jens Axboe
2018-08-24 17:14         ` Eduardo Valentin
2018-08-24 17:17           ` Jens Axboe
