[PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
@ 2018-05-19  7:44 Ming Lei
  2018-05-22 20:20 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ming Lei @ 2018-05-19  7:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei, stable, Omar Sandoval

When the allocation process is scheduled back and the mapped hw queue is
changed, do one extra wake up on orignal queue for compensating wake up
miss, so other allocations on the orignal queue won't be starved.

This patch fixes one request allocation hang issue, which can be
triggered easily in case of very low nr_request.

Cc: <stable@vger.kernel.org>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---

V2:
	fix build failure

 block/blk-mq-tag.c      | 13 +++++++++++++
 include/linux/sbitmap.h |  7 +++++++
 lib/sbitmap.c           |  6 ++++++
 3 files changed, 26 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..77607f89d205 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	ws = bt_wait_ptr(bt, data->hctx);
 	drop_ctx = data->ctx == NULL;
 	do {
+		struct sbitmap_queue *bt_orig;
+
 		/*
 		 * We're out of tags on this hardware queue, kick any
 		 * pending IO submits before going to sleep waiting for
@@ -159,6 +161,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (data->ctx)
 			blk_mq_put_ctx(data->ctx);
 
+		bt_orig = bt;
 		io_schedule();
 
 		data->ctx = blk_mq_get_ctx(data->q);
@@ -170,6 +173,16 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			bt = &tags->bitmap_tags;
 
 		finish_wait(&ws->wait, &wait);
+
+		/*
+		 * If destination hw queue is changed, wake up original
+		 * queue one extra time for compensating the wake up
+		 * miss, so other allocations on original queue won't
+		 * be starved.
+		 */
+		if (bt != bt_orig)
+			sbitmap_queue_wake_up(bt_orig);
+
 		ws = bt_wait_ptr(bt, data->hctx);
 	} while (1);
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 841585f6e5f2..b23f50355281 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
 void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
 
 /**
+ * sbitmap_wake_up() - Do a regular wake up compensation if the queue
+ * allocated from is changed after scheduling back.
+ * @sbq: Bitmap queue to wake up.
+ */
+void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
+
+/**
  * sbitmap_queue_show() - Dump &struct sbitmap_queue information to a &struct
  * seq_file.
  * @sbq: Bitmap queue to show.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index e6a9c06ec70c..c6ae4206bcb1 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -466,6 +466,12 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
 	}
 }
 
+void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
+{
+	sbq_wake_up(sbq);
+}
+EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
+
 void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-19  7:44 [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates Ming Lei
@ 2018-05-22 20:20 ` Jens Axboe
  2018-05-23  0:22   ` Ming Lei
  2018-05-23  3:59 ` Jens Axboe
  2018-05-23 17:40 ` Omar Sandoval
  2 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2018-05-22 20:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, stable, Omar Sandoval

On 5/19/18 1:44 AM, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, do one extra wake up on orignal queue for compensating wake up
> miss, so other allocations on the orignal queue won't be starved.
> 
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_request.

Can you explain what the actual bug is? The commit message doesn't
really say, it just explains what the code change does. What wake up
miss?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-22 20:20 ` Jens Axboe
@ 2018-05-23  0:22   ` Ming Lei
  2018-05-23  3:35     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-05-23  0:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, stable, Omar Sandoval

On Tue, May 22, 2018 at 02:20:37PM -0600, Jens Axboe wrote:
> On 5/19/18 1:44 AM, Ming Lei wrote:
> > When the allocation process is scheduled back and the mapped hw queue is
> > changed, do one extra wake up on orignal queue for compensating wake up
> > miss, so other allocations on the orignal queue won't be starved.
> > 
> > This patch fixes one request allocation hang issue, which can be
> > triggered easily in case of very low nr_request.
> 
> Can you explain what the actual bug is? The commit message doesn't
> really say, it just explains what the code change does. What wake up
> miss?

For example:

1) one request queue has 2 queues, and queue depth is low, so wake_batch
is set 1 or very small.

2) There are 2 allocations which are blocked on hw queue 0, now the last
request mapped to hw queue 0 is completed, then sbq_wake_up() wakes only
one of 2 blockers.

3) the waken up allocation process is migrated to another CPU, and its
hw queue becomes mapped to hw queue 1, and this allocation is switched
to hw queue 1

4) so there isn't any chance for another blocked allocation to move one,
since all request in hw queue 0 has been completed. That is why I call
it wake up miss.

BTW, this issue can be reproduced reliably by the block/020 I posted
yesterday:

https://marc.info/?l=linux-block&m=152698758418536&w=2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23  0:22   ` Ming Lei
@ 2018-05-23  3:35     ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2018-05-23  3:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, stable, Omar Sandoval

On 5/22/18 6:22 PM, Ming Lei wrote:
> On Tue, May 22, 2018 at 02:20:37PM -0600, Jens Axboe wrote:
>> On 5/19/18 1:44 AM, Ming Lei wrote:
>>> When the allocation process is scheduled back and the mapped hw queue is
>>> changed, do one extra wake up on orignal queue for compensating wake up
>>> miss, so other allocations on the orignal queue won't be starved.
>>>
>>> This patch fixes one request allocation hang issue, which can be
>>> triggered easily in case of very low nr_request.
>>
>> Can you explain what the actual bug is? The commit message doesn't
>> really say, it just explains what the code change does. What wake up
>> miss?
> 
> For example:
> 
> 1) one request queue has 2 queues, and queue depth is low, so wake_batch
> is set 1 or very small.

We have too many 'queues' :)

> 2) There are 2 allocations which are blocked on hw queue 0, now the last
> request mapped to hw queue 0 is completed, then sbq_wake_up() wakes only
> one of 2 blockers.
> 
> 3) the waken up allocation process is migrated to another CPU, and its
> hw queue becomes mapped to hw queue 1, and this allocation is switched
> to hw queue 1
> 
> 4) so there isn't any chance for another blocked allocation to move one,
> since all request in hw queue 0 has been completed. That is why I call
> it wake up miss.

OK, that makes sense to me. With that, let me review your patch in
a different email.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-19  7:44 [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates Ming Lei
  2018-05-22 20:20 ` Jens Axboe
@ 2018-05-23  3:59 ` Jens Axboe
  2018-05-23  9:32   ` Ming Lei
  2018-05-23 17:40 ` Omar Sandoval
  2 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2018-05-23  3:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Omar Sandoval

On 5/19/18 1:44 AM, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, do one extra wake up on orignal queue for compensating wake up
> miss, so other allocations on the orignal queue won't be starved.
> 
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_request.

Trying to think of better ways we can fix this, but I don't see
any right now. Getting rid of the wake_up_nr() kills us on tons
of tasks waiting. Maybe it might be possible to only go through
the fake wakeup IFF we have a task waiting on the list, that'd
spare us the atomic dec and cmpxchg for all cases except if we
have a task (or more) waiting on the existing wait state.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..77607f89d205 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	ws = bt_wait_ptr(bt, data->hctx);
>  	drop_ctx = data->ctx == NULL;
>  	do {
> +		struct sbitmap_queue *bt_orig;

This should be called 'bt_prev'.

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 841585f6e5f2..b23f50355281 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
>  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
>  
>  /**
> + * sbitmap_wake_up() - Do a regular wake up compensation if the queue
> + * allocated from is changed after scheduling back.
> + * @sbq: Bitmap queue to wake up.
> + */
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);

The blk-mq issue is bleeding into sbitmap here. This should just detail
that this issues a wakeup, similar to how freeing a tag would

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23  3:59 ` Jens Axboe
@ 2018-05-23  9:32   ` Ming Lei
  2018-05-23 17:48     ` Omar Sandoval
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-05-23  9:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
> On 5/19/18 1:44 AM, Ming Lei wrote:
> > When the allocation process is scheduled back and the mapped hw queue is
> > changed, do one extra wake up on orignal queue for compensating wake up
> > miss, so other allocations on the orignal queue won't be starved.
> > 
> > This patch fixes one request allocation hang issue, which can be
> > triggered easily in case of very low nr_request.
> 
> Trying to think of better ways we can fix this, but I don't see
> any right now. Getting rid of the wake_up_nr() kills us on tons
> of tasks waiting. 

I am not sure if I understand your point, but this issue isn't related
with wake_up_nr() actually, and it can be reproduced after reverting
4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).

All tasks in current sbq_wait_state may be scheduled to other CPUs, and
there may still be tasks waiting for allocation from this sbitmap_queue,
and the root cause is about cross-queue allocation, as you said,
there are too many queues, :-)

> Maybe it might be possible to only go through
> the fake wakeup IFF we have a task waiting on the list, that'd
> spare us the atomic dec and cmpxchg for all cases except if we
> have a task (or more) waiting on the existing wait state.

sbq_wake_ptr() checks if there are waiting tasks, and it will do
nothing if there isn't any.

In theory, the fake wakeup is only needed when the current wakeup
is triggered by the last in-flight request, but it isn't cheap
to figure out that accurately.

Given it is quite unusual to schedule process from one CPU to another,
and the cost of process migration can be much bigger than the wakeup,
I guess we may not need to worry about the performance effect.

> 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 336dde07b230..77607f89d205 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >  	ws = bt_wait_ptr(bt, data->hctx);
> >  	drop_ctx = data->ctx == NULL;
> >  	do {
> > +		struct sbitmap_queue *bt_orig;
> 
> This should be called 'bt_prev'.

OK.

> 
> > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> > index 841585f6e5f2..b23f50355281 100644
> > --- a/include/linux/sbitmap.h
> > +++ b/include/linux/sbitmap.h
> > @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
> >  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
> >  
> >  /**
> > + * sbitmap_wake_up() - Do a regular wake up compensation if the queue
> > + * allocated from is changed after scheduling back.
> > + * @sbq: Bitmap queue to wake up.
> > + */
> > +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
> 
> The blk-mq issue is bleeding into sbitmap here. This should just detail
> that this issues a wakeup, similar to how freeing a tag would

Right, will fix it in V3.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-19  7:44 [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates Ming Lei
  2018-05-22 20:20 ` Jens Axboe
  2018-05-23  3:59 ` Jens Axboe
@ 2018-05-23 17:40 ` Omar Sandoval
  2018-05-24  2:20   ` Ming Lei
  2 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2018-05-23 17:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, stable, Omar Sandoval

On Sat, May 19, 2018 at 03:44:06PM +0800, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, do one extra wake up on orignal queue for compensating wake up
> miss, so other allocations on the orignal queue won't be starved.
> 
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_request.
> 
> Cc: <stable@vger.kernel.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> 
> V2:
> 	fix build failure
> 
>  block/blk-mq-tag.c      | 13 +++++++++++++
>  include/linux/sbitmap.h |  7 +++++++
>  lib/sbitmap.c           |  6 ++++++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..77607f89d205 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	ws = bt_wait_ptr(bt, data->hctx);
>  	drop_ctx = data->ctx == NULL;
>  	do {
> +		struct sbitmap_queue *bt_orig;
> +
>  		/*
>  		 * We're out of tags on this hardware queue, kick any
>  		 * pending IO submits before going to sleep waiting for
> @@ -159,6 +161,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  		if (data->ctx)
>  			blk_mq_put_ctx(data->ctx);
>  
> +		bt_orig = bt;
>  		io_schedule();
>  
>  		data->ctx = blk_mq_get_ctx(data->q);
> @@ -170,6 +173,16 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  			bt = &tags->bitmap_tags;
>  
>  		finish_wait(&ws->wait, &wait);
> +
> +		/*
> +		 * If destination hw queue is changed, wake up original
> +		 * queue one extra time for compensating the wake up
> +		 * miss, so other allocations on original queue won't
> +		 * be starved.
> +		 */
> +		if (bt != bt_orig)
> +			sbitmap_queue_wake_up(bt_orig);
> +
>  		ws = bt_wait_ptr(bt, data->hctx);
>  	} while (1);
>  
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 841585f6e5f2..b23f50355281 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
>  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
>  
>  /**
> + * sbitmap_wake_up() - Do a regular wake up compensation if the queue
> + * allocated from is changed after scheduling back.
> + * @sbq: Bitmap queue to wake up.
> + */
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
> +
> +/**
>   * sbitmap_queue_show() - Dump &struct sbitmap_queue information to a &struct
>   * seq_file.
>   * @sbq: Bitmap queue to show.
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index e6a9c06ec70c..c6ae4206bcb1 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -466,6 +466,12 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
>  	}
>  }
>  
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
> +{
> +	sbq_wake_up(sbq);
> +}
> +EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);

There's no reason to wrap an internal helper with a function taking the
same arguments, just rename sbq_wake_up() and export it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23  9:32   ` Ming Lei
@ 2018-05-23 17:48     ` Omar Sandoval
  2018-05-23 22:09       ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2018-05-23 17:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Omar Sandoval

On Wed, May 23, 2018 at 05:32:31PM +0800, Ming Lei wrote:
> On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
> > On 5/19/18 1:44 AM, Ming Lei wrote:
> > > When the allocation process is scheduled back and the mapped hw queue is
> > > changed, do one extra wake up on orignal queue for compensating wake up
> > > miss, so other allocations on the orignal queue won't be starved.
> > > 
> > > This patch fixes one request allocation hang issue, which can be
> > > triggered easily in case of very low nr_request.
> > 
> > Trying to think of better ways we can fix this, but I don't see
> > any right now. Getting rid of the wake_up_nr() kills us on tons
> > of tasks waiting. 
> 
> I am not sure if I understand your point, but this issue isn't related
> with wake_up_nr() actually, and it can be reproduced after reverting
> 4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).
> 
> All tasks in current sbq_wait_state may be scheduled to other CPUs, and
> there may still be tasks waiting for allocation from this sbitmap_queue,
> and the root cause is about cross-queue allocation, as you said,
> there are too many queues, :-)

I don't follow. Your description of the problem was that we have two
waiters and only wake up one, which doesn't in turn allocate and free a
tag and wake up the second waiter. Changing it back to wake_up_nr()
eliminates that problem. And if waking up everything doesn't fix it, how
does your fix of waking up a few extra tasks fix it?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23 17:48     ` Omar Sandoval
@ 2018-05-23 22:09       ` Ming Lei
  2018-05-23 22:19         ` Jens Axboe
  2018-05-23 22:47         ` Omar Sandoval
  0 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2018-05-23 22:09 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Ming Lei, Jens Axboe, linux-block, Omar Sandoval

On Thu, May 24, 2018 at 1:48 AM, Omar Sandoval <osandov@osandov.com> wrote:
> On Wed, May 23, 2018 at 05:32:31PM +0800, Ming Lei wrote:
>> On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
>> > On 5/19/18 1:44 AM, Ming Lei wrote:
>> > > When the allocation process is scheduled back and the mapped hw queue is
>> > > changed, do one extra wake up on orignal queue for compensating wake up
>> > > miss, so other allocations on the orignal queue won't be starved.
>> > >
>> > > This patch fixes one request allocation hang issue, which can be
>> > > triggered easily in case of very low nr_request.
>> >
>> > Trying to think of better ways we can fix this, but I don't see
>> > any right now. Getting rid of the wake_up_nr() kills us on tons
>> > of tasks waiting.
>>
>> I am not sure if I understand your point, but this issue isn't related
>> with wake_up_nr() actually, and it can be reproduced after reverting
>> 4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).
>>
>> All tasks in current sbq_wait_state may be scheduled to other CPUs, and
>> there may still be tasks waiting for allocation from this sbitmap_queue,
>> and the root cause is about cross-queue allocation, as you said,
>> there are too many queues, :-)
>
> I don't follow. Your description of the problem was that we have two
> waiters and only wake up one, which doesn't in turn allocate and free a
> tag and wake up the second waiter. Changing it back to wake_up_nr()
> eliminates that problem. And if waking up everything doesn't fix it, how
> does your fix of waking up a few extra tasks fix it?

What matters is that this patch wakes up the previous sbq, let's see if
from another view:

1) still 2 hw queues, nr_requests are 2, and wake_batch is one

2) there are 3 waiters on hw queue 0

3) two in-flight requests in hw queue 0 are completed, and only two waiters
of 3 are waken up because of wake_batch, but both the two waiters can be
scheduled to another CPU and cause to switch to hw queue 1

4) then the 3rd waiter will wait for ever, since no in-flight request
is in hw queue
0 any more.

5) this patch fixes it by the fake wakeup when waiter is scheduled to another
hw queue

The issue can be understood a bit easier if we just forget sbq_wait_state and
focus on sbq, :-)

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23 22:09       ` Ming Lei
@ 2018-05-23 22:19         ` Jens Axboe
  2018-05-23 22:56           ` Ming Lei
  2018-05-23 22:47         ` Omar Sandoval
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2018-05-23 22:19 UTC (permalink / raw)
  To: Ming Lei, Omar Sandoval; +Cc: Ming Lei, linux-block, Omar Sandoval

On 5/23/18 4:09 PM, Ming Lei wrote:
> On Thu, May 24, 2018 at 1:48 AM, Omar Sandoval <osandov@osandov.com> wrote:
>> On Wed, May 23, 2018 at 05:32:31PM +0800, Ming Lei wrote:
>>> On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
>>>> On 5/19/18 1:44 AM, Ming Lei wrote:
>>>>> When the allocation process is scheduled back and the mapped hw queue is
>>>>> changed, do one extra wake up on orignal queue for compensating wake up
>>>>> miss, so other allocations on the orignal queue won't be starved.
>>>>>
>>>>> This patch fixes one request allocation hang issue, which can be
>>>>> triggered easily in case of very low nr_request.
>>>>
>>>> Trying to think of better ways we can fix this, but I don't see
>>>> any right now. Getting rid of the wake_up_nr() kills us on tons
>>>> of tasks waiting.
>>>
>>> I am not sure if I understand your point, but this issue isn't related
>>> with wake_up_nr() actually, and it can be reproduced after reverting
>>> 4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).
>>>
>>> All tasks in current sbq_wait_state may be scheduled to other CPUs, and
>>> there may still be tasks waiting for allocation from this sbitmap_queue,
>>> and the root cause is about cross-queue allocation, as you said,
>>> there are too many queues, :-)
>>
>> I don't follow. Your description of the problem was that we have two
>> waiters and only wake up one, which doesn't in turn allocate and free a
>> tag and wake up the second waiter. Changing it back to wake_up_nr()
>> eliminates that problem. And if waking up everything doesn't fix it, how
>> does your fix of waking up a few extra tasks fix it?
> 
> What matters is that this patch wakes up the previous sbq, let's see if
> from another view:
> 
> 1) still 2 hw queues, nr_requests are 2, and wake_batch is one
> 
> 2) there are 3 waiters on hw queue 0
> 
> 3) two in-flight requests in hw queue 0 are completed, and only two waiters
> of 3 are waken up because of wake_batch, but both the two waiters can be
> scheduled to another CPU and cause to switch to hw queue 1
> 
> 4) then the 3rd waiter will wait for ever, since no in-flight request
> is in hw queue
> 0 any more.
> 
> 5) this patch fixes it by the fake wakeup when waiter is scheduled to another
> hw queue
> 
> The issue can be understood a bit easier if we just forget sbq_wait_state and
> focus on sbq, :-)

It makes sense to me, and also explains why wake_up() vs wake_up_nr() doesn't
matter. Which is actually a relief. And the condition of moving AND having
a waiter should be rare enough that it'll work out fine in practice, I don't
see any performance implications from this. You're right that we already
abort early if we don't have pending waiters, so it's all good.

Can you respin with the comments from Omar and myself covered?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23 22:09       ` Ming Lei
  2018-05-23 22:19         ` Jens Axboe
@ 2018-05-23 22:47         ` Omar Sandoval
  1 sibling, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2018-05-23 22:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Ming Lei, Jens Axboe, linux-block, Omar Sandoval

On Thu, May 24, 2018 at 06:09:51AM +0800, Ming Lei wrote:
> On Thu, May 24, 2018 at 1:48 AM, Omar Sandoval <osandov@osandov.com> wrote:
> > On Wed, May 23, 2018 at 05:32:31PM +0800, Ming Lei wrote:
> >> On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
> >> > On 5/19/18 1:44 AM, Ming Lei wrote:
> >> > > When the allocation process is scheduled back and the mapped hw queue is
> >> > > changed, do one extra wake up on orignal queue for compensating wake up
> >> > > miss, so other allocations on the orignal queue won't be starved.
> >> > >
> >> > > This patch fixes one request allocation hang issue, which can be
> >> > > triggered easily in case of very low nr_request.
> >> >
> >> > Trying to think of better ways we can fix this, but I don't see
> >> > any right now. Getting rid of the wake_up_nr() kills us on tons
> >> > of tasks waiting.
> >>
> >> I am not sure if I understand your point, but this issue isn't related
> >> with wake_up_nr() actually, and it can be reproduced after reverting
> >> 4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).
> >>
> >> All tasks in current sbq_wait_state may be scheduled to other CPUs, and
> >> there may still be tasks waiting for allocation from this sbitmap_queue,
> >> and the root cause is about cross-queue allocation, as you said,
> >> there are too many queues, :-)
> >
> > I don't follow. Your description of the problem was that we have two
> > waiters and only wake up one, which doesn't in turn allocate and free a
> > tag and wake up the second waiter. Changing it back to wake_up_nr()
> > eliminates that problem. And if waking up everything doesn't fix it, how
> > does your fix of waking up a few extra tasks fix it?
> 
> What matters is that this patch wakes up the previous sbq, let's see if
> from another view:
> 
> 1) still 2 hw queues, nr_requests are 2, and wake_batch is one
> 
> 2) there are 3 waiters on hw queue 0
> 
> 3) two in-flight requests in hw queue 0 are completed, and only two waiters
> of 3 are waken up because of wake_batch, but both the two waiters can be
> scheduled to another CPU and cause to switch to hw queue 1
> 
> 4) then the 3rd waiter will wait for ever, since no in-flight request
> is in hw queue
> 0 any more.
> 
> 5) this patch fixes it by the fake wakeup when waiter is scheduled to another
> hw queue
> 
> The issue can be understood a bit easier if we just forget sbq_wait_state and
> focus on sbq, :-)

Okay, I see, although if I'm understanding correctly, it has everything
to do with sbq_wait_state. If we only had one waitqueue, then wake_up()
would always wake up all of the waiters, but because we have them spread
out over multiple waitqueues, we have to call sbq_wake_up()/sbitmap_queue_wake_up()
to do the wake up on the other waitqueue.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23 22:19         ` Jens Axboe
@ 2018-05-23 22:56           ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2018-05-23 22:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Ming Lei, linux-block, Omar Sandoval

[-- Attachment #1: Type: text/plain, Size: 3161 bytes --]

Jens Axboe <axboe@kernel.dk> 于 2018年5月24日周四 上午6:19写道：

> On 5/23/18 4:09 PM, Ming Lei wrote:
> > On Thu, May 24, 2018 at 1:48 AM, Omar Sandoval <osandov@osandov.com>
> wrote:
> >> On Wed, May 23, 2018 at 05:32:31PM +0800, Ming Lei wrote:
> >>> On Tue, May 22, 2018 at 09:59:17PM -0600, Jens Axboe wrote:
> >>>> On 5/19/18 1:44 AM, Ming Lei wrote:
> >>>>> When the allocation process is scheduled back and the mapped hw
> queue is
> >>>>> changed, do one extra wake up on orignal queue for compensating wake
> up
> >>>>> miss, so other allocations on the orignal queue won't be starved.
> >>>>>
> >>>>> This patch fixes one request allocation hang issue, which can be
> >>>>> triggered easily in case of very low nr_request.
> >>>>
> >>>> Trying to think of better ways we can fix this, but I don't see
> >>>> any right now. Getting rid of the wake_up_nr() kills us on tons
> >>>> of tasks waiting.
> >>>
> >>> I am not sure if I understand your point, but this issue isn't related
> >>> with wake_up_nr() actually, and it can be reproduced after reverting
> >>> 4e5dff41be7b5201c1c47c (blk-mq: improve heavily contended tag case).
> >>>
> >>> All tasks in current sbq_wait_state may be scheduled to other CPUs, and
> >>> there may still be tasks waiting for allocation from this
> sbitmap_queue,
> >>> and the root cause is about cross-queue allocation, as you said,
> >>> there are too many queues, :-)
> >>
> >> I don't follow. Your description of the problem was that we have two
> >> waiters and only wake up one, which doesn't in turn allocate and free a
> >> tag and wake up the second waiter. Changing it back to wake_up_nr()
> >> eliminates that problem. And if waking up everything doesn't fix it, how
> >> does your fix of waking up a few extra tasks fix it?
> >
> > What matters is that this patch wakes up the previous sbq, let's see if
> > from another view:
> >
> > 1) still 2 hw queues, nr_requests are 2, and wake_batch is one
> >
> > 2) there are 3 waiters on hw queue 0
> >
> > 3) two in-flight requests in hw queue 0 are completed, and only two
> waiters
> > of 3 are waken up because of wake_batch, but both the two waiters can be
> > scheduled to another CPU and cause to switch to hw queue 1
> >
> > 4) then the 3rd waiter will wait for ever, since no in-flight request
> > is in hw queue
> > 0 any more.
> >
> > 5) this patch fixes it by the fake wakeup when waiter is scheduled to
> another
> > hw queue
> >
> > The issue can be understood a bit easier if we just forget
> sbq_wait_state and
> > focus on sbq, :-)
>
> It makes sense to me, and also explains why wake_up() vs wake_up_nr()
> doesn't
> matter. Which is actually a relief. And the condition of moving AND having
> a waiter should be rare enough that it'll work out fine in practice, I
> don't
> see any performance implications from this. You're right that we already
> abort early if we don't have pending waiters, so it's all good.
>
> Can you respin with the comments from Omar and myself covered?
>

OK, will do it after returning from outside.


> --
> Jens Axboe
>
>

[-- Attachment #2: Type: text/html, Size: 4236 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates
  2018-05-23 17:40 ` Omar Sandoval
@ 2018-05-24  2:20   ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2018-05-24  2:20 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Ming Lei, Jens Axboe, linux-block, stable, Omar Sandoval

On Thu, May 24, 2018 at 1:40 AM, Omar Sandoval <osandov@osandov.com> wrote:
> On Sat, May 19, 2018 at 03:44:06PM +0800, Ming Lei wrote:
>> When the allocation process is scheduled back and the mapped hw queue is
>> changed, do one extra wake up on orignal queue for compensating wake up
>> miss, so other allocations on the orignal queue won't be starved.
>>
>> This patch fixes one request allocation hang issue, which can be
>> triggered easily in case of very low nr_request.
>>
>> Cc: <stable@vger.kernel.org>
>> Cc: Omar Sandoval <osandov@fb.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> ---
>>
>> V2:
>>       fix build failure
>>
>>  block/blk-mq-tag.c      | 13 +++++++++++++
>>  include/linux/sbitmap.h |  7 +++++++
>>  lib/sbitmap.c           |  6 ++++++
>>  3 files changed, 26 insertions(+)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 336dde07b230..77607f89d205 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>       ws = bt_wait_ptr(bt, data->hctx);
>>       drop_ctx = data->ctx == NULL;
>>       do {
>> +             struct sbitmap_queue *bt_orig;
>> +
>>               /*
>>                * We're out of tags on this hardware queue, kick any
>>                * pending IO submits before going to sleep waiting for
>> @@ -159,6 +161,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>               if (data->ctx)
>>                       blk_mq_put_ctx(data->ctx);
>>
>> +             bt_orig = bt;
>>               io_schedule();
>>
>>               data->ctx = blk_mq_get_ctx(data->q);
>> @@ -170,6 +173,16 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>                       bt = &tags->bitmap_tags;
>>
>>               finish_wait(&ws->wait, &wait);
>> +
>> +             /*
>> +              * If destination hw queue is changed, wake up original
>> +              * queue one extra time for compensating the wake up
>> +              * miss, so other allocations on original queue won't
>> +              * be starved.
>> +              */
>> +             if (bt != bt_orig)
>> +                     sbitmap_queue_wake_up(bt_orig);
>> +
>>               ws = bt_wait_ptr(bt, data->hctx);
>>       } while (1);
>>
>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index 841585f6e5f2..b23f50355281 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
>>  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
>>
>>  /**
>> + * sbitmap_wake_up() - Do a regular wake up compensation if the queue
>> + * allocated from is changed after scheduling back.
>> + * @sbq: Bitmap queue to wake up.
>> + */
>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
>> +
>> +/**
>>   * sbitmap_queue_show() - Dump &struct sbitmap_queue information to a &struct
>>   * seq_file.
>>   * @sbq: Bitmap queue to show.
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index e6a9c06ec70c..c6ae4206bcb1 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -466,6 +466,12 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
>>       }
>>  }
>>
>> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
>> +{
>> +     sbq_wake_up(sbq);
>> +}
>> +EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
>
> There's no reason to wrap an internal helper with a function taking the
> same arguments, just rename sbq_wake_up() and export it.

This way may keep sbq_wake_up() inlined to sbitmap_queue_clear(), so
we won't introduce any cost to fast path.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-05-24  2:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-19  7:44 [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates Ming Lei
2018-05-22 20:20 ` Jens Axboe
2018-05-23  0:22   ` Ming Lei
2018-05-23  3:35     ` Jens Axboe
2018-05-23  3:59 ` Jens Axboe
2018-05-23  9:32   ` Ming Lei
2018-05-23 17:48     ` Omar Sandoval
2018-05-23 22:09       ` Ming Lei
2018-05-23 22:19         ` Jens Axboe
2018-05-23 22:56           ` Ming Lei
2018-05-23 22:47         ` Omar Sandoval
2018-05-23 17:40 ` Omar Sandoval
2018-05-24  2:20   ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.