* [RFC] blk-mq: clean up the hctx restart
@ 2018-07-31  4:02 Jianchao Wang
  2018-07-31  4:58 ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Jianchao Wang @ 2018-07-31  4:02 UTC (permalink / raw)
  To: axboe, ming.lei, bart.vanassche; +Cc: linux-block, linux-kernel

Currently, we always set SCHED_RESTART whenever there are
requests on hctx->dispatch; then, when a request is completed and
freed, the hctx queues are restarted to avoid IO hangs. This
is unnecessary most of the time. Especially when there are lots of
LUNs attached to one host, the RR restart loop can be very
expensive.

Requests are queued on the hctx dispatch list due to:
 - flush and cpu hotplug cases
 - no driver tag w/ io scheduler
 - LLDD returns BLK_STS_RESOURCE/BLK_STS_DEV_RESOURCE
 - no driver budget
We needn't care about the 1st case; it is a normal insert.
Regarding the 2nd case, if tags are shared, the sbitmap_queue
wakeup hook will work and hctx_may_queue will avoid starvation of
the other queues; otherwise, hctx restart is employed, but it is
done by blk_mq_mark_tag_wait. We need hctx restart for the 3rd and
4th cases to rerun the hctx queue when some LLDD limits are lifted
as in-flight requests complete.
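
For reference, the restart mechanism itself is just a per-hctx state bit
that gets set when a rerun is wanted and tested/cleared when a request
completes. A simplified sketch of the two sides (based on blk-mq-sched.c
of this era; the shared-tags bookkeeping via shared_hctx_restart is left
out here):

static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	/* remember that this hctx wants a rerun once resources free up */
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	/* called on request completion: rerun the hctx iff it was marked */
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
	blk_mq_run_hw_queue(hctx, true);
	return true;
}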

So we remove blk_mq_sched_mark_restart_hctx from
blk_mq_sched_dispatch_requests and only use it when the LLDD returns
BLK_STS_DEV_RESOURCE or there is no driver budget. In addition, some
race conditions are fixed in this patch. For more details, please
refer to the comment in blk_mq_dispatch_rq_list.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
---
 block/blk-mq-sched.c |  7 ++---
 block/blk-mq-sched.h |  1 +
 block/blk-mq.c       | 82 +++++++++++++++++++++++++++++++---------------------
 3 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c..3270ad0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -54,7 +54,7 @@ void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio)
  * Mark a hardware queue as needing a restart. For shared queues, maintain
  * a count of how many hardware queues are marked for restart.
  */
-static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		return;
@@ -202,15 +202,13 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * we only ever dispatch a fraction of the requests available because
 	 * of low device queue depth. Once we pull requests out of the IO
 	 * scheduler, we can no longer merge or sort them. So it's best to
-	 * leave them there for as long as we can. Mark the hw queue as
-	 * needing a restart in that case.
+	 * leave them there for as long as we can.
 	 *
 	 * We want to dispatch from the scheduler if there was nothing
 	 * on the dispatch list or we were able to dispatch from the
 	 * dispatch list.
 	 */
 	if (!list_empty(&rq_list)) {
-		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
 				blk_mq_do_dispatch_sched(hctx);
@@ -417,7 +415,6 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 		 */
 		if (!atomic_read(&queue->shared_hctx_restart))
 			return;
-
 		rcu_read_lock();
 		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
 					   tag_set_list) {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f93..650cbfc 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,6 +32,7 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			    unsigned int hctx_idx);
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);
 
 static inline bool
 blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45df734..b47627a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1086,6 +1086,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool marked = false;
 
 	if (list_empty(list))
 		return false;
@@ -1096,6 +1097,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
+retry:
 	do {
 		struct blk_mq_queue_data bd;
 
@@ -1115,12 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 */
 			if (!blk_mq_mark_tag_wait(&hctx, rq)) {
 				blk_mq_put_dispatch_budget(hctx);
-				/*
-				 * For non-shared tags, the RESTART check
-				 * will suffice.
-				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
-					no_tag = true;
+				no_tag = true;
 				break;
 			}
 		}
@@ -1172,42 +1169,61 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
-		bool needs_restart;
+		bool run_queue = false;
+
+		if (!no_tag && (ret != BLK_STS_RESOURCE) && !marked) {
+			blk_mq_sched_mark_restart_hctx(hctx);
+			marked = true;
+			/*
+			 * .queue_rq will put the budget, so we need to get it again.
+			 */
+			got_budget = false;
+			goto retry;
+		}
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
 		/*
-		 * If SCHED_RESTART was set by the caller of this function and
-		 * it is no longer set that means that it was cleared by another
-		 * thread and hence that a queue rerun is needed.
+		 * The reasons why we reach here could be:
+		 *  - No driver tag
+		 *    For the shared-tags case, the tag wakeup hook will work
+		 *    and hctx_may_queue will avoid starvation of other queues.
+		 *    For the non-shared-tags case, hctx restart is employed.
 		 *
-		 * If 'no_tag' is set, that means that we failed getting
-		 * a driver tag with an I/O scheduler attached. If our dispatch
-		 * waitqueue is no longer active, ensure that we run the queue
-		 * AFTER adding our entries back to the list.
+		 *    Because reqs have not been queued on the hctx->dispatch list
+		 *    when blk_mq_mark_tag_wait runs, both conditions may have
+		 *    triggered with an empty hctx->dispatch and done nothing.
+		 *    So recheck them here and rerun the queue if needed.
 		 *
-		 * If no I/O scheduler has been configured it is possible that
-		 * the hardware queue got stopped and restarted before requests
-		 * were pushed back onto the dispatch list. Rerun the queue to
-		 * avoid starvation. Notes:
-		 * - blk_mq_run_hw_queue() checks whether or not a queue has
-		 *   been stopped before rerunning a queue.
-		 * - Some but not all block drivers stop a queue before
-		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
-		 *   and dm-rq.
-		 *
-		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
-		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 *  - BLK_STS_DEV_RESOURCE/no budget
+		 *    Employ the hctx restart mechanism to restart the queue.
+		 *    The LLDD resources could be released:
+		 *     - before SCHED_RESTART is marked. Retrying the dispatch
+		 *       covers this case.
+		 *     - after SCHED_RESTART is marked but before requests are
+		 *       queued on hctx->dispatch. Recheck SCHED_RESTART and,
+		 *       if it is gone, rerun the hctx.
+		 *  - BLK_STS_RESOURCE
+		 *    Run the queue after a delay to avoid IO stalls.
+		 *    For more details refer to the comment on BLK_STS_DEV_RESOURCE.
 		 */
-		needs_restart = blk_mq_sched_needs_restart(hctx);
-		if (!needs_restart ||
-		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
-			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
-			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+
+		if (no_tag) {
+			if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				run_queue = list_empty_careful(&hctx->dispatch_wait.entry);
+			else
+				run_queue = !blk_mq_sched_needs_restart(hctx);
+			if (run_queue)
+				blk_mq_run_hw_queue(hctx, true);
+		} else if (ret == BLK_STS_RESOURCE) {
+			if (blk_mq_sched_needs_restart(hctx))
+				blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+		} else {
+			if (!blk_mq_sched_needs_restart(hctx))
+				blk_mq_run_hw_queue(hctx, true);
+		}
 
 		return false;
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-07-31  4:02 [RFC] blk-mq: clean up the hctx restart Jianchao Wang
@ 2018-07-31  4:58 ` Ming Lei
  2018-07-31  5:19   ` jianchao.wang
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2018-07-31  4:58 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
> Currently, we will always set SCHED_RESTART whenever there are
> requests in hctx->dispatch, then when request is completed and
> freed the hctx queues will be restarted to avoid IO hang. This
> is unnecessary most of time. Especially when there are lots of
> LUNs attached to one host, the RR restart loop could be very
> expensive.

The big RR restart loop has been killed in the following commit:

commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
Author: Ming Lei <ming.lei@redhat.com>
Date:   Mon Jun 25 19:31:48 2018 +0800

    blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()


Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-07-31  4:58 ` Ming Lei
@ 2018-07-31  5:19   ` jianchao.wang
  2018-07-31  6:16     ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: jianchao.wang @ 2018-07-31  5:19 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

Hi Ming

On 07/31/2018 12:58 PM, Ming Lei wrote:
> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>> Currently, we will always set SCHED_RESTART whenever there are
>> requests in hctx->dispatch, then when request is completed and
>> freed the hctx queues will be restarted to avoid IO hang. This
>> is unnecessary most of time. Especially when there are lots of
>> LUNs attached to one host, the RR restart loop could be very
>> expensive.
> 
> The big RR restart loop has been killed in the following commit:
> 
> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
> Author: Ming Lei <ming.lei@redhat.com>
> Date:   Mon Jun 25 19:31:48 2018 +0800
> 
>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
> 
> 

Oh, sorry, I didn't look into this patch because of its title when I iterated over the
mailing list, therefore I didn't realize the RR restart loop had already been killed. :)

The RR restart loop could ensure fairness when sharing some LLDD resource,
not just avoid IO hangs. Is it OK to kill it totally?

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-07-31  5:19   ` jianchao.wang
@ 2018-07-31  6:16     ` Ming Lei
  2018-08-01  2:17       ` jianchao.wang
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2018-07-31  6:16 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 07/31/2018 12:58 PM, Ming Lei wrote:
> > On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
> >> Currently, we will always set SCHED_RESTART whenever there are
> >> requests in hctx->dispatch, then when request is completed and
> >> freed the hctx queues will be restarted to avoid IO hang. This
> >> is unnecessary most of time. Especially when there are lots of
> >> LUNs attached to one host, the RR restart loop could be very
> >> expensive.
> > 
> > The big RR restart loop has been killed in the following commit:
> > 
> > commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
> > Author: Ming Lei <ming.lei@redhat.com>
> > Date:   Mon Jun 25 19:31:48 2018 +0800
> > 
> >     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
> > 
> > 
> 
> Oh, sorry, I didn't look into this patch due to its title when iterated the mail list,
> therefore I didn't realize the RR restart loop has already been killed. :)
> 
> The RR restart loop could ensure the fairness of sharing some LLDD resource,
> not just avoid IO hung. Is it OK to kill it totally ?

Yeah, it is; also, fairness might be improved a bit by the approach in
commit 97889f9ac24f8d2fc, especially inside the driver tag allocation
algorithm.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-07-31  6:16     ` Ming Lei
@ 2018-08-01  2:17       ` jianchao.wang
  2018-08-01  8:58         ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: jianchao.wang @ 2018-08-01  2:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

Hi Ming

Thanks for your kind response.

On 07/31/2018 02:16 PM, Ming Lei wrote:
> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>> requests in hctx->dispatch, then when request is completed and
>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>> is unnecessary most of time. Especially when there are lots of
>>>> LUNs attached to one host, the RR restart loop could be very
>>>> expensive.
>>>
>>> The big RR restart loop has been killed in the following commit:
>>>
>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>> Author: Ming Lei <ming.lei@redhat.com>
>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>
>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>>
>>>
>>
>> Oh, sorry, I didn't look into this patch due to its title when iterated the mail list,
>> therefore I didn't realize the RR restart loop has already been killed. :)
>>
>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>> not just avoid IO hung. Is it OK to kill it totally ?
> 
> Yeah, it is, also the fairness might be improved a bit by the way in
> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
> algorithem.
> 

Would you mind going into more detail here?

Regarding the driver tag case:
For example:

q_a         q_b        q_c       q_d
hctx0       hctx0      hctx0     hctx0

                 tags

The total number of tags is 32.
All 4 of these queues are active.

So every queue gets a share of 8 tags.

If all 4 queues have used up their 8 tags, they have to wait.

When some of q_a's in-flight requests are completed, tags are freed,
but __sbq_wake_up doesn't necessarily wake up q_a; it may wake up q_b.
However, due to the limit in hctx_may_queue, q_b still cannot get the
tags. The RR restart also will not wake up q_a.
This is unfair to q_a.

When we remove the RR restart fashion, at least q_a will be woken up by
the hctx restart.
Is this the fairness improvement in driver tag allocation that you mean?

Thinking further, it seems that this only works for the case with an io
scheduler. Without an io scheduler, tasks will wait in blk_mq_get_request,
and restarting the hctx will not work there.
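
The limit in hctx_may_queue I mention above is roughly the following
(a simplified sketch of the helper in blk-mq-tag.h, from memory; the
real code also checks BLK_MQ_S_TAG_ACTIVE and a couple of corner cases):

static bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
			   struct sbitmap_queue *bt)
{
	unsigned int depth, users;

	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/* give each active queue roughly its fair share of the tag space */
	depth = max((bt->sb.depth + users - 1) / users, 4U);
	return atomic_read(&hctx->nr_active) < depth;
}

With 32 tags and 4 active queues this yields a per-queue depth of 8, as
in the example above.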

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-01  2:17       ` jianchao.wang
@ 2018-08-01  8:58         ` Ming Lei
  2018-08-01 13:37           ` jianchao.wang
  2018-08-02 15:52             ` Bart Van Assche
  0 siblings, 2 replies; 17+ messages in thread
From: Ming Lei @ 2018-08-01  8:58 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 07/31/2018 02:16 PM, Ming Lei wrote:
> > On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 07/31/2018 12:58 PM, Ming Lei wrote:
> >>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
> >>>> Currently, we will always set SCHED_RESTART whenever there are
> >>>> requests in hctx->dispatch, then when request is completed and
> >>>> freed the hctx queues will be restarted to avoid IO hang. This
> >>>> is unnecessary most of time. Especially when there are lots of
> >>>> LUNs attached to one host, the RR restart loop could be very
> >>>> expensive.
> >>>
> >>> The big RR restart loop has been killed in the following commit:
> >>>
> >>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
> >>> Author: Ming Lei <ming.lei@redhat.com>
> >>> Date:   Mon Jun 25 19:31:48 2018 +0800
> >>>
> >>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
> >>>
> >>>
> >>
> >> Oh, sorry, I didn't look into this patch due to its title when iterated the mail list,
> >> therefore I didn't realize the RR restart loop has already been killed. :)
> >>
> >> The RR restart loop could ensure the fairness of sharing some LLDD resource,
> >> not just avoid IO hung. Is it OK to kill it totally ?
> > 
> > Yeah, it is, also the fairness might be improved a bit by the way in
> > commit 97889f9ac24f8d2fc, especially inside driver tag allocation
> > algorithem.
> > 
> 
> Would you mind to detail more here ?
> 
> Regarding the driver tag case:
> For example:
> 
> q_a         q_b        q_c       q_d
> hctx0       hctx0      hctx0     hctx0
> 
>                  tags
> 
> Total number of tags is 32
> All of these 4 q are active.
> 
> So every q has 8 tags.
> 
> If all of these 4 q have used up their 8 tags, they have to wait.
> 
> When part of the in-flight requests q_a are completed, tags are freed.
> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.

1) in case of an IO scheduler
q_a should be woken up because q_a->hctx0 is added to one wait queue of the tags if
no tag is available, see blk_mq_mark_tag_wait().

2) in case of the none scheduler
q_a should be woken up too, see blk_mq_get_tag().

So I don't understand why you mentioned that q_a can't be woken up.

> However, due to the limits in hctx_may_queue, q_b still cannot get the
> tags. The RR restart also will not wake up q_a.
> This is unfair for q_a.
> 
> When we remove RR restart fashion, at least, the q_a will be waked up by
> the hctx restart.
> Is this the improvement of fairness you said in driver tag allocation ?

I mean the fairness is now covered entirely by the general tag allocation
algorithm, which is sort of FIFO style because of the waitqueue, whereas RR
restart wakes up queues in request-queue order.

> 
> Think further, it seems that it only works for case with io scheduler.
> w/o io scheduler, tasks will wait in blk_mq_get_request. restart hctx will
> not work there.

When one tag is freed, the sbitmap queue will be woken up, and then some of the
pending allocations may be satisfied; this works for both an IO sched and none.
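
The free side is roughly the following (a simplified sketch after
lib/sbitmap.c of this era; the wake-batch accounting is omitted):

void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
			 unsigned int cpu)
{
	/* return the freed bit, then kick the waiters on the wait queues */
	sbitmap_clear_bit(&sbq->sb, nr);
	sbq_wake_up(sbq);		/* may let a blocked allocation proceed */
	*per_cpu_ptr(sbq->alloc_hint, cpu) = nr;	/* hint for the next alloc */
}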

Thanks, 
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-01  8:58         ` Ming Lei
@ 2018-08-01 13:37           ` jianchao.wang
  2018-08-02 10:39             ` Ming Lei
  2018-08-02 15:52             ` Bart Van Assche
  1 sibling, 1 reply; 17+ messages in thread
From: jianchao.wang @ 2018-08-01 13:37 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

Hi Ming

On 08/01/2018 04:58 PM, Ming Lei wrote:
> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your kindly response.
>>
>> On 07/31/2018 02:16 PM, Ming Lei wrote:
>>> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 07/31/2018 12:58 PM, Ming Lei wrote:
>>>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
>>>>>> Currently, we will always set SCHED_RESTART whenever there are
>>>>>> requests in hctx->dispatch, then when request is completed and
>>>>>> freed the hctx queues will be restarted to avoid IO hang. This
>>>>>> is unnecessary most of time. Especially when there are lots of
>>>>>> LUNs attached to one host, the RR restart loop could be very
>>>>>> expensive.
>>>>>
>>>>> The big RR restart loop has been killed in the following commit:
>>>>>
>>>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
>>>>> Author: Ming Lei <ming.lei@redhat.com>
>>>>> Date:   Mon Jun 25 19:31:48 2018 +0800
>>>>>
>>>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
>>>>>
>>>>>
>>>>
>>>> Oh, sorry, I didn't look into this patch due to its title when iterated the mail list,
>>>> therefore I didn't realize the RR restart loop has already been killed. :)
>>>>
>>>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
>>>> not just avoid IO hung. Is it OK to kill it totally ?
>>>
>>> Yeah, it is, also the fairness might be improved a bit by the way in
>>> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
>>> algorithem.
>>>
>>
>> Would you mind to detail more here ?
>>
>> Regarding the driver tag case:
>> For example:
>>
>> q_a         q_b        q_c       q_d
>> hctx0       hctx0      hctx0     hctx0
>>
>>                  tags
>>
>> Total number of tags is 32
>> All of these 4 q are active.
>>
>> So every q has 8 tags.
>>
>> If all of these 4 q have used up their 8 tags, they have to wait.
>>
>> When part of the in-flight requests q_a are completed, tags are freed.
>> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.
> 
> 1) in case of IO scheduler
> q_a should be waken up because q_a->hctx0 is added to one wq of the tags if
> no tag is available, see blk_mq_mark_tag_wait().
> 
> 2) in case of none scheduler
> q_a should be waken up too, see blk_mq_get_tag().
> 
> So I don't understand why you mentioned that q_a can't be waken up.

There are multiple sbq_wait_states in one sbitmap_queue, and __sbq_wake_up
will only wake up the waiters on one of them at a time. Please refer to __sbq_wake_up.
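
To illustrate, the wait-queue selection looks roughly like this (a
trimmed-down sketch after lib/sbitmap.c; the real code updates
wake_index with atomic_cmpxchg and also handles the wake batch count):

static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
{
	int i, wake_index = atomic_read(&sbq->wake_index);

	/* pick the next wait queue (out of SBQ_WAIT_QUEUES) with waiters */
	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
		struct sbq_wait_state *ws = &sbq->ws[wake_index];

		if (waitqueue_active(&ws->wait)) {
			atomic_set(&sbq->wake_index, wake_index);
			return ws;
		}
		wake_index = sbq_index_inc(wake_index);
	}
	return NULL;	/* nobody is waiting */
}

__sbq_wake_up() then wakes waiters on that single ws only, which is why a
completion on q_a is not guaranteed to wake up q_a's own waiter directly.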

> 
>> However, due to the limits in hctx_may_queue, q_b still cannot get the
>> tags. The RR restart also will not wake up q_a.
>> This is unfair for q_a.
>>
>> When we remove RR restart fashion, at least, the q_a will be waked up by
>> the hctx restart.
>> Is this the improvement of fairness you said in driver tag allocation ?
> 
> I mean the fairness is totally covered by the general tag allocation
> algorithm now, which is sort of FIFO style because of waitqueue, but RR
> restart wakes up queue in the order of request queue.

Yes, I got your point.

> 
>>
>> Think further, it seems that it only works for case with io scheduler.
>> w/o io scheduler, tasks will wait in blk_mq_get_request. restart hctx will
>> not work there.
> 
> When one tag is freed, the sbitmap queue will be waken up, then some of
> allocation may be satisfied, this way works for both IO sched and none.
> 
> Thanks, 
> Ming
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-01 13:37           ` jianchao.wang
@ 2018-08-02 10:39             ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2018-08-02 10:39 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, bart.vanassche, linux-block, linux-kernel

On Wed, Aug 01, 2018 at 09:37:08PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 08/01/2018 04:58 PM, Ming Lei wrote:
> > On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> Thanks for your kindly response.
> >>
> >> On 07/31/2018 02:16 PM, Ming Lei wrote:
> >>> On Tue, Jul 31, 2018 at 01:19:42PM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 07/31/2018 12:58 PM, Ming Lei wrote:
> >>>>> On Tue, Jul 31, 2018 at 12:02:15PM +0800, Jianchao Wang wrote:
> >>>>>> Currently, we will always set SCHED_RESTART whenever there are
> >>>>>> requests in hctx->dispatch, then when request is completed and
> >>>>>> freed the hctx queues will be restarted to avoid IO hang. This
> >>>>>> is unnecessary most of time. Especially when there are lots of
> >>>>>> LUNs attached to one host, the RR restart loop could be very
> >>>>>> expensive.
> >>>>>
> >>>>> The big RR restart loop has been killed in the following commit:
> >>>>>
> >>>>> commit 97889f9ac24f8d2fc8e703ea7f80c162bab10d4d
> >>>>> Author: Ming Lei <ming.lei@redhat.com>
> >>>>> Date:   Mon Jun 25 19:31:48 2018 +0800
> >>>>>
> >>>>>     blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
> >>>>>
> >>>>>
> >>>>
> >>>> Oh, sorry, I didn't look into this patch due to its title when iterated the mail list,
> >>>> therefore I didn't realize the RR restart loop has already been killed. :)
> >>>>
> >>>> The RR restart loop could ensure the fairness of sharing some LLDD resource,
> >>>> not just avoid IO hung. Is it OK to kill it totally ?
> >>>
> >>> Yeah, it is, also the fairness might be improved a bit by the way in
> >>> commit 97889f9ac24f8d2fc, especially inside driver tag allocation
> >>> algorithem.
> >>>
> >>
> >> Would you mind to detail more here ?
> >>
> >> Regarding the driver tag case:
> >> For example:
> >>
> >> q_a         q_b        q_c       q_d
> >> hctx0       hctx0      hctx0     hctx0
> >>
> >>                  tags
> >>
> >> Total number of tags is 32
> >> All of these 4 q are active.
> >>
> >> So every q has 8 tags.
> >>
> >> If all of these 4 q have used up their 8 tags, they have to wait.
> >>
> >> When part of the in-flight requests q_a are completed, tags are freed.
> >> but the __sbq_wake_up doesn't wake up the q_a, it may wake up q_b.
> > 
> > 1) in case of IO scheduler
> > q_a should be waken up because q_a->hctx0 is added to one wq of the tags if
> > no tag is available, see blk_mq_mark_tag_wait().
> > 
> > 2) in case of none scheduler
> > q_a should be waken up too, see blk_mq_get_tag().
> > 
> > So I don't understand why you mentioned that q_a can't be waken up.
> 
> There are multiple sbq_wait_states in one sbitmap_queue and __sbq_wake_up
> will only wake up the waiters on one of them one time. Please refer to __sbq_wake_up.

Yes, the multiple wqs are woken up in RR style, which is still fair, generally
speaking. And there is no such issue of q_a never being woken up when a request
is completed on this queue, is there?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
@ 2018-08-02 15:52             ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-08-02 15:52 UTC (permalink / raw)
  To: ming.lei, jianchao.w.wang; +Cc: linux-kernel, linux-block, axboe

On Wed, 2018-08-01 at 16:58 +0800, Ming Lei wrote:
> On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
> > However, due to the limits in hctx_may_queue, q_b still cannot get the
> > tags. The RR restart also will not wake up q_a.
> > This is unfair for q_a.
> > 
> > When we remove RR restart fashion, at least, the q_a will be waked up by
> > the hctx restart.
> > Is this the improvement of fairness you said in driver tag allocation ?
> 
> I mean the fairness is totally covered by the general tag allocation
> algorithm now, which is sort of FIFO style because of waitqueue, but RR
> restart wakes up queue in the order of request queue.

From sbitmap.h:

#define SBQ_WAIT_QUEUES 8

What do you think is the effect of your patch if more than eight LUNs are
active and the SCSI queue is full?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-02 15:52             ` Bart Van Assche
  (?)
@ 2018-08-02 16:58             ` Ming Lei
  2018-08-02 17:08                 ` Bart Van Assche
  -1 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2018-08-02 16:58 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: jianchao.w.wang, linux-kernel, linux-block, axboe

On Thu, Aug 02, 2018 at 03:52:00PM +0000, Bart Van Assche wrote:
> On Wed, 2018-08-01 at 16:58 +0800, Ming Lei wrote:
> > On Wed, Aug 01, 2018 at 10:17:30AM +0800, jianchao.wang wrote:
> > > However, due to the limits in hctx_may_queue, q_b still cannot get the
> > > tags. The RR restart also will not wake up q_a.
> > > This is unfair for q_a.
> > > 
> > > When we remove RR restart fashion, at least, the q_a will be waked up by
> > > the hctx restart.
> > > Is this the improvement of fairness you said in driver tag allocation ?
> > 
> > I mean the fairness is totally covered by the general tag allocation
> > algorithm now, which is sort of FIFO style because of waitqueue, but RR
> > restart wakes up queue in the order of request queue.
> 
> From sbitmap.h:
> 
> #define SBQ_WAIT_QUEUES 8
> 
> What do you think is the effect of your patch if more than eight LUNs are
> active and the SCSI queue is full?

Jens introduced multiple wait queues to scatter the atomic operations over
multiple counters; in theory, the approach is easy to understand if you just
treat them as one whole queue in concept.

And the situations you mentioned are nothing special compared with normal cases
or thousands of LUNs. Just a batch of queues are woken up from one
single wait queue (sbq_wait_state), and inside each wait queue, queues
are actually handled in FIFO order.

Or what is your expected ideal behaviour regarding fairness?

thanks, 
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
@ 2018-08-02 17:08                 ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-08-02 17:08 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-kernel, linux-block, jianchao.w.wang, axboe

On Fri, 2018-08-03 at 00:58 +0800, Ming Lei wrote:
> And about the situations you mentioned, no any special as normal cases
> or thousands of LUNs. Just a batch of queues are waken up from one
> single wait queue(sbq_wait_state), and inside each wait queue, queues
> are handled actually in FIFO order.
> 
> Or what is your expected ideal behaviour about fairness?

Hello Ming,

What I expect is that, if the number of LUNs is really large, all LUNs are treated
equally. Unless someone can set up a test that demonstrates that this is still
the case for commit 97889f9ac24f ("blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()"), I will assume that that commit breaks fairness.

Bart.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-02 17:08                 ` Bart Van Assche
  (?)
@ 2018-08-02 17:17                 ` Ming Lei
  2018-08-02 17:24                     ` Bart Van Assche
  -1 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2018-08-02 17:17 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-block, jianchao.w.wang, axboe

On Thu, Aug 02, 2018 at 05:08:55PM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 00:58 +0800, Ming Lei wrote:
> > And about the situations you mentioned, no any special as normal cases
> > or thousands of LUNs. Just a batch of queues are waken up from one
> > single wait queue(sbq_wait_state), and inside each wait queue, queues
> > are handled actually in FIFO order.
> > 
> > Or what is your expected ideal behaviour about fairness?
> 
> Hello Ming,
> 
> What I expect is if the number of LUNs is really large that all LUNs are treated
> equally. Unless someone can set up a test that demonstrates that this is still

Idle LUNs shouldn't be treated the same as other LUNs which have I/O
to serve.

Also, as I mentioned, the original RR style isn't actually better than the
new way, since we now handle queues in a sort of FIFO style, thanks to the
wait queue.

Not to mention that a lot of CPU is consumed unnecessarily iterating over
all queues even though there is only one active queue; is that fair from
a system point of view?
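
For reference, the loop being discussed looked roughly like this before
commit 97889f9ac24f (a heavily simplified sketch; see block/blk-mq-sched.c
of that time for the real locking and restart bookkeeping):

void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
{
	struct request_queue *const queue = hctx->queue;
	struct blk_mq_tag_set *const set = queue->tag_set;
	struct request_queue *q;
	struct blk_mq_hw_ctx *hctx2;
	unsigned int i;

	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
		blk_mq_sched_restart_hctx(hctx);
		return;
	}

	/* nothing is marked anywhere, so don't bother walking the queues */
	if (!atomic_read(&queue->shared_hctx_restart))
		return;

	/* walk every request queue sharing the tag set, round-robin */
	rcu_read_lock();
	list_for_each_entry_rcu_rr(q, queue, &set->tag_list, tag_set_list) {
		queue_for_each_hw_ctx(q, hctx2, i)
			if (blk_mq_sched_restart_hctx(hctx2))
				goto done;
	}
done:
	rcu_read_unlock();
}

With thousands of mostly idle LUNs, that walk is where the CPU time goes.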

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
@ 2018-08-02 17:24                     ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-08-02 17:24 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-kernel, linux-block, jianchao.w.wang, axboe

On Fri, 2018-08-03 at 01:17 +0800, Ming Lei wrote:
> Not mentioning big CPU utilization is consumed unnecessarily for iterating
> over all queues even though there is only one active queue, is this fair from
> system view?

I hope that someone will come up some day with a better solution than the
list_for_each_entry_rcu_rr() loop. But as long as such a solution has not yet
been found that loop should stay in order to preserve fair handling of large
numbers of LUNs.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] blk-mq: clean up the hctx restart
  2018-08-02 17:24                     ` Bart Van Assche
  (?)
@ 2018-08-03  0:35                     ` Ming Lei
  -1 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2018-08-03  0:35 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-block, jianchao.w.wang, axboe

On Thu, Aug 02, 2018 at 05:24:15PM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 01:17 +0800, Ming Lei wrote:
> > Not mentioning big CPU utilization is consumed unnecessarily for iterating
> > over all queues even though there is only one active queue, is this fair from
> > system view?
> 
> I hope that someone will come up some day with a better solution than the
> list_for_each_entry_rcu_rr() loop. But as long as such a solution has not yet
> been found that loop should stay in order to preserve fair handling of large
> numbers of LUNs.

Again, list_for_each_entry_rcu_rr() isn't fair to a small number of active
queues in this case, because it decreases IOPS a lot and consumes a lot of
CPU just to check thousands of queues unnecessarily, even though 99%
of them are idle.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread
