* [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
@ 2017-10-18 10:22 Roman Pen
  2017-10-19 17:47   ` Bart Van Assche
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Roman Pen @ 2017-10-18 10:22 UTC (permalink / raw)
  Cc: Roman Pen, linux-kernel, linux-block, Bart Van Assche,
	Christoph Hellwig, Hannes Reinecke, Jens Axboe

Hi all,

the patch below fixes queue stalling when a shared hctx is marked for
restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
stays zero.  The root cause is that hctxs are shared between queues, while
'shared_hctx_restart' belongs to a particular queue, which in fact may not
need to be restarted; thus we return from blk_mq_sched_restart() and leave
the shared hctx of another queue never restarted.

The fix is to make the shared_hctx_restart counter belong not to the queue
but to the tags, so that the counter reflects the real number of shared
hctxs that need to be restarted.

During the tests 1 hctx (set->nr_hw_queues) was used, and all stalled
requests were observed in the dd->fifo_list of the mq-deadline scheduler.

A possible sequence of events (a simplified sketch of the restart path
follows the list):

1. Request A of queue A is inserted into dd->fifo_list of the scheduler.

2. Request B of queue A bypasses the scheduler and goes directly to
   hctx->dispatch.

3. Request C of queue B is inserted.

4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
   empty (request B is in the list), the hctx is only marked for the next
   restart and request A is left in the list (see the comment "So it's best
   to leave them there for as long as we can. Mark the hw queue as needing
   a restart in that case." in blk-mq-sched.c).

5. Eventually request B is completed/freed and blk_mq_sched_restart() is
   called, but by chance the hctx from queue B is chosen for restart and
   request C gets a chance to be dispatched.

6. Eventually request C is completed/freed and blk_mq_sched_restart() is
   called, but shared_hctx_restart for queue B is zero, so we return without
   attempting to restart the hctx from queue A; thus request A is stuck
   forever.
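
The early return in step 6 corresponds to the check sketched below.  This
is a condensed paraphrase of the v4.14 restart path (a sketch, not the
exact source):

	/* paraphrased sketch of blk_mq_sched_restart() before this patch */
	void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
	{
		struct request_queue *q = hctx->queue;

		if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
			blk_mq_sched_restart_hctx(hctx);
			return;
		}

		/*
		 * Step 6: request C completed on queue B, whose private
		 * counter is 0, so we bail out here, although a hctx of
		 * queue A (sharing the same tags) is still marked with
		 * BLK_MQ_S_SCHED_RESTART.
		 */
		if (!atomic_read(&q->shared_hctx_restart))
			return;

		/*
		 * ... otherwise: a round-robin walk over every queue
		 * sharing the tag set, restarting the first marked
		 * hctx found ...
		 */
	}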

But queue stalling is not the only problem with blk_mq_sched_restart().
My tests show that the loops through all queues and hctxs can be very
costly, even with the shared_hctx_restart counter, which aims to fix the
performance issue.  For my tests I create 128 devices with 64 hctx each,
all sharing the same tag set.
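
For reference, the shape of such a fio run is sketched below; the options
(block size, queue depth, device names) are illustrative assumptions, not
the exact jobfile:

  # hypothetical invocation; in the real test each of the 128 jobs
  # targeted its own device, all devices sharing one tag set
  fio --name=shared-tags --ioengine=libaio --direct=1 --rw=randrw \
      --bs=4k --iodepth=32 --runtime=10 --time_based --group_reporting \
      --numjobs=128 --filename=/dev/ibnbd0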

The following is the fio and ftrace output for the v4.14-rc4 kernel:

 READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, mint=10058msec, maxt=10058msec
WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit     Time            Avg             s^2
  --------                  ---     ----            ---             ---
  blk_mq_sched_restart     16347    9540759 us      583.639 us      8804801 us
  blk_mq_sched_restart      7884    6073471 us      770.354 us      8780054 us
  blk_mq_sched_restart     14176    7586794 us      535.185 us      2822731 us
  blk_mq_sched_restart      7843    6205435 us      791.206 us      12424960 us
  blk_mq_sched_restart      1490    4786107 us      3212.153 us     1949753 us
  blk_mq_sched_restart      7892    6039311 us      765.244 us      2994627 us
  blk_mq_sched_restart     15382    7511126 us      488.306 us      3090912 us
  [cut]
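
The trace_stat output above comes from the ftrace function profiler; a
minimal collection sequence (assuming CONFIG_FUNCTION_PROFILER=y) looks
roughly like the following, with one trace_stat/functionN file per CPU,
hence one row per CPU above:

  echo blk_mq_sched_restart > /sys/kernel/debug/tracing/set_ftrace_filter
  echo 1 > /sys/kernel/debug/tracing/function_profile_enabled
  # ... run the fio workload ...
  cat /sys/kernel/debug/tracing/trace_stat/function* | grep blk_mq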

And here are the results with two patches reverted:
   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

 READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, mint=10032msec, maxt=10032msec
WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit      Time            Avg             s^2
  --------                  ---      ----            ---             ---
  blk_mq_sched_restart      50699    8802.349 us     0.173 us        121.771 us
  blk_mq_sched_restart      50362    8740.470 us     0.173 us        161.494 us
  blk_mq_sched_restart      50402    9066.337 us     0.179 us        113.009 us
  blk_mq_sched_restart      50104    9366.197 us     0.186 us        188.645 us
  blk_mq_sched_restart      50375    9317.727 us     0.184 us        54.218 us
  blk_mq_sched_restart      50136    9311.657 us     0.185 us        446.790 us
  blk_mq_sched_restart      50103    9179.625 us     0.183 us        114.472 us
  [cut]

The timings and stdevs are terrible, which leads to a significant
difference: 570MB/s vs 1280MB/s.

This is an RFC since the current patch fixes the queue stalling, but the
performance issue remains, and it is not clear to me whether it is better
to improve commit 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag
sets are shared") by making percpu restart lists (to avoid the looping
and to dequeue a hctx immediately; see the sketch below), or to revert it
(frankly, I did not notice any difference on a small number of devices
and hctxs, where the looping issue does not have much impact).
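
To make the percpu alternative concrete, here is a rough sketch; the
'restart_entry' field and both helpers are hypothetical names, and this
is an untested illustration, not a proposed implementation:

	struct restart_list {
		spinlock_t	 lock;
		struct list_head list;
	};

	/* one such list per CPU, shared by all hctxs of a tag set */

	static void mark_restart(struct blk_mq_hw_ctx *hctx,
				 struct restart_list __percpu *rlists)
	{
		struct restart_list *rl = raw_cpu_ptr(rlists);

		if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
			spin_lock(&rl->lock);
			/* 'restart_entry' would be a new list_head in hctx */
			list_add_tail(&hctx->restart_entry, &rl->list);
			spin_unlock(&rl->lock);
		}
	}

	static struct blk_mq_hw_ctx *pop_restart(struct restart_list __percpu *rlists)
	{
		struct restart_list *rl = raw_cpu_ptr(rlists);
		struct blk_mq_hw_ctx *hctx = NULL;

		spin_lock(&rl->lock);
		if (!list_empty(&rl->list)) {
			hctx = list_first_entry(&rl->list, struct blk_mq_hw_ctx,
						restart_entry);
			list_del_init(&hctx->restart_entry);
		}
		spin_unlock(&rl->lock);
		/* caller clears BLK_MQ_S_SCHED_RESTART and runs the hw queue */
		return hctx;
	}

A completion would then dequeue exactly one marked hctx in O(1) instead of
looping over all queues and hctxs of the tag set.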

--
Roman

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
---
 block/blk-mq-sched.c   | 10 +++++-----
 block/blk-mq-tag.c     |  1 +
 block/blk-mq-tag.h     |  1 +
 block/blk-mq.c         |  4 ++--
 include/linux/blkdev.h |  2 --
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..a19a7f275173 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -60,10 +60,10 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_inc(&q->shared_hctx_restart);
+			atomic_inc(&tags->shared_hctx_restart);
 	} else
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
@@ -74,10 +74,10 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return false;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
+			atomic_dec(&tags->shared_hctx_restart);
 	} else
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
@@ -312,7 +312,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 		 * If this is 0, then we know that no hardware queues
 		 * have RESTART marked. We're done.
 		 */
-		if (!atomic_read(&queue->shared_hctx_restart))
+		if (!atomic_read(&tags->shared_hctx_restart))
 			return;
 
 		rcu_read_lock();
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..598f8e0095ff 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -385,6 +385,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	atomic_set(&tags->shared_hctx_restart, 0);
 
 	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
 }
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5cb51e53cc03..adf05c8811cd 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -11,6 +11,7 @@ struct blk_mq_tags {
 	unsigned int nr_reserved_tags;
 
 	atomic_t active_queues;
+	atomic_t shared_hctx_restart;
 
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..7639f978ea2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2120,11 +2120,11 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_inc(&q->shared_hctx_restart);
+				atomic_inc(&hctx->tags->shared_hctx_restart);
 			hctx->flags |= BLK_MQ_F_TAG_SHARED;
 		} else {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_dec(&q->shared_hctx_restart);
+				atomic_dec(&hctx->tags->shared_hctx_restart);
 			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
 		}
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2a5d52fa90f5..3852a9ea87d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,8 +393,6 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
-	atomic_t		shared_hctx_restart;
-
 	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;
 
-- 
2.13.1


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
@ 2017-10-19 17:47   ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2017-10-19 17:47 UTC (permalink / raw)
  To: roman.penyaev
  Cc: Bart Van Assche, linux-kernel, linux-block, hare, axboe, hch

On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for
> restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
> stays zero.  The root cause is that hctxs are shared between queues, while
> 'shared_hctx_restart' belongs to a particular queue, which in fact may not
> need to be restarted; thus we return from blk_mq_sched_restart() and leave
> the shared hctx of another queue never restarted.
> 
> The fix is to make the shared_hctx_restart counter belong not to the queue
> but to the tags, so that the counter reflects the real number of shared
> hctxs that need to be restarted.

Hello Roman,

The patch you posted looks fine to me but seeing this patch and the patch
description makes me wonder why this had not been noticed before. Are you
perhaps using a block driver that returns BLK_STS_RESOURCE more often than
other block drivers? Did you perhaps run into this with the Infiniband
network block device (IBNBD) driver? No matter what driver triggered this,
I think this bug should be fixed.

Bart.


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
  2017-10-19 17:47   ` Bart Van Assche
  (?)
@ 2017-10-20  9:39   ` Roman Penyaev
  2017-10-20 20:05       ` Bart Van Assche
  -1 siblings, 1 reply; 13+ messages in thread
From: Roman Penyaev @ 2017-10-20  9:39 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-block, hare, axboe, hch

Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
>> the patch below fixes queue stalling when a shared hctx is marked for
>> restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
>> stays zero.  The root cause is that hctxs are shared between queues, while
>> 'shared_hctx_restart' belongs to a particular queue, which in fact may not
>> need to be restarted; thus we return from blk_mq_sched_restart() and leave
>> the shared hctx of another queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the queue
>> but to the tags, so that the counter reflects the real number of shared
>> hctxs that need to be restarted.
>
> Hello Roman,
>
> The patch you posted looks fine to me but seeing this patch and the patch
> description makes me wonder why this had not been noticed before.

This is a good question, which I could not answer.  I tried to simulate the
same behaviour (completion timings, completion pinning, number of submission
queues, shared tags, etc.) on null_blk, but what I see is that
*_sched_restart() never observes 'shared_hctx_restart', literally never
(I added a counter where we take that path and start looking for a hctx to
restart, and the counter stays 0).

That made me nervous, and I eventually gave up.  After some time I want to
return to this and try to reproduce the problem on something else, say nvme.
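
For reference, a null_blk setup along those lines would be something like
the line below (parameter names are from my memory of null_blk around
v4.14 and are worth double-checking):

  modprobe null_blk queue_mode=2 nr_devices=128 submit_queues=1 shared_tags=1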

> Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
> often than other block drivers? Did you perhaps run into this with the
> Infiniband network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I ran with the mq scheduler, shared
tags and 1 hctx for each queue (block device), thus I never run out of
internal tags and never return BLK_STS_RESOURCE.

Indeed, unmodified IBNBD does internal tag management.  This was needed
because each queue (block device) was created with an hctx number
(nr_hw_queues) equal to the number of CPUs on the system, but a blk-mq tag
set is shared only between hctxs, not globally, which led to the need to
return BLK_STS_RESOURCE and to restart queues.

But with the mq scheduler the situation changed: 1 hctx with shared tags
can be specified for all hundreds of devices without any performance impact.

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows both problems: request stalling and slow loops inside
blk_mq_sched_restart().

> No matter what driver triggered this, I think this bug should be fixed.

Yes, the queue stalling can be easily fixed.  I can resend the current patch
with a shorter description which targets only this particular bug, if no one
has objections, comments, etc.

But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared"), I want to ask what the original
problem was that you attempted to fix.  Likely I am missing some test
scenario which would be great to know about.

--
Roman


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
@ 2017-10-20 20:05       ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2017-10-20 20:05 UTC (permalink / raw)
  To: roman.penyaev; +Cc: hch, linux-kernel, linux-block, hare, axboe

On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
> But what bothers me is these looong loops inside blk_mq_sched_restart(),
> and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
> a single queue if tag sets are shared"), I want to ask what the original
> problem was that you attempted to fix.  Likely I am missing some test
> scenario which would be great to know about.

Long loops? How many queues share the same tag set on your setup? How many
hardware queues does your block driver create per request queue?

Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
SCSI and dm-mq drivers.

Bart.


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
  2017-10-20 20:05       ` Bart Van Assche
  (?)
@ 2017-10-23 15:12       ` Roman Penyaev
  2017-11-01 16:50           ` Bart Van Assche
  -1 siblings, 1 reply; 13+ messages in thread
From: Roman Penyaev @ 2017-10-23 15:12 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-kernel, linux-block, hare, axboe

On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche
<Bart.VanAssche@wdc.com> wrote:
> On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
>> But what bothers me is these looong loops inside blk_mq_sched_restart(),
>> and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
>> a single queue if tag sets are shared"), I want to ask what the original
>> problem was that you attempted to fix.  Likely I am missing some test
>> scenario which would be great to know about.
>
> Long loops? How many queues share the same tag set on your setup? How many
> hardware queues does your block driver create per request queue?

Yeah, OK, my mistake.  I should have split the two issues instead of
describing everything in one go in the first email.  So, take a look.

For my tests I create 128 queues (devices) with 64 hctx each, all queues
sharing the same tag set, and then I start 128 fio jobs (1 job per queue).

The following is the fio and ftrace output for the v4.14-rc4 kernel
(without any changes):

 READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, mint=10058msec, maxt=10058msec
WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit     Time            Avg             s^2
  --------                  ---     ----            ---             ---
  blk_mq_sched_restart     16347    9540759 us      583.639 us      8804801 us
  blk_mq_sched_restart      7884    6073471 us      770.354 us      8780054 us
  blk_mq_sched_restart     14176    7586794 us      535.185 us      2822731 us
  blk_mq_sched_restart      7843    6205435 us      791.206 us      12424960 us
  blk_mq_sched_restart      1490    4786107 us      3212.153 us     1949753 us    <<< !!! ~3 ms on average !!!
  blk_mq_sched_restart      7892    6039311 us      765.244 us      2994627 us
  blk_mq_sched_restart     15382    7511126 us      488.306 us      3090912 us
  [cut]


And here are results with two patches reverted:

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

 READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, mint=10032msec, maxt=10032msec
WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit      Time            Avg             s^2
  --------                  ---      ----            ---             ---
  blk_mq_sched_restart      50699    8802.349 us     0.173 us        121.771 us
  blk_mq_sched_restart      50362    8740.470 us     0.173 us        161.494 us
  blk_mq_sched_restart      50402    9066.337 us     0.179 us        113.009 us
  blk_mq_sched_restart      50104    9366.197 us     0.186 us        188.645 us
  blk_mq_sched_restart      50375    9317.727 us     0.184 us        54.218 us
  blk_mq_sched_restart      50136    9311.657 us     0.185 us        446.790 us
  blk_mq_sched_restart      50103    9179.625 us     0.183 us        114.472 us
  [cut]

The difference is significant: 570MB/s vs 1280MB/s.  E.g. one CPU spent
3 ms on average iterating over all queues and hctxs in order to find a
hctx to restart.  In total the CPUs spent *seconds* in the loop.  That
seems incredibly long.

> Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
> SCSI and dm-mq drivers.

You mean fairness? (some hctxs get fewer chances to be restarted).
That's why you restart them in RR fashion, right?

In IBNBD I also do hctx restarts in RR fashion, and for that I put each hctx
which needs to be restarted on a separate percpu list.  Probably it makes
sense to do the same here?

--
Roman


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
       [not found] ` <20171020133943.GA31275@ming.t460p>
@ 2017-10-23 16:12   ` Roman Penyaev
  2017-10-24 10:04     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Roman Penyaev @ 2017-10-23 16:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, linux-block, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke, Jens Axboe

Hi Ming,

On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
>> Hi all,
>>
>> the patch below fixes queue stalling when a shared hctx is marked for
>> restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
>> stays zero.  The root cause is that hctxs are shared between queues, while
>> 'shared_hctx_restart' belongs to a particular queue, which in fact may not
>> need to be restarted; thus we return from blk_mq_sched_restart() and leave
>> the shared hctx of another queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the queue
>> but to the tags, so that the counter reflects the real number of shared
>> hctxs that need to be restarted.
>>
>> During the tests 1 hctx (set->nr_hw_queues) was used, and all stalled
>> requests were observed in the dd->fifo_list of the mq-deadline scheduler.
>>
>> A possible sequence of events:
>>
>> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
>>
>> 2. Request B of queue A bypasses the scheduler and goes directly to
>>    hctx->dispatch.
>>
>> 3. Request C of queue B is inserted.
>>
>> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
>>    empty (request B is in the list), the hctx is only marked for the next
>>    restart and request A is left in the list (see the comment "So it's best
>>    to leave them there for as long as we can. Mark the hw queue as needing
>>    a restart in that case." in blk-mq-sched.c).
>>
>> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
>>    called, but by chance the hctx from queue B is chosen for restart and
>>    request C gets a chance to be dispatched.
>>
>> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
>>    called, but shared_hctx_restart for queue B is zero, so we return without
>>    attempting to restart the hctx from queue A; thus request A is stuck
>>    forever.
>>
>> But queue stalling is not the only problem with blk_mq_sched_restart().
>> My tests show that the loops through all queues and hctxs can be very
>> costly, even with the shared_hctx_restart counter, which aims to fix the
>> performance issue.  For my tests I create 128 devices with 64 hctx each,
>> all sharing the same tag set.
>
> Hi Roman,
>
> I also find the performance issue with RESTART for TAG_SHARED.
>
> But from my analysis, RESTART isn't needed for TAG_SHARED
> because SCSI-MQ has handled the RESTART by itself already, so
> could you test the patch in the following link posted days ago to
> see if it fixes your issue?

I can say without any testing: it fixes all the issues :) You've
reverted

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

with one major difference: you do not handle shared tags in a special
way and restart only the requested hctx, instead of iterating over all hctxs
in a queue.

Firstly I have to say that the queue stalling issue (#1) and the performance
issue (#2) were observed on our in-house RDMA driver IBNBD:
https://lwn.net/Articles/718181/, and I have never tried to reproduce
them on SCSI-MQ.

Then, your patch breaks the RR restarts which were introduced by commit
6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
The basic idea of that patch seems nice, but because of the possibly large
number of queues and hctxs the patch requires a reimplementation: eventually
you should get a fast hctx restart, but in RR fashion.  According to my
understanding that does not contradict your patch.

--
Roman


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
  2017-10-23 16:12   ` Roman Penyaev
@ 2017-10-24 10:04     ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2017-10-24 10:04 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: linux-kernel, linux-block, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke, Jens Axboe

On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
> Hi Ming,
> 
> On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
> >> Hi all,
> >>
> >> the patch below fixes queue stalling when a shared hctx is marked for
> >> restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
> >> stays zero.  The root cause is that hctxs are shared between queues, while
> >> 'shared_hctx_restart' belongs to a particular queue, which in fact may not
> >> need to be restarted; thus we return from blk_mq_sched_restart() and leave
> >> the shared hctx of another queue never restarted.
> >>
> >> The fix is to make the shared_hctx_restart counter belong not to the queue
> >> but to the tags, so that the counter reflects the real number of shared
> >> hctxs that need to be restarted.
> >>
> >> During the tests 1 hctx (set->nr_hw_queues) was used, and all stalled
> >> requests were observed in the dd->fifo_list of the mq-deadline scheduler.
> >>
> >> A possible sequence of events:
> >>
> >> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
> >>
> >> 2. Request B of queue A bypasses the scheduler and goes directly to
> >>    hctx->dispatch.
> >>
> >> 3. Request C of queue B is inserted.
> >>
> >> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
> >>    empty (request B is in the list), the hctx is only marked for the next
> >>    restart and request A is left in the list (see the comment "So it's best
> >>    to leave them there for as long as we can. Mark the hw queue as needing
> >>    a restart in that case." in blk-mq-sched.c).
> >>
> >> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
> >>    called, but by chance the hctx from queue B is chosen for restart and
> >>    request C gets a chance to be dispatched.
> >>
> >> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
> >>    called, but shared_hctx_restart for queue B is zero, so we return without
> >>    attempting to restart the hctx from queue A; thus request A is stuck
> >>    forever.
> >>
> >> But queue stalling is not the only problem with blk_mq_sched_restart().
> >> My tests show that the loops through all queues and hctxs can be very
> >> costly, even with the shared_hctx_restart counter, which aims to fix the
> >> performance issue.  For my tests I create 128 devices with 64 hctx each,
> >> all sharing the same tag set.
> >
> > Hi Roman,
> >
> > I also find the performance issue with RESTART for TAG_SHARED.
> >
> > But from my analysis, RESTART isn't needed for TAG_SHARED
> > because SCSI-MQ has handled the RESTART by itself already, so
> > could you test the patch in the following link posted days ago to
> > see if it fixes your issue?
> 
> I can say without any testing: it fixes all the issues :) You've
> reverted
> 
>    8e8320c9315c ("blk-mq: fix performance regression with shared tags")
>    6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
> 
> with one major difference: you do not handle shared tags in a special
> way and restart only the requested hctx, instead of iterating over all hctxs
> in a queue.

Hi Roman,

Yeah, that is unnecessary, as I explained in detail in the commit log, and
it introduces lots of overhead; I can see IOPS improved by 20%~30% in my
SCSI_DEBUG test (only 8 LUNs) by removing the blk-mq restart for TAG_SHARED.

Also it isn't needed for BLK_STS_RESOURCE returned from .get_budget().

I will post V3 soon.

> 
> Firstly I have to say that the queue stalling issue (#1) and the performance
> issue (#2) were observed on our in-house RDMA driver IBNBD:
> https://lwn.net/Articles/718181/, and I have never tried to reproduce
> them on SCSI-MQ.

OK, got it; then you can handle it in SCSI-MQ's way, since this way has been
used by non-MQ for a long time.  Or you need to do nothing if your driver is
SCSI based.

> 
> Then, your patch breaks the RR restarts which were introduced by commit
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
> The basic idea of that patch seems nice, but because of the possibly large
> number of queues and hctxs the patch requires a reimplementation: eventually
> you should get a fast hctx restart, but in RR fashion.  According to my
> understanding that does not contradict your patch.

Firstly, this way has been run/verified for a long time in non-mq, and no
one complained about it before.

Secondly, please see scsi_target_queue_ready() and scsi_host_queue_ready():
the sdev is added to the 'starved_list' in FIFO style (paraphrased below),
which is still fair.

So I don't think it is an issue to remove the RR restarts.
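
The FIFO pattern referenced above, paraphrased (a condensed illustration
from memory, not the exact SCSI source):

	/* out of host resources: park the device at the tail, once */
	if (list_empty(&sdev->starved_entry))
		list_add_tail(&sdev->starved_entry, &shost->starved_list);

	/* on completion: serve devices from the head, i.e. in the order
	 * they starved (FIFO, hence fair) */
	sdev = list_first_entry(&shost->starved_list,
				struct scsi_device, starved_entry);
	list_del_init(&sdev->starved_entry);
	blk_mq_run_hw_queues(sdev->request_queue, true);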

Thanks,
Ming


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
@ 2017-11-01 16:50           ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2017-11-01 16:50 UTC (permalink / raw)
  To: roman.penyaev; +Cc: hch, linux-kernel, linux-block, hare, axboe

On Mon, 2017-10-23 at 17:12 +0200, Roman Penyaev wrote:
> On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche wrote:
> > Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
> > SCSI and dm-mq drivers.
> 
> You mean fairness? (some hctxs get fewer chances to be restarted).
> That's why you restart them in RR fashion, right?

Commit 6d8c6c0f97ad not only improved fairness but also fixed a lockup
scenario: after a driver tag has been freed, not only must the hw queues of
the request queue for which a request completed be reexamined, but also the
hw queues associated with the other request queues that share the same tag
set.

Bart.


* Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
@ 2017-11-07 15:54   ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2017-11-07 15:54 UTC (permalink / raw)
  To: roman.penyaev
  Cc: Bart Van Assche, linux-kernel, linux-block, hare, axboe, hch

On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for
> restart (the BLK_MQ_S_SCHED_RESTART bit is set) but q->shared_hctx_restart
> stays zero.

Since I have been able to reproduce this queue stall with an upstream driver
(ib_srp), and since I have also been able to verify that this patch fixes
that stall:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>


Thread overview: 13+ messages

2017-10-18 10:22 [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart Roman Pen
2017-10-19 17:47 ` Bart Van Assche
2017-10-20  9:39   ` Roman Penyaev
2017-10-20 20:05     ` Bart Van Assche
2017-10-23 15:12       ` Roman Penyaev
2017-11-01 16:50         ` Bart Van Assche
     [not found] ` <20171020133943.GA31275@ming.t460p>
2017-10-23 16:12   ` Roman Penyaev
2017-10-24 10:04     ` Ming Lei
2017-11-07 15:54 ` Bart Van Assche
