* [PATCH] queue stall with blk-mq-sched
@ 2017-01-24 15:54 Hannes Reinecke
  2017-01-24 16:03 ` Jens Axboe
  2017-01-24 16:09 ` Jens Axboe
  0 siblings, 2 replies; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-24 15:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

[-- Attachment #1: Type: text/plain, Size: 853 bytes --]

Hi Jens,

I'm trying to debug a queue stall with your blk-mq-sched branch; with my
latest mpt3sas patches fio stalls almost immediately after starting a
sequential read :-(

I've debugged this and come up with the attached patch; we need to
restart waiters with blk_mq_tag_idle() after completing a tag.
We're already calling blk_mq_tag_busy() when fetching a tag, so I think
calling blk_mq_tag_idle() is required when retiring a tag.

However, even with the attached patch I'm seeing some queue stalls;
looks like they're related to the 'stonewall' statement in fio.

Debugging continues.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

[-- Attachment #2: 0001-block-mq-fixup-queue-stall.patch --]
[-- Type: text/x-patch; name="0001-block-mq-fixup-queue-stall.patch", Size: 1440 bytes --]

From 82b15ff40d71aed318f9946881825f9f03ef8f48 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 24 Jan 2017 14:43:09 +0100
Subject: [PATCH] block-mq: fixup queue stall

__blk_mq_alloc_request() calls blk_mq_tag_busy(), which might result
in the queue becoming blocked. So we need to call blk_mq_tag_idle()
once the tag is finished, to wake up all waiters on the queue.

Patch is relative to the blk-mq-sched branch
 
Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 739a292..d52bcb1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -333,10 +333,12 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 {
 	const int sched_tag = rq->internal_tag;
 	struct request_queue *q = rq->q;
+	bool unbusy = false;
 
-	if (rq->rq_flags & RQF_MQ_INFLIGHT)
+	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		atomic_dec(&hctx->nr_active);
-
+		unbusy = true;
+	}
 	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->rq_flags = 0;
 
@@ -346,6 +348,9 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	if (unbusy)
+		blk_mq_tag_idle(hctx);
+
 	blk_queue_exit(q);
 }
 
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 15:54 [PATCH] queue stall with blk-mq-sched Hannes Reinecke
@ 2017-01-24 16:03 ` Jens Axboe
  2017-01-24 18:45   ` Hannes Reinecke
  2017-01-24 16:09 ` Jens Axboe
  1 sibling, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-24 16:03 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
> 
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stalls almost immediately after starting a
> sequential read :-(
>
> I've debugged this and come up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

I'll take a look at this. It sounds like all your grief is related to
shared tag maps, which nothing I have here uses. I'll see
if we are leaking it; you should be able to check that by reading the
'active' file in the sysfs directory.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 15:54 [PATCH] queue stall with blk-mq-sched Hannes Reinecke
  2017-01-24 16:03 ` Jens Axboe
@ 2017-01-24 16:09 ` Jens Axboe
  2017-01-24 18:49   ` Hannes Reinecke
  1 sibling, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-24 16:09 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
> 
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stalls almost immediately after starting a
> sequential read :-(
>
> I've debugged this and come up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

The patch isn't correct, the whole point of the un-idling is that it
ISN'T happening for every request completion. Otherwise you throw
away scalability. So a queue will go into active mode on the first
request, and idle when it's been idle for a bit. The active count
is used to divide up the tags.

So I'm assuming we're missing a queue run somewhere when we fail
getting a driver tag. The latter should only happen if the target
has IO in flight already, and the restart marking should take care
of it. Obviously there's a case where that is not true, since you
are seeing stalls.
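
For reference, the handshake I mean is roughly this (a simplified sketch,
not the exact code): dispatch marks the hctx when it fails to get a driver
tag, and the completion path re-runs the queue if the mark is set.

	/* dispatch side, on driver tag allocation failure */
	blk_mq_sched_mark_restart(hctx);	/* sets BLK_MQ_S_SCHED_RESTART */

	/* completion side */
	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
		blk_mq_run_hw_queue(hctx, true);
	}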

> However, even with the attached patch I'm seeing some queue stalls;
> looks like they're related to the 'stonewall' statement in fio.

I think you are heading down the wrong path. Your patch will cause
the symptoms to be a bit different, but you'll still run into cases
where we fail giving out the tag and then stall.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 16:03 ` Jens Axboe
@ 2017-01-24 18:45   ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-24 18:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/24/2017 05:03 PM, Jens Axboe wrote:
> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>> Hi Jens,
>>
>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>> latest mpt3sas patches fio stalls almost immediately after starting a
>> sequential read :-(
>>
>> I've debugged this and come up with the attached patch; we need to
>> restart waiters with blk_mq_tag_idle() after completing a tag.
>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>> calling blk_mq_tag_idle() is required when retiring a tag.
>
> I'll take a look at this. It sounds like all your grief is related to
> shared tag maps, which nothing I have here uses. I'll see
> if we are leaking it; you should be able to check that by reading the
> 'active' file in the sysfs directory.
>
Ah. That'll explain it.

Basically _all_ the HBAs I'm testing with have shared tag maps :-(

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 16:09 ` Jens Axboe
@ 2017-01-24 18:49   ` Hannes Reinecke
  2017-01-24 19:55     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-24 18:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/24/2017 05:09 PM, Jens Axboe wrote:
> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>> Hi Jens,
>>
>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>> latest mpt3sas patches fio stalls almost immediately after starting a
>> sequential read :-(
>>
>> I've debugged this and come up with the attached patch; we need to
>> restart waiters with blk_mq_tag_idle() after completing a tag.
>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>> calling blk_mq_tag_idle() is required when retiring a tag.
>
> The patch isn't correct, the whole point of the un-idling is that it
> ISN'T happening for every request completion. Otherwise you throw
> away scalability. So a queue will go into active mode on the first
> request, and idle when it's been idle for a bit. The active count
> is used to divide up the tags.
>
> So I'm assuming we're missing a queue run somewhere when we fail
> getting a driver tag. The latter should only happen if the target
> has IO in flight already, and the restart marking should take care
> of it. Obviously there's a case where that is not true, since you
> are seeing stalls.
>
But what is the point of the 'blk_mq_tag_busy()' thingie then?
When will it be reset?
The only places I've seen it reset are during
resize and teardown ... hence my patch.

>> However, even with the attached patch I'm seeing some queue stalls;
>> looks like they're related to the 'stonewall' statement in fio.
>
> I think you are heading down the wrong path. Your patch will cause
> the symptoms to be a bit different, but you'll still run into cases
> where we fail giving out the tag and then stall.
>
Hehe.
How did you know that?

That's indeed what I'm seeing.

Oh well, back to the drawing board...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 18:49   ` Hannes Reinecke
@ 2017-01-24 19:55     ` Jens Axboe
  2017-01-24 22:06       ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-24 19:55 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/24/2017 11:49 AM, Hannes Reinecke wrote:
> On 01/24/2017 05:09 PM, Jens Axboe wrote:
>> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>>> latest mpt3sas patches fio stalls almost immediately after starting a
>>> sequential read :-(
>>>
>>> I've debugged this and come up with the attached patch; we need to
>>> restart waiters with blk_mq_tag_idle() after completing a tag.
>>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>>> calling blk_mq_tag_idle() is required when retiring a tag.
>>
>> The patch isn't correct, the whole point of the un-idling is that it
>> ISN'T happening for every request completion. Otherwise you throw
>> away scalability. So a queue will go into active mode on the first
>> request, and idle when it's been idle for a bit. The active count
>> is used to divide up the tags.
>>
>> So I'm assuming we're missing a queue run somewhere when we fail
>> getting a driver tag. The latter should only happen if the target
>> has IO in flight already, and the restart marking should take care
>> of it. Obviously there's a case where that is not true, since you
>> are seeing stalls.
>>
> But what is the point of the 'blk_mq_tag_busy()' thingie then?
> When will it be reset?
> The only places I've seen it reset are during
> resize and teardown ... hence my patch.

The point is to have some count of how many queues are busy "lately",
which helps in dividing up the tags fairly. Hence we bump it as soon as
the queue goes active, and drop it after some delay. That's working as
expected.
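
Roughly, the allocation side then caps each active queue at its share of
the tag space; a simplified sketch of hctx_may_queue(), not the exact code:

	static bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				   struct sbitmap_queue *bt)
	{
		unsigned int depth, users;

		/* only shared tag maps need the fairness check */
		if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
			return true;

		users = atomic_read(&hctx->tags->active_queues);
		if (!users)
			return true;

		/* each active queue gets its share, but at least 4 tags */
		depth = max((bt->sb.depth + users - 1) / users, 4U);
		return atomic_read(&hctx->nr_active) < depth;
	}

blk_mq_tag_idle() then drops the queue back out of that count once it has
been quiet for a while.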

>>> However, even with the attached patch I'm seeing some queue stalls;
>>> looks like they're related to the 'stonewall' statement in fio.
>>
>> I think you are heading down the wrong path. Your patch will cause
>> the symptoms to be a bit different, but you'll still run into cases
>> where we fail giving out the tag and then stall.
>>
> Hehe.
> How did you know that?

My crystal ball :-)

> That's indeed what I'm seeing.
> 
> Oh well, back to the drawing board...

Try this patch. We only want to bump it for the driver tags, not the
scheduler side.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..c905aa1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,15 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 
 		rq = tags->static_rqs[tag];
 
-		if (blk_mq_tag_busy(data->hctx)) {
-			rq->rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-
 		if (data->flags & BLK_MQ_REQ_INTERNAL) {
 			rq->tag = -1;
 			rq->internal_tag = tag;
 		} else {
+			if (blk_mq_tag_busy(data->hctx)) {
+				rq->rq_flags = RQF_MQ_INFLIGHT;
+				atomic_inc(&data->hctx->nr_active);
+			}
+
 			rq->tag = tag;
 			rq->internal_tag = -1;
 		}
@@ -870,6 +870,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
 		data.hctx->tags->rqs[rq->tag] = rq;
+		if (blk_mq_tag_busy(data.hctx)) {
+			rq->rq_flags |= RQF_MQ_INFLIGHT;
+			atomic_inc(&data.hctx->nr_active);
+		}
 		goto done;
 	}
 

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 19:55     ` Jens Axboe
@ 2017-01-24 22:06       ` Jens Axboe
  2017-01-25  7:39         ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-24 22:06 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/24/2017 12:55 PM, Jens Axboe wrote:
> Try this patch. We only want to bump it for the driver tags, not the
> scheduler side.

More complete version, this one actually tested. I think this should fix
your issue, let me know.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a49ec77..1b156ca 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -90,9 +90,11 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	return atomic_read(&hctx->nr_active) < depth;
 }
 
-static int __blk_mq_get_tag(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt)
+static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
+			    struct sbitmap_queue *bt)
 {
-	if (!hctx_may_queue(hctx, bt))
+	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
+	    !hctx_may_queue(data->hctx, bt))
 		return -1;
 	return __sbitmap_queue_get(bt);
 }
@@ -118,7 +120,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		tag_offset = tags->nr_reserved_tags;
 	}
 
-	tag = __blk_mq_get_tag(data->hctx, bt);
+	tag = __blk_mq_get_tag(data, bt);
 	if (tag != -1)
 		goto found_tag;
 
@@ -129,7 +131,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	do {
 		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-		tag = __blk_mq_get_tag(data->hctx, bt);
+		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
 			break;
 
@@ -144,7 +146,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * Retry tag allocation after running the hardware queue,
 		 * as running the queue may also have found completions.
 		 */
-		tag = __blk_mq_get_tag(data->hctx, bt);
+		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
 			break;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..dcb5676 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,14 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 
 		rq = tags->static_rqs[tag];
 
-		if (blk_mq_tag_busy(data->hctx)) {
-			rq->rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-
 		if (data->flags & BLK_MQ_REQ_INTERNAL) {
 			rq->tag = -1;
 			rq->internal_tag = tag;
 		} else {
+			if (blk_mq_tag_busy(data->hctx)) {
+				rq->rq_flags = RQF_MQ_INFLIGHT;
+				atomic_inc(&data->hctx->nr_active);
+			}
 			rq->tag = tag;
 			rq->internal_tag = -1;
 		}
@@ -869,6 +868,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		if (blk_mq_tag_busy(data.hctx)) {
+			rq->rq_flags |= RQF_MQ_INFLIGHT;
+			atomic_inc(&data.hctx->nr_active);
+		}
 		data.hctx->tags->rqs[rq->tag] = rq;
 		goto done;
 	}

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-24 22:06       ` Jens Axboe
@ 2017-01-25  7:39         ` Hannes Reinecke
  2017-01-25  8:07           ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-25  7:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/24/2017 11:06 PM, Jens Axboe wrote:
> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>> Try this patch. We only want to bump it for the driver tags, not the
>> scheduler side.
> 
> More complete version, this one actually tested. I think this should fix
> your issue, let me know.
> 
Nearly there.
The initial stall is gone, but the test got hung at the 'stonewall'
sequence again:

[global]
bs=4k
ioengine=libaio
iodepth=256
size=4g
direct=1
runtime=60
# directory=/mnt
numjobs=32
group_reporting
cpus_allowed_policy=split
filename=/dev/md127

[seq-read]
rw=read
-> stonewall

[rand-read]
rw=randread
stonewall

Restarting all queues made the fio job continue.
There were 4 queues with state 'restart', and one queue with state 'active'.
So we're missing a queue run somewhere.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25  7:39         ` Hannes Reinecke
@ 2017-01-25  8:07           ` Hannes Reinecke
  2017-01-25 11:10             ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-25  8:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>> Try this patch. We only want to bump it for the driver tags, not the
>>> scheduler side.
>>
>> More complete version, this one actually tested. I think this should fix
>> your issue, let me know.
>>
> Nearly there.
> The initial stall is gone, but the test got hung at the 'stonewall'
> sequence again:
> 
> [global]
> bs=4k
> ioengine=libaio
> iodepth=256
> size=4g
> direct=1
> runtime=60
> # directory=/mnt
> numjobs=32
> group_reporting
> cpus_allowed_policy=split
> filename=/dev/md127
> 
> [seq-read]
> rw=read
> -> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> Restarting all queues made the fio job continue.
> There were 4 queues with state 'restart', and one queue with state 'active'.
> So we're missing a queue run somewhere.
> 
I've found the queue stalls are gone with this patch:

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..de5db6c 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
 }

 static inline void
+blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
+{
+       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+               blk_mq_run_hw_queue(hctx, true);
+       }
+}
+
+static inline void
 blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 {
        struct elevator_queue *e = hctx->queue->elevator;
@@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
        BUG_ON(rq->internal_tag == -1);

        blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
-
-       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-               blk_mq_run_hw_queue(hctx, true);
-       }
 }

 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..63799ad 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
                blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
        if (sched_tag != -1)
                blk_mq_sched_completed_request(hctx, rq);
+       blk_mq_sched_restart(hctx);
        blk_queue_exit(q);
 }

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25  8:07           ` Hannes Reinecke
@ 2017-01-25 11:10             ` Hannes Reinecke
  2017-01-25 15:52               ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-25 11:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>>> Try this patch. We only want to bump it for the driver tags, not the
>>>> scheduler side.
>>>
>>> More complete version, this one actually tested. I think this should fix
>>> your issue, let me know.
>>>
>> Nearly there.
>> The initial stall is gone, but the test got hung at the 'stonewall'
>> sequence again:
>>
>> [global]
>> bs=4k
>> ioengine=libaio
>> iodepth=256
>> size=4g
>> direct=1
>> runtime=60
>> # directory=/mnt
>> numjobs=32
>> group_reporting
>> cpus_allowed_policy=split
>> filename=/dev/md127
>>
>> [seq-read]
>> rw=read
>> -> stonewall
>>
>> [rand-read]
>> rw=randread
>> stonewall
>>
>> Restarting all queues made the fio job continue.
>> There were 4 queues with state 'restart', and one queue with state 'active'.
>> So we're missing a queue run somewhere.
>>
> I've found the queue stalls are gone with this patch:
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 6b465bc..de5db6c 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>  }
> 
>  static inline void
> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> +{
> +       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> +               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +               blk_mq_run_hw_queue(hctx, true);
> +       }
> +}
> +
> +static inline void
>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>         BUG_ON(rq->internal_tag == -1);
> 
>         blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
> -
> -       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> -               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> -               blk_mq_run_hw_queue(hctx, true);
> -       }
>  }
> 
>  static inline void blk_mq_sched_started_request(struct request *rq)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..63799ad 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>                 blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>         if (sched_tag != -1)
>                 blk_mq_sched_completed_request(hctx, rq);
> +       blk_mq_sched_restart(hctx);
>         blk_queue_exit(q);
>  }
> 
Bah.

Not quite. I'm still seeing some queues with state 'restart'.

I've found that I need another patch on top of that:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..edcbb44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)

                queue_for_each_hw_ctx(q, hctx, i) {
                        /* the hctx may be unmapped, so check it here */
-                       if (blk_mq_hw_queue_mapped(hctx))
+                       if (blk_mq_hw_queue_mapped(hctx)) {
                                blk_mq_tag_idle(hctx);
+                               blk_mq_sched_restart(hctx);
+                       }
                }
        }
        blk_queue_exit(q);


The reasoning is that in blk_mq_get_tag() we might end up scheduling the
request on another hctx, but the original hctx might still have the
SCHED_RESTART bit set.
That bit will never be cleared, as we complete the request on a different hctx,
so anything we do on the end_request side won't do us any good.
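
The migration happens in the sleep path of blk_mq_get_tag(); roughly this
(a simplified sketch from memory, not the exact code):

	/* in the wait loop of blk_mq_get_tag() */
	blk_mq_put_ctx(data->ctx);
	io_schedule();

	/* we may wake up on another CPU, so ctx and hctx get re-mapped */
	data->ctx = blk_mq_get_ctx(data->q);
	data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);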

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 11:10             ` Hannes Reinecke
@ 2017-01-25 15:52               ` Jens Axboe
  2017-01-25 16:57                 ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-25 15:52 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
> On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
>> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>>>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>>>> Try this patch. We only want to bump it for the driver tags, not the
>>>>> scheduler side.
>>>>
>>>> More complete version, this one actually tested. I think this should fix
>>>> your issue, let me know.
>>>>
>>> Nearly there.
>>> The initial stall is gone, but the test got hung at the 'stonewall'
>>> sequence again:
>>>
>>> [global]
>>> bs=4k
>>> ioengine=libaio
>>> iodepth=256
>>> size=4g
>>> direct=1
>>> runtime=60
>>> # directory=/mnt
>>> numjobs=32
>>> group_reporting
>>> cpus_allowed_policy=split
>>> filename=/dev/md127
>>>
>>> [seq-read]
>>> rw=read
>>> -> stonewall
>>>
>>> [rand-read]
>>> rw=randread
>>> stonewall
>>>
>>> Restarting all queues made the fio job continue.
>>> There were 4 queues with state 'restart', and one queue with state 'active'.
>>> So we're missing a queue run somewhere.
>>>
>> I've found the queue stalls are gone with this patch:
>>
>> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
>> index 6b465bc..de5db6c 100644
>> --- a/block/blk-mq-sched.h
>> +++ b/block/blk-mq-sched.h
>> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>>  }
>>
>>  static inline void
>> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>> +{
>> +       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> +               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> +               blk_mq_run_hw_queue(hctx, true);
>> +       }
>> +}
>> +
>> +static inline void
>>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
>>  {
>>         struct elevator_queue *e = hctx->queue->elevator;
>> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>>         BUG_ON(rq->internal_tag == -1);
>>
>>         blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
>> -
>> -       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> -               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> -               blk_mq_run_hw_queue(hctx, true);
>> -       }
>>  }
>>
>>  static inline void blk_mq_sched_started_request(struct request *rq)
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e872555..63799ad 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>>                 blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>>         if (sched_tag != -1)
>>                 blk_mq_sched_completed_request(hctx, rq);
>> +       blk_mq_sched_restart(hctx);
>>         blk_queue_exit(q);
>>  }
>>
> Bah.
> 
> Not quite. I'm still seeing some queues with state 'restart'.
> 
> I've found that I need another patch on top of that:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..edcbb44 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> 
>                 queue_for_each_hw_ctx(q, hctx, i) {
>                         /* the hctx may be unmapped, so check it here */
> -                       if (blk_mq_hw_queue_mapped(hctx))
> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>                                 blk_mq_tag_idle(hctx);
> +                               blk_mq_sched_restart(hctx);
> +                       }
>                 }
>         }
>         blk_queue_exit(q);
> 
> 
> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
> request on another hctx, but the original hctx might still have the
> SCHED_RESTART bit set.
> That bit will never be cleared, as we complete the request on a different hctx,
> so anything we do on the end_request side won't do us any good.

I think you are right, it'll potentially trigger with shared tags and
multiple hardware queues. I'll debug this today and come up with a
decent fix.

I committed the previous patch, fwiw.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 15:52               ` Jens Axboe
@ 2017-01-25 16:57                 ` Hannes Reinecke
  2017-01-25 17:03                   ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-25 16:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/25/2017 04:52 PM, Jens Axboe wrote:
> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
[ .. ]
>> Bah.
>>
>> Not quite. I'm still seeing some queues with state 'restart'.
>>
>> I've found that I need another patch on top of that:
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e872555..edcbb44 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>
>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>                         /* the hctx may be unmapped, so check it here */
>> -                       if (blk_mq_hw_queue_mapped(hctx))
>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>                                 blk_mq_tag_idle(hctx);
>> +                               blk_mq_sched_restart(hctx);
>> +                       }
>>                 }
>>         }
>>         blk_queue_exit(q);
>>
>>
>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>> request on another hctx, but the original hctx might still have the
>> SCHED_RESTART bit set.
>> That bit will never be cleared, as we complete the request on a different hctx,
>> so anything we do on the end_request side won't do us any good.
>
> I think you are right, it'll potentially trigger with shared tags and
> multiple hardware queues. I'll debug this today and come up with a
> decent fix.
>
> I committed the previous patch, fwiw.
>
THX.

The above patch _does_ help in the sense that my testcase now completes
without stalls. And I even get decent performance with the mq-sched
fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
when running without I/O scheduling.
Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
getting there.

However, I do get a noticeable stall during the stonewall sequence
before the timeout handler kicks in, so there must be a better way of
handling this.

But nevertheless, thanks for all your work here.
Very much appreciated.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 16:57                 ` Hannes Reinecke
@ 2017-01-25 17:03                   ` Jens Axboe
  2017-01-25 17:42                     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-25 17:03 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
> [ .. ]
>>> Bah.
>>>
>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>
>>> I've found that I need another patch on top of that:
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index e872555..edcbb44 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>
>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>                         /* the hctx may be unmapped, so check it here */
>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>                                 blk_mq_tag_idle(hctx);
>>> +                               blk_mq_sched_restart(hctx);
>>> +                       }
>>>                 }
>>>         }
>>>         blk_queue_exit(q);
>>>
>>>
>>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>> request on another hctx, but the original hctx might still have the
>>> SCHED_RESTART bit set.
>>> That bit will never be cleared, as we complete the request on a different hctx,
>>> so anything we do on the end_request side won't do us any good.
>>
>> I think you are right, it'll potentially trigger with shared tags and
>> multiple hardware queues. I'll debug this today and come up with a
>> decent fix.
>>
>> I committed the previous patch, fwiw.
>>
> THX.
> 
> The above patch _does_ help in the sense that my testcase now completes
> without stalls. And I even get decent performance with the mq-sched
> fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
> when running without I/O scheduling.
> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
> getting there.
> 
> However, I do get a noticeable stall during the stonewall sequence
> before the timeout handler kicks in, so there must be a better way of
> handling this.
> 
> But nevertheless, thanks for all your work here.
> Very much appreciated.

Yeah, the fix isn't really a fix, unless you are willing to tolerate
potentially tens of seconds of extra latency until we idle it out :-)

So we can't use the un-idling for this, but we can track it on the
shared state, which is the tags. The problem isn't that we are
switching to a new hardware queue, it's if we mark the hardware queue
as restart AND it has nothing pending. In that case, we'll never
get it restarted, since IO completion is what restarts it.

I need to handle that case separately. Currently testing a patch, I
should have something for you to test later today.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 17:03                   ` Jens Axboe
@ 2017-01-25 17:42                     ` Jens Axboe
  2017-01-25 22:27                       ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-25 17:42 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/25/2017 10:03 AM, Jens Axboe wrote:
> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>> [ .. ]
>>>> Bah.
>>>>
>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>
>>>> I've found that I need another patch on top of that:
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index e872555..edcbb44 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>
>>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>>                         /* the hctx may be unmapped, so check it here */
>>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>>                                 blk_mq_tag_idle(hctx);
>>>> +                               blk_mq_sched_restart(hctx);
>>>> +                       }
>>>>                 }
>>>>         }
>>>>         blk_queue_exit(q);
>>>>
>>>>
>>>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>> request on another hctx, but the original hctx might still have the
>>>> SCHED_RESTART bit set.
>>>> That bit will never be cleared, as we complete the request on a different hctx,
>>>> so anything we do on the end_request side won't do us any good.
>>>
>>> I think you are right, it'll potentially trigger with shared tags and
>>> multiple hardware queues. I'll debug this today and come up with a
>>> decent fix.
>>>
>>> I committed the previous patch, fwiw.
>>>
>> THX.
>>
>> The above patch _does_ help in the sense that my testcase now completes
>> without stalls. And I even get decent performance with the mq-sched
>> fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
>> when running without I/O scheduling.
>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>> getting there.
>>
>> However, I do get a noticeable stall during the stonewall sequence
>> before the timeout handler kicks in, so there must be a better way of
>> handling this.
>>
>> But nevertheless, thanks for all your work here.
>> Very much appreciated.
> 
> Yeah, the fix isn't really a fix, unless you are willing to tolerate
> potentially tens of seconds of extra latency until we idle it out :-)
> 
> So we can't use the un-idling for this, but we can track it on the
> shared state, which is the tags. The problem isn't that we are
> switching to a new hardware queue, it's if we mark the hardware queue
> as restart AND it has nothing pending. In that case, we'll never
> get it restarted, since IO completion is what restarts it.
> 
> I need to handle that case separately. Currently testing a patch, I
> should have something for you to test later today.

Can you try this one?


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..6a1656d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -300,6 +300,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (blk_mq_hctx_has_pending(hctx))
+			blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int i;
+
+	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+		blk_mq_sched_restart_hctx(hctx);
+	else {
+		struct request_queue *q = hctx->queue;
+
+		if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+			return;
+
+		clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+		queue_for_each_hw_ctx(q, hctx, i)
+			blk_mq_sched_restart_hctx(hctx);
+	}
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -123,11 +124,6 @@ blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	BUG_ON(rq->internal_tag == -1);
 
 	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
-
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		blk_mq_run_hw_queue(hctx, true);
-	}
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
@@ -160,8 +156,15 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
+			struct request_queue *q = hctx->queue;
+
+			if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+				set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+		}
+	}
 }
 
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c3e667..3951b72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,7 +40,7 @@ static LIST_HEAD(all_q_list);
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
-static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return sbitmap_any_bit_set(&hctx->ctx_map) ||
 			!list_empty_careful(&hctx->dispatch) ||
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
 
@@ -928,8 +929,16 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
+
+			/*
+			 * We failed getting a driver tag. Mark the queue(s)
+			 * as needing a restart. Retry getting a tag again,
+			 * in case the needed IO completed right before we
+			 * marked the queue as needing a restart.
+			 */
 			blk_mq_sched_mark_restart(hctx);
-			break;
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
 		}
 		list_del_init(&rq->queuelist);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c24b90..077a400 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,6 +33,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ee1fb59..40ce491 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -607,6 +607,7 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
+#define QUEUE_FLAG_RESTART     28
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 17:42                     ` Jens Axboe
@ 2017-01-25 22:27                       ` Jens Axboe
  2017-01-26 16:35                         ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2017-01-25 22:27 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/25/2017 10:42 AM, Jens Axboe wrote:
> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>> [ .. ]
>>>>> Bah.
>>>>>
>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>
>>>>> I've found that I need another patch on top of that:
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index e872555..edcbb44 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>>
>>>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>>>                         /* the hctx may be unmapped, so check it here */
>>>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>                                 blk_mq_tag_idle(hctx);
>>>>> +                               blk_mq_sched_restart(hctx);
>>>>> +                       }
>>>>>                 }
>>>>>         }
>>>>>         blk_queue_exit(q);
>>>>>
>>>>>
>>>>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>> request on another hctx, but the original hctx might still have the
>>>>> SCHED_RESTART bit set.
>>>>> That bit will never be cleared, as we complete the request on a different hctx,
>>>>> so anything we do on the end_request side won't do us any good.
>>>>
>>>> I think you are right, it'll potentially trigger with shared tags and
>>>> multiple hardware queues. I'll debug this today and come up with a
>>>> decent fix.
>>>>
>>>> I committed the previous patch, fwiw.
>>>>
>>> THX.
>>>
>>> The above patch _does_ help in the sense that my testcase now completes
>>> without stalls. And I even get decent performance with the mq-sched
>>> fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
>>> when running without I/O scheduling.
>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>>> getting there.
>>>
>>> However, I do get a noticeable stall during the stonewall sequence
>>> before the timeout handler kicks in, so there must be a better way of
>>> handling this.
>>>
>>> But nevertheless, thanks for all your work here.
>>> Very much appreciated.
>>
>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>> potentially tens of seconds of extra latency until we idle it out :-)
>>
>> So we can't use the un-idling for this, but we can track it on the
>> shared state, which is the tags. The problem isn't that we are
>> switching to a new hardware queue, it's if we mark the hardware queue
>> as restart AND it has nothing pending. In that case, we'll never
>> get it restarted, since IO completion is what restarts it.
>>
>> I need to handle that case separately. Currently testing a patch, I
>> should have something for you to test later today.
> 
> Can you try this one?

And another variant, this one should be better in that it should result
in fewer queue runs and get better merging. Hope it works with your
stalls as well.


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..6a1656d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -300,6 +300,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (blk_mq_hctx_has_pending(hctx))
+			blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int i;
+
+	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+		blk_mq_sched_restart_hctx(hctx);
+	else {
+		struct request_queue *q = hctx->queue;
+
+		if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+			return;
+
+		clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+		queue_for_each_hw_ctx(q, hctx, i)
+			blk_mq_sched_restart_hctx(hctx);
+	}
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -123,11 +124,6 @@ blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	BUG_ON(rq->internal_tag == -1);
 
 	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
-
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		blk_mq_run_hw_queue(hctx, true);
-	}
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
@@ -160,8 +156,15 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
+			struct request_queue *q = hctx->queue;
+
+			if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+				set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+		}
+	}
 }
 
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c3e667..4eb732a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,7 +40,7 @@ static LIST_HEAD(all_q_list);
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
-static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return sbitmap_any_bit_set(&hctx->ctx_map) ||
 			!list_empty_careful(&hctx->dispatch) ||
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
 
@@ -879,6 +880,21 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 	return false;
 }
 
+static void blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
+				  struct request *rq)
+{
+	if (rq->tag == -1 || rq->internal_tag == -1)
+		return;
+
+	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+	rq->tag = -1;
+
+	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
+		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
+		atomic_dec(&hctx->nr_active);
+	}
+}
+
 /*
  * If we fail getting a driver tag because all the driver tags are already
  * assigned and on the dispatch list, BUT the first entry does not have a
@@ -928,8 +944,16 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
+
+			/*
+			 * We failed getting a driver tag. Mark the queue(s)
+			 * as needing a restart. Retry getting a tag again,
+			 * in case the needed IO completed right before we
+			 * marked the queue as needing a restart.
+			 */
 			blk_mq_sched_mark_restart(hctx);
-			break;
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
 		}
 		list_del_init(&rq->queuelist);
 
@@ -943,6 +967,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
+			blk_mq_put_driver_tag(hctx, rq);
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c24b90..077a400 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,6 +33,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ee1fb59..40ce491 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -607,6 +607,7 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
+#define QUEUE_FLAG_RESTART     28
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-25 22:27                       ` Jens Axboe
@ 2017-01-26 16:35                         ` Hannes Reinecke
  2017-01-26 16:42                           ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-26 16:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/25/2017 11:27 PM, Jens Axboe wrote:
> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>> [ .. ]
>>>>>> Bah.
>>>>>>
>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>
>>>>>> I've found that I need another patch on top of that:
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index e872555..edcbb44 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>>>
>>>>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>                         /* the hctx may be unmapped, so check it here */
>>>>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>>>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>                                 blk_mq_tag_idle(hctx);
>>>>>> +                               blk_mq_sched_restart(hctx);
>>>>>> +                       }
>>>>>>                 }
>>>>>>         }
>>>>>>         blk_queue_exit(q);
>>>>>>
>>>>>>
>>>>>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>> request on another hctx, but the original hctx might still have the
>>>>>> SCHED_RESTART bit set.
>>>>>> That bit will never be cleared, as we complete the request on a different hctx,
>>>>>> so anything we do on the end_request side won't do us any good.
>>>>>
>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>> decent fix.
>>>>>
>>>>> I committed the previous patch, fwiw.
>>>>>
>>>> THX.
>>>>
>>>> The above patch _does_ help in the sense that my testcase now completes
>>>> without stalls. And I even get decent performance with the mq-sched
>>>> fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
>>>> when running without I/O scheduling.
>>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>>>> getting there.
>>>>
>>>> However, I do get a noticeable stall during the stonewall sequence
>>>> before the timeout handler kicks in, so there must be a better way of
>>>> handling this.
>>>>
>>>> But nevertheless, thanks for all your work here.
>>>> Very much appreciated.
>>>
>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>
>>> So we can't use the un-idling for this, but we can track it on the
>>> shared state, which is the tags. The problem isn't that we are
>>> switching to a new hardware queue, it's if we mark the hardware queue
>>> as restart AND it has nothing pending. In that case, we'll never
>>> get it restarted, since IO completion is what restarts it.
>>>
>>> I need to handle that case separately. Currently testing a patch, I
>>> should have something for you to test later today.
>>
>> Can you try this one?
> 
> And another variant, this one should be better in that it should result
> in fewer queue runs and better merging. Hope it works with your
> stalls as well.
> 
> 

Looking good; queue stalls are gone, and performance is okay-ish.
I'm getting 84k IOPs now, which is not bad.

But we absolutely need to work on I/O merging; with CFQ I'm seeing
requests about double the size of those done by mq-deadline.
(Bit unfair, I know :-)
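
In case anyone wants to double-check such numbers: read count, read merges
and sectors read are the first three fields of /sys/block/<dev>/stat (see
Documentation/block/stat.txt), so the average request size falls out
directly. A minimal user-space sketch, assuming the 11-field stat layout;
'avgrq' and the device name are purely illustrative:

/* avgrq.c - print read merges and average read request size from
 * /sys/block/<dev>/stat. Rough sketch, not a tuned tool; assumes the
 * 11-field layout (read I/Os, read merges, read sectors, ...).
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	unsigned long long ios, merges, sectors;
	char path[128];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dev>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/sys/block/%s/stat", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu %llu %llu", &ios, &merges, &sectors) != 3) {
		fprintf(stderr, "%s: unexpected format\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	/* stat counts sectors in 512-byte units regardless of device */
	printf("%s: reads %llu, merges %llu, avg rq size %.1f KiB\n",
	       argv[1], ios, merges,
	       ios ? (double)sectors * 512 / ios / 1024 : 0.0);
	return 0;
}

Sampling that before and after each fio run gives the per-scheduler merge
count and average request size without having to rely on iostat.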

I'll be having some more data in time for LSF/MM.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-26 16:35                         ` Hannes Reinecke
@ 2017-01-26 16:42                           ` Jens Axboe
  2017-01-26 19:20                             ` Jens Axboe
  2017-01-27  6:58                             ` Hannes Reinecke
  0 siblings, 2 replies; 19+ messages in thread
From: Jens Axboe @ 2017-01-26 16:42 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/26/2017 09:35 AM, Hannes Reinecke wrote:
> On 01/25/2017 11:27 PM, Jens Axboe wrote:
>> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>>> [ .. ]
>>>>>>> Bah.
>>>>>>>
>>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>>
>>>>>>> I've found that I need another patch on top of that:
>>>>>>>
>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>> index e872555..edcbb44 100644
>>>>>>> --- a/block/blk-mq.c
>>>>>>> +++ b/block/blk-mq.c
>>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct
>>>>>>> *work)
>>>>>>>
>>>>>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>>                         /* the hctx may be unmapped, so check it here */
>>>>>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>>>>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>>                                 blk_mq_tag_idle(hctx);
>>>>>>> +                               blk_mq_sched_restart(hctx);
>>>>>>> +                       }
>>>>>>>                 }
>>>>>>>         }
>>>>>>>         blk_queue_exit(q);
>>>>>>>
>>>>>>>
>>>>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>>> request on another hctx, but the original hctx might still have the
>>>>>>> SCHED_RESTART bit set.
>>>>>>> Which will never be cleared as we complete the request on a different hctx,
>>>>>>> so anything we do on the end_request side won't do us any good.
>>>>>>
>>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>>> decent fix.
>>>>>>
>>>>>> I committed the previous patch, fwiw.
>>>>>>
>>>>> THX.
>>>>>
>>>>> The above patch _does_ help in the sense that my testcase now completes 
>>>>> without stalls. And I even get a decent performance with the mq-sched 
>>>>> fixes: 82k IOPs sequential read with mq-deadline as compared to 44k IOPs 
>>>>> when running without I/O scheduling.
>>>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>>>>> getting there.
>>>>>
>>>>> However, I do get a noticeable stall during the stonewall sequence 
>>>>> before the timeout handler kicks in, so there must be a better way of 
>>>>> handling this.
>>>>>
>>>>> But nevertheless, thanks for all your work here.
>>>>> Very much appreciated.
>>>>
>>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>>
>>>> So we can't use the un-idling for this, but we can track it on the
>>>> shared state, which is the tags. The problem isn't that we are
>>>> switching to a new hardware queue, it's if we mark the hardware queue
>>>> as restart AND it has nothing pending. In that case, we'll never
>>>> get it restarted, since IO completion is what restarts it.
>>>>
>>>> I need to handle that case separately. Currently testing a patch, I
>>>> should have something for you to test later today.
>>>
>>> Can you try this one?
>>
>> And another variant, this one should be better in that it should result
>> in fewer queue runs and better merging. Hope it works with your
>> stalls as well.
>>
>>
> 
> Looking good; queue stalls are gone, and performance is okay-ish.
> I'm getting 84k IOPs now, which is not bad.

Is that a tested-by?

> But we absolutely need to work on I/O merging; with CFQ I'm seeing
> requests about double the size of those done by mq-deadline.
> (Bit unfair, I know :-)
> 
> I'll be having some more data in time for LSF/MM.

I agree, looking at the performance delta, it's all about merging. It's
fairly easy to observe with mq-deadline, as merging rates drop
proportionally to the number of queues configured. But even with 1 queue
with scsi-mq, we're still seeing lower merging rates than !mq +
deadline, for instance.

I'll look at the merging case, it should not be that hard to bring at
least the single queue case to parity with !mq. I'm actually surprised
it isn't already.
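
In the meantime, if you want to take the queue mapping out of the picture
when you grab your merge numbers, pin the job to one CPU; submissions then
all flow through a single software queue, so whatever merge delta remains
should mostly be down to the scheduler itself. A rough fio job sketch,
assuming libaio and /dev/sdc purely as examples:

[global]
direct=1
ioengine=libaio
iodepth=32
bs=4k
rw=read
runtime=30
time_based
cpus_allowed=0

[seqread]
filename=/dev/sdc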

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-26 16:42                           ` Jens Axboe
@ 2017-01-26 19:20                             ` Jens Axboe
  2017-01-27  6:58                             ` Hannes Reinecke
  1 sibling, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2017-01-26 19:20 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-block, Omar Sandoval

On 01/26/2017 09:42 AM, Jens Axboe wrote:
> On 01/26/2017 09:35 AM, Hannes Reinecke wrote:
>> On 01/25/2017 11:27 PM, Jens Axboe wrote:
>>> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>>>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>>>> [ .. ]
>>>>>>>> Bah.
>>>>>>>>
>>>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>>>
>>>>>>>> I've found that I need another patch on top of that:
>>>>>>>>
>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>>> index e872555..edcbb44 100644
>>>>>>>> --- a/block/blk-mq.c
>>>>>>>> +++ b/block/blk-mq.c
>>>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct
>>>>>>>> *work)
>>>>>>>>
>>>>>>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>>>                         /* the hctx may be unmapped, so check it here */
>>>>>>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>>>>>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>>>                                 blk_mq_tag_idle(hctx);
>>>>>>>> +                               blk_mq_sched_restart(hctx);
>>>>>>>> +                       }
>>>>>>>>                 }
>>>>>>>>         }
>>>>>>>>         blk_queue_exit(q);
>>>>>>>>
>>>>>>>>
>>>>>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>>>> request on another hctx, but the original hctx might still have the
>>>>>>>> SCHED_RESTART bit set.
>>>>>>>> Which will never be cleared as we complete the request on a different hctx,
>>>>>>>> so anything we do on the end_request side won't do us any good.
>>>>>>>
>>>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>>>> decent fix.
>>>>>>>
>>>>>>> I committed the previous patch, fwiw.
>>>>>>>
>>>>>> THX.
>>>>>>
>>>>>> The above patch _does_ help in the sense that my testcase now completes 
>>>>>> without stalls. And I even get a decent performance with the mq-sched 
>>>>>> fixes: 82k IOPs sequential read with mq-deadline as compared to 44k IOPs 
>>>>>> when running without I/O scheduling.
>>>>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>>>>>> getting there.
>>>>>>
>>>>>> However, I do get a noticeable stall during the stonewall sequence 
>>>>>> before the timeout handler kicks in, so there must be a better way of 
>>>>>> handling this.
>>>>>>
>>>>>> But nevertheless, thanks for all your work here.
>>>>>> Very much appreciated.
>>>>>
>>>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>>>
>>>>> So we can't use the un-idling for this, but we can track it on the
>>>>> shared state, which is the tags. The problem isn't that we are
>>>>> switching to a new hardware queue, it's if we mark the hardware queue
>>>>> as restart AND it has nothing pending. In that case, we'll never
>>>>> get it restarted, since IO completion is what restarts it.
>>>>>
>>>>> I need to handle that case separately. Currently testing a patch, I
>>>>> should have something for you to test later today.
>>>>
>>>> Can you try this one?
>>>
>>> And another variant, this one should be better in that it should result
>>> in fewer queue runs and better merging. Hope it works with your
>>> stalls as well.
>>>
>>>
>>
>> Looking good; queue stalls are gone, and performance is okay-ish.
>> I'm getting 84k IOPs now, which is not bad.
> 
> Is that a tested-by?
> 
>> But we absolutely need to work on I/O merging; with CFQ I'm seeing
>> requests about double the size of those done by mq-deadline.
>> (Bit unfair, I know :-)
>>
>> I'll be having some more data in time for LSF/MM.
> 
> I agree, looking at the performance delta, it's all about merging. It's
> fairly easy to observe with mq-deadline, as merging rates drop
> proportionally to the number of queues configured. But even with 1 queue
> with scsi-mq, we're still seeing lower merging rates than !mq +
> deadline, for instance.
> 
> I'll look at the merging case, it should not be that hard to bring at
> least the single queue case to parity with !mq. I'm actually surprised
> it isn't already.

Can you give this a whirl? It's basically the same as the previous
patch, but it dispatches one request at a time instead of
pulling everything off the queue. That could have an impact on merging.
Merge rates should go up with this patch, and I believe it's the merge
rates that are causing the lower performance for you compared to
!mq + cfq/deadline.


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..b5d1c80 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -200,15 +200,19 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * leave them there for as long as we can. Mark the hw queue as
 	 * needing a restart in that case.
 	 */
-	if (list_empty(&rq_list)) {
-		if (e && e->type->ops.mq.dispatch_requests)
-			e->type->ops.mq.dispatch_requests(hctx, &rq_list);
-		else
-			blk_mq_flush_busy_ctxs(hctx, &rq_list);
-	} else
+	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart(hctx);
-
-	blk_mq_dispatch_rq_list(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(hctx, &rq_list);
+	} else if (!e || !e->type->ops.mq.dispatch_requests) {
+		blk_mq_flush_busy_ctxs(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(hctx, &rq_list);
+	} else {
+		do {
+			e->type->ops.mq.dispatch_requests(hctx, &rq_list);
+			if (list_empty(&rq_list))
+				break;
+		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
+	}
 }
 
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -300,6 +304,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (blk_mq_hctx_has_pending(hctx))
+			blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int i;
+
+	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+		blk_mq_sched_restart_hctx(hctx);
+	else {
+		struct request_queue *q = hctx->queue;
+
+		if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+			return;
+
+		clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+		queue_for_each_hw_ctx(q, hctx, i)
+			blk_mq_sched_restart_hctx(hctx);
+	}
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -123,11 +124,6 @@ blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	BUG_ON(rq->internal_tag == -1);
 
 	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
-
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		blk_mq_run_hw_queue(hctx, true);
-	}
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
@@ -160,8 +156,15 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
+			struct request_queue *q = hctx->queue;
+
+			if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+				set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+		}
+	}
 }
 
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c3e667..5d3566c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,7 +40,7 @@ static LIST_HEAD(all_q_list);
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
-static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return sbitmap_any_bit_set(&hctx->ctx_map) ||
 			!list_empty_careful(&hctx->dispatch) ||
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
 
@@ -879,6 +880,21 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 	return false;
 }
 
+static void blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
+				  struct request *rq)
+{
+	if (rq->tag == -1 || rq->internal_tag == -1)
+		return;
+
+	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+	rq->tag = -1;
+
+	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
+		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
+		atomic_dec(&hctx->nr_active);
+	}
+}
+
 /*
  * If we fail getting a driver tag because all the driver tags are already
  * assigned and on the dispatch list, BUT the first entry does not have a
@@ -928,8 +944,16 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
+
+			/*
+			 * We failed getting a driver tag. Mark the queue(s)
+			 * as needing a restart. Retry getting a tag again,
+			 * in case the needed IO completed right before we
+			 * marked the queue as needing a restart.
+			 */
 			blk_mq_sched_mark_restart(hctx);
-			break;
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
 		}
 		list_del_init(&rq->queuelist);
 
@@ -943,6 +967,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
+			blk_mq_put_driver_tag(hctx, rq);
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
@@ -973,7 +998,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	 */
 	if (!list_empty(list)) {
 		spin_lock(&hctx->lock);
-		list_splice(list, &hctx->dispatch);
+		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1476,7 +1501,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	if (q->elevator) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
-		blk_mq_sched_insert_request(rq, false, true, true);
+		blk_mq_sched_insert_request(rq, false, true, !is_sync || is_flush_fua);
 		goto done;
 	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1585,7 +1610,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	if (q->elevator) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
-		blk_mq_sched_insert_request(rq, false, true, true);
+		blk_mq_sched_insert_request(rq, false, true, !is_sync || is_flush_fua);
 		goto done;
 	}
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c24b90..077a400 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,6 +33,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index a01986d..d30a35a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -291,10 +291,14 @@ static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
 				 struct list_head *rq_list)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	struct request *rq;
 
 	spin_lock(&dd->lock);
-	blk_mq_sched_move_to_dispatch(hctx, rq_list, __dd_dispatch_request);
+	rq = __dd_dispatch_request(hctx);
 	spin_unlock(&dd->lock);
+
+	if (rq)
+		list_add_tail(&rq->queuelist, rq_list);
 }
 
 static void dd_exit_queue(struct elevator_queue *e)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ee1fb59..40ce491 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -607,6 +607,7 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
+#define QUEUE_FLAG_RESTART     28
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] queue stall with blk-mq-sched
  2017-01-26 16:42                           ` Jens Axboe
  2017-01-26 19:20                             ` Jens Axboe
@ 2017-01-27  6:58                             ` Hannes Reinecke
  1 sibling, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2017-01-27  6:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Omar Sandoval

On 01/26/2017 05:42 PM, Jens Axboe wrote:
> On 01/26/2017 09:35 AM, Hannes Reinecke wrote:
>> On 01/25/2017 11:27 PM, Jens Axboe wrote:
[ .. ]
>>> And another variant, this one should be better in that it should result
>>> in fewer queue runs and better merging. Hope it works with your
>>> stalls as well.
>>>
>>>
>>
>> Looking good; queue stalls are gone, and performance is okay-ish.
>> I'm getting 84k IOPs now, which is not bad.
> 
> Is that a tested-by?
> 
Not yet; while doing the performance analysis the system hit a queue
stall with the _legacy_ SQ path.
I need to figure out whether it's my mpt3sas patches or something else.

>> But we absolutely need to work on I/O merging; with CFQ I'm seeing
>> requests about double the size of those done by mq-deadline.
>> (Bit unfair, I know :-)
>>
>> I'll be having some more data in time for LSF/MM.
> 
> I agree, looking at the performance delta, it's all about merging. It's
> fairly easy to observe with mq-deadline, as merging rates drop
> proportionally to the number of queues configured. But even with 1 queue
> with scsi-mq, we're still seeing lower merging rates than !mq +
> deadline, for instance.
> 
> I'll look at the merging case, it should not be that hard to bring at
> least the single queue case to parity with !mq. I'm actually surprised
> it isn't already.
> 
Thanks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 19+ messages in thread

Thread overview: 19+ messages
2017-01-24 15:54 [PATCH] queue stall with blk-mq-sched Hannes Reinecke
2017-01-24 16:03 ` Jens Axboe
2017-01-24 18:45   ` Hannes Reinecke
2017-01-24 16:09 ` Jens Axboe
2017-01-24 18:49   ` Hannes Reinecke
2017-01-24 19:55     ` Jens Axboe
2017-01-24 22:06       ` Jens Axboe
2017-01-25  7:39         ` Hannes Reinecke
2017-01-25  8:07           ` Hannes Reinecke
2017-01-25 11:10             ` Hannes Reinecke
2017-01-25 15:52               ` Jens Axboe
2017-01-25 16:57                 ` Hannes Reinecke
2017-01-25 17:03                   ` Jens Axboe
2017-01-25 17:42                     ` Jens Axboe
2017-01-25 22:27                       ` Jens Axboe
2017-01-26 16:35                         ` Hannes Reinecke
2017-01-26 16:42                           ` Jens Axboe
2017-01-26 19:20                             ` Jens Axboe
2017-01-27  6:58                             ` Hannes Reinecke
