* commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-04  2:38 UTC
  To: linux-block, Jens Axboe, Ming Lei; +Cc: Damien Le Moal

I noticed that blktests block/004 takes a longer runtime with 5.6-rc4 than
with 5.6-rc3, and found that commit 01e99aeca397 ("blk-mq: insert passthrough
request into hctx->dispatch directly") triggers it.

The longer runtime was observed with a dm-linear device mapping a SATA SMR HDD
connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
connected via a SAS HBA, nor with dm-linear on non-SMR HDDs.

Before the commit, block/004 took around 130 seconds. After the commit, it takes
around 300 seconds. I need to dig into further details to understand why the
commit makes the test case take longer.

The test case block/004 runs a "flush intensive workload". Is this longer runtime
expected?
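
For reference, the workload boils down to each job repeatedly doing a small
synchronous random write followed by a flush. A minimal C sketch of that
per-job pattern (an illustration only, not the actual fio-based test script;
the 4KiB block size and file layout here are assumptions):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096] = { 0 };
	int i, fd = open(argc > 1 ? argv[1] : "testfile",
			 O_WRONLY | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	for (i = 0; i < 256; i++) {
		/* pick a random 4KiB-aligned offset within a small file */
		off_t off = (off_t)(random() % 1024) * sizeof(buf);

		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
			return 1;
		/* each fsync() typically produces a FLUSH request at the device */
		fsync(fd);
	}
	close(fd);
	return 0;
}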

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-04  3:46 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> I noticed that blktests block/004 takes a longer runtime with 5.6-rc4 than
> with 5.6-rc3, and found that commit 01e99aeca397 ("blk-mq: insert passthrough
> request into hctx->dispatch directly") triggers it.
> 
> [...]
> 
> The test case block/004 runs a "flush intensive workload". Is this longer
> runtime expected?

The following patch might address this issue:

https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t

Please test it and let us know the result.

thanks,
Ming



* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-04  6:11 UTC
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mar 04, 2020 / 11:46, Ming Lei wrote:
> On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > [...]
> 
> The following patch might address this issue:
> 
> https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> 
> Please test it and let us know the result.

Hi Ming,

I applied the patch to 5.6-rc4, but I still observed the longer runtime of
block/004: it takes around 300 seconds.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-04  9:53 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > [...]
> 
> Hi Ming,
> 
> I applied the patch to 5.6-rc4, but I still observed the longer runtime of
> block/004: it takes around 300 seconds.

Hello Shinichiro,

block/004 only sends 1564 sync random writes, and 130s already seems slow
enough.

That commit has two effects related to your issue:

1) 'at_head' is applied in blk_mq_sched_insert_request() for the flush
request

2) all IO is added back to the tail of hctx->dispatch after .queue_rq()
returns BLK_STS_RESOURCE

It seems more related to 2), given that you can't reproduce the issue on
SAS.

So please test the following two patches and see which one makes a
difference for you.

BTW, neither of the two is a reasonable final fix; they are just for
narrowing down the issue.

1) patch 1

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 856356b1619e..86137c75283c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	WARN_ON(e && (rq->tag != -1));
 
 	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
-		blk_mq_request_bypass_insert(rq, at_head, false);
+		blk_mq_request_bypass_insert(rq, true, false);
 		goto run;
 	}


2) patch 2
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..447d5cb39832 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			q->mq_ops->commit_rqs(hctx);
 
 		spin_lock(&hctx->lock);
-		list_splice_tail_init(list, &hctx->dispatch);
+		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
 		/*


Thanks,
Ming



* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-05  1:19 UTC
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mar 04, 2020 / 17:53, Ming Lei wrote:
> [...]
> 
> So please test the following two patches and see which one makes a
> difference for you.
> 
> 1) patch 1
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 856356b1619e..86137c75283c 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  	WARN_ON(e && (rq->tag != -1));
>  
>  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> -		blk_mq_request_bypass_insert(rq, at_head, false);
> +		blk_mq_request_bypass_insert(rq, true, false);
>  		goto run;
>  	}

Ming, thank you for the trial patches.
This "patch 1" reduced the runtime back to as short as rc3.

> 
> 
> 2) patch 2
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d92088dec6c3..447d5cb39832 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  			q->mq_ops->commit_rqs(hctx);
>  
>  		spin_lock(&hctx->lock);
> -		list_splice_tail_init(list, &hctx->dispatch);
> +		list_splice_init(list, &hctx->dispatch);
>  		spin_unlock(&hctx->lock);
>  
>  		/*

This patch 2 didn't reduce the runtime.

I hope this report helps.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-05  2:48 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

Hi Shinichiro,

On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > [...]
> 
> Ming, thank you for the trial patches.
> This "patch 1" reduced the runtime back to as short as rc3.
> 
> [...]
> 
> This patch 2 didn't reduce the runtime.
> 
> I hope this report helps.

Your feedback does help. Now please test the following patch:

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 5cc775bdb06a..68957802f96f 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->rq_disk = first_rq->rq_disk;
 	flush_rq->end_io = flush_end_io;
 
-	blk_flush_queue_rq(flush_rq, false);
+	blk_flush_queue_rq(flush_rq, true);
 }
 
 static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..56d61b693f2e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -724,6 +724,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
 	spin_unlock_irq(&q->requeue_lock);
 
 	list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+		bool at_head = !!(rq->rq_flags & RQF_SOFTBARRIER);
+
 		if (!(rq->rq_flags & (RQF_SOFTBARRIER | RQF_DONTPREP)))
 			continue;
 
@@ -735,9 +737,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
 		 * merge.
 		 */
 		if (rq->rq_flags & RQF_DONTPREP)
-			blk_mq_request_bypass_insert(rq, false, false);
+			blk_mq_request_bypass_insert(rq, at_head, false);
 		else
-			blk_mq_sched_insert_request(rq, true, false, false);
+			blk_mq_sched_insert_request(rq, at_head, false, false);
 	}
 
 	while (!list_empty(&rq_list)) {

Thanks,
Ming



* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-06  6:06 UTC
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mar 05, 2020 / 10:48, Ming Lei wrote:
> Hi Shinichiro,
> 
> [...]
> 
> Your feedback does help. Now please test the following patch:

Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed that
it reduces the runtime back to as short as rc3. Good.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-06  8:13 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > [...]
> 
> Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed that
> it reduces the runtime back to as short as rc3. Good.

Hi Shinichiro,

Thanks for your test!

Then I think the following change is what actually makes the difference;
please double check and confirm whether it is.

> @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>  	flush_rq->rq_disk = first_rq->rq_disk;
>  	flush_rq->end_io = flush_end_io;
>  
> -	blk_flush_queue_rq(flush_rq, false);
> +	blk_flush_queue_rq(flush_rq, true);

However, the flush request had been added to the tail of the dispatch queue[1]
for a long time. 0cacba6cf825 ("blk-mq-sched: bypass the scheduler for flushes
entirely") and its predecessors (the initial mq scheduler patches) changed
blk-mq to add the flush request to the front of the dispatch queue, by ignoring
the 'add_queue' parameter of blk_mq_sched_insert_flush(). That change may have
been made by accident, and I am not sure it is correct.

My guess is that once the flush rq is added to the tail of the dispatch queue
in block/004, where lots of FS requests may be sitting in hctx->dispatch
because of the low AHCI queue depth, it may take quite a while for the flush rq
to be submitted to the LLD.
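
For reference, the 'at_head' flag just selects which end of hctx->dispatch the
request lands on; roughly, simplified from the v5.6-era
blk_mq_request_bypass_insert() in block/blk-mq.c:

void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
				  bool run_queue)
{
	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

	spin_lock(&hctx->lock);
	if (at_head)
		/* front: the rq is dispatched before whatever is waiting */
		list_add(&rq->queuelist, &hctx->dispatch);
	else
		/* tail: the rq waits behind every FS request already queued */
		list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}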

I'd suggest root causing/understanding the issue, given that it isn't obviously
correct to queue the flush rq at the front of the dispatch queue. So could you
collect a trace, using the following script, with and without the single-line
patch?

#!/bin/sh
# usage: pass the major and minor numbers of the disk under test,
# e.g. "8 16" for /dev/sdb

MAJ=$1
MIN=$2
# encode dev_t the way the block tracepoints do: (major << 20) | minor
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))

# trace request allocation, issue and completion for that device only
/usr/share/bcc/tools/trace -t -C \
    't:block:block_getrq (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
    't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
    't:block:block_rq_complete (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'

note:
1) install the bcc package
2) pass the test disk's (AHCI/SMR) major and minor numbers to the above script
   (e.g. from 'lsblk -o NAME,MAJ:MIN')
3) start the script before running block/004



[1] 53d63e6b0dfb block: make the flush insertion use the tail of the dispatch list

Thanks,
Ming



* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-07  1:02 UTC
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mar 06, 2020 / 16:13, Ming Lei wrote:
> [...]
> 
> Then I think the following change is what actually makes the difference;
> please double check and confirm whether it is.
> 
> > @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> >  	flush_rq->rq_disk = first_rq->rq_disk;
> >  	flush_rq->end_io = flush_end_io;
> >  
> > -	blk_flush_queue_rq(flush_rq, false);
> > +	blk_flush_queue_rq(flush_rq, true);

Yes, with the one-line change above alone, the runtime was reduced.

> [...]
> 
> I'd suggest root causing/understanding the issue, given that it isn't obviously
> correct to queue the flush rq at the front of the dispatch queue. So could you
> collect a trace, using the following script, with and without the single-line
> patch?

Thank you for your thoughts on the correct design. I have taken the two traces,
with and without the one-liner patch above. The gzipped trace archive is 1.6MB,
which looks too large to post to the list. Please let me know how you want the
trace files shared.

Thanks!

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-07  4:13 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> [...]
> 
> Thank you for your thoughts on the correct design. I have taken the two traces,
> with and without the one-liner patch above. The gzipped trace archive is 1.6MB,
> which looks too large to post to the list. Please let me know how you want the
> trace files shared.

I didn't expect the trace to be so big, given that the IOs should be just
256 * 64(1564).

You may put the log somewhere on the Internet (cloud storage, web, or
whatever), then just provide us the link.

Or, if you can't find a place to hold it, just send it to me and I will put it
up at my RH people web link.


Thanks,
Ming



* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Shinichiro Kawasaki @ 2020-03-09  0:07 UTC
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mar 07, 2020 / 12:13, Ming Lei wrote:
> [...]
> 
> You may put the log somewhere on the Internet (cloud storage, web, or
> whatever), then just provide us the link.
> 
> Or, if you can't find a place to hold it, just send it to me and I will put it
> up at my RH people web link.

I have sent another e-mail to you only, attaching the gzipped log files.
Your confirmation will be appreciated.

-- 
Best Regards,
Shin'ichiro Kawasaki


* Re: commit 01e99aeca397 causes longer runtime of block/004
From: Ming Lei @ 2020-03-09 16:14 UTC
  To: Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe, Damien Le Moal

On Mon, Mar 09, 2020 at 12:07:16AM +0000, Shinichiro Kawasaki wrote:
> On Mar 07, 2020 / 12:13, Ming Lei wrote:
> > On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> > > On Mar 06, 2020 / 16:13, Ming Lei wrote:
> > > > On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > > > > On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > > > > Hi Shinichiro,
> > > > > > 
> > > > > > On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > > > > > > On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > > > On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > > > > > > > > On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > > > > > I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > > > > > > > > > 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > > > > > > > > > request into hctx->dispatch directly") triggers it.
> > > > > > > > > > > 
> > > > > > > > > > > The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > > > > > > > > > connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > > > > > > > > > connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > > > > > > > > > 
> > > > > > > > > > > Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > > > > > > > > > around 300 seconds. I need to dig in further details to understand why the
> > > > > > > > > > > commit makes the test case longer.
> > > > > > > > > > > 
> > > > > > > > > > > The test case block/004 does "flush intensive workload". Is this longer runtime
> > > > > > > > > > > expected?
> > > > > > > > > > 
> > > > > > > > > > The following patch might address this issue:
> > > > > > > > > > 
> > > > > > > > > > https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> > > > > > > > > > 
> > > > > > > > > > Please test and provide us the result.
> > > > > > > > > > 
> > > > > > > > > > thanks,
> > > > > > > > > > Ming
> > > > > > > > > >
> > > > > > > > > 
> > > > > > > > > Hi Ming,
> > > > > > > > > 
> > > > > > > > > I applied the patch to 5.6-rc4 but I observed the longer runtime of block/004.
> > > > > > > > > Still it takes around 300 seconds.
> > > > > > > > 
> > > > > > > > Hello Shinichiro,
> > > > > > > > 
> > > > > > > > block/004 only sends 1564 sync randwrite, and seems 130s has been slow
> > > > > > > > enough.
> > > > > > > > 
> > > > > > > > There are two related effect in that commit for your issue:
> > > > > > > > 
> > > > > > > > 1) 'at_head' is applied in blk_mq_sched_insert_request() for flush
> > > > > > > > request
> > > > > > > > 
> > > > > > > > 2) all IO is added back to tail of hctx->dispatch after .queue_rq()
> > > > > > > > returns STS_RESOURCE
> > > > > > > > 
> > > > > > > > Seems it is more related with 2) given you can't reproduce the issue on 
> > > > > > > > SAS.
> > > > > > > > 
> > > > > > > > So please test the following two patches, and see which one makes a
> > > > > > > > difference for you.
> > > > > > > > 
> > > > > > > > BTW, both two looks not reasonable, just for narrowing down the issue.
> > > > > > > > 
> > > > > > > > 1) patch 1
> > > > > > > > 
> > > > > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > > > > index 856356b1619e..86137c75283c 100644
> > > > > > > > --- a/block/blk-mq-sched.c
> > > > > > > > +++ b/block/blk-mq-sched.c
> > > > > > > > @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > > > > > >  	WARN_ON(e && (rq->tag != -1));
> > > > > > > >  
> > > > > > > >  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > > > > > > -		blk_mq_request_bypass_insert(rq, at_head, false);
> > > > > > > > +		blk_mq_request_bypass_insert(rq, true, false);
> > > > > > > >  		goto run;
> > > > > > > >  	}
> > > > > > > 
> > > > > > > Ming, thank you for the trial patches.
> > > > > > > This "patch 1" reduced the runtime, as short as rc3.
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 2) patch 2
> > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > > index d92088dec6c3..447d5cb39832 100644
> > > > > > > > --- a/block/blk-mq.c
> > > > > > > > +++ b/block/blk-mq.c
> > > > > > > > @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > > > > > >  			q->mq_ops->commit_rqs(hctx);
> > > > > > > >  
> > > > > > > >  		spin_lock(&hctx->lock);
> > > > > > > > -		list_splice_tail_init(list, &hctx->dispatch);
> > > > > > > > +		list_splice_init(list, &hctx->dispatch);
> > > > > > > >  		spin_unlock(&hctx->lock);
> > > > > > > >  
> > > > > > > >  		/*
> > > > > > > 
> > > > > > > This patch 2 didn't reduce the runtime.
> > > > > > > 
> > > > > > > Wish this report helps.
> > > > > > 
> > > > > > Your feedback does help, then please test the following patch:
> > > > > 
> > > > > Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed
> > > > > it reduces the runtime as short as rc3. Good.
> > > > 
> > > > Hi Shinichiro,
> > > > 
> > > > Thanks for your test!
> > > > 
> > > > Then I think the following change should make the difference actually,
> > > > you may double check that and confirm if it is that.
> > > > 
> > > > > @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> > > > >  	flush_rq->rq_disk = first_rq->rq_disk;
> > > > >  	flush_rq->end_io = flush_end_io;
> > > > >  
> > > > > -	blk_flush_queue_rq(flush_rq, false);
> > > > > +	blk_flush_queue_rq(flush_rq, true);
> > > 
> > > Yes, with only the one-line change above, the runtime was reduced.
> > > 
> > > > 
> > > > However, the flush request had been added to the tail of the dispatch
> > > > queue[1] for a long time. 0cacba6cf825 ("blk-mq-sched: bypass the scheduler
> > > > for flushes entirely") and its predecessors (the initial mq scheduler
> > > > patches) changed blk-mq to add the flush request to the front of the
> > > > dispatch queue by ignoring the 'add_queue' parameter of
> > > > blk_mq_sched_insert_flush(). That change may have been by accident, and I am
> > > > not sure it is correct.
> > > > 
> > > > I guess that once the flush rq is added to the tail of the dispatch queue in
> > > > block/004, where lots of FS requests may stay in hctx->dispatch because of
> > > > the low AHCI queue depth, it may take a while for the flush rq to be
> > > > submitted to the LLD.
> > > > 
> > > > I'd suggest root-causing/understanding the issue, given it isn't obviously
> > > > correct to queue the flush rq at the front of the dispatch queue. Could you
> > > > collect the following trace via the following script, with and without the
> > > > single-line patch?
> > > 
> > > Thank you for the thoughts on the correct design. I have taken the two traces,
> > > with and without the one-liner patch above. The gzip-archived trace files are
> > > 1.6MB in size, which looks too large to post to the list. Please let me know
> > > how you want the trace files shared.
> > 
> > I didn't think the trace could be so big, given the IOs should be just
> > 256 * 64 (1564).
> > 
> > You may put the log somewhere on the Internet (cloud storage, web, or
> > whatever), then just provide us the link.
> > 
> > Or if you can't find a place to host it, just send it to me, and I will put
> > it on my RH people web page.
> 
> I have sent another e-mail only to you, attaching the gzipped log files.
> Your confirmation will be appreciated.

Yeah, I got the log, and it has been put at the following link:

http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz

Also, I have run some analysis: block/004 basically calls pwrite() &
fsync() in each job.

1) v5.6-rc kernel
- write rq average latency: 0.091s
- flush rq average latency: 0.018s
- average number of issues per write request: 1.978  // how many times trace_block_rq_issue() is called for one rq
- average number of issues per flush request: 1.052

2) v5.6-rc patched kernel
- write rq average latency: 0.031s
- flush rq average latency: 0.035s
- average number of issues per write request: 1.466
- average number of issues per flush request: 1.610


block/004 starts 64 jobs, and AHCI's queue can become saturated easily;
BLK_MQ_S_SCHED_RESTART should then be set, so a write request in the dispatch
queue can only move on after another request completes.

It looks like the flush request takes a shorter time than normal write IO.
If the flush request is added to the front of the dispatch queue, the next
normal write IO can be queued to the LLD quicker than if the flush request is
added to the tail of the dispatch queue.
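
For reference, the head-vs-tail choice boils down to a single list operation
in blk_mq_request_bypass_insert(); a minimal sketch of the v5.6 code (trimmed,
comments mine, not a verbatim copy):

void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
				  bool run_queue)
{
	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

	spin_lock(&hctx->lock);
	if (at_head)
		/* jump ahead of every request parked in ->dispatch */
		list_add(&rq->queuelist, &hctx->dispatch);
	else
		/* wait behind all previously requeued requests */
		list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}

With 64 jobs parking write requests in ->dispatch, the tail insertion makes
the flush rq wait behind all of them before it reaches the LLD.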

trace_block_rq_issue() being called more than once is because of AHCI or the
drive's implementation. It usually means that host->hostt->queuecommand()
fails several times when queuing one single request. For AHCI, I understand it
shouldn't fail normally, given we guarantee that the actual queue depth is <=
both the LUN's and the host's queue depth. Maybe it depends on your SMR
drive's handling of flush/write IO. Could you check why .queuecommand() fails
in block/004?
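
The repeated issues show up because a failed ->queuecommand() ends up
requeuing the request, and each re-dispatch fires trace_block_rq_issue()
again; a sketch of the relevant part of scsi_queue_rq() (v5.6, trimmed,
comments mine):

	reason = scsi_dispatch_cmd(cmd);	/* calls host->hostt->queuecommand() */
	if (reason) {
		/* e.g. SCSI_MLQUEUE_DEVICE_BUSY from the LLD */
		scsi_set_blocked(cmd, reason);
		ret = BLK_STS_RESOURCE;		/* request goes back to the
						 * dispatch list and is
						 * issued again later */
		goto out_dec_host_busy;
	}

	return BLK_STS_OK;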

Also, can you provide the queue flags via the following command?

	cat /sys/kernel/debug/block/sdN/state

I understand a flush request should be slower than a normal write, but that
looks not to be true on this hardware.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-09 16:14                     ` Ming Lei
@ 2020-03-10  3:07                       ` Damien Le Moal
  2020-03-10  5:54                         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2020-03-10  3:07 UTC (permalink / raw)
  To: Ming Lei, Shinichiro Kawasaki; +Cc: linux-block, Jens Axboe

On 2020/03/10 1:14, Ming Lei wrote:
> [...]
> 
> trace_block_rq_issue() being called more than once is because of AHCI or the
> drive's implementation. It usually means that host->hostt->queuecommand()
> fails several times when queuing one single request. For AHCI, I understand
> it shouldn't fail normally, given we guarantee that the actual queue depth is
> <= both the LUN's and the host's queue depth. Maybe it depends on your SMR
> drive's handling of flush/write IO. Could you check why .queuecommand() fails
> in block/004?

Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
the AHCI code besides disk probe checks, so write & flush handling does not
differ between SMR and regular disks. The same applies to the drive side:
write and flush commands are normal commands, no change at all. The only
difference is the sequential write constraint, which the drive honors by not
executing queued write commands out of order. But there is no constraint on
flush for SMR, so we get whatever the drive does, which is totally drive
dependent.

I wonder if the issue may be with the particular AHCI chipset used for this test.

> 
> Also, can you provide the queue flags via the following command?
> 
> 	cat /sys/kernel/debug/block/sdN/state
> 
> I understand a flush request should be slower than a normal write, but that
> looks not to be true on this hardware.

That is probably because, since the writes are sequential, a lot of
drive-internal optimization can be done to minimize the cost of flush
(internal data streaming from cache to media, media-cache use, etc.). That is
true for regular disks too. And all of this is highly dependent on the
hardware implementation.

Thank you for your help!

Shinichiro,

It may be worth trying the same run with & without Ming's patch on a machine
with a different chipset...



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10  3:07                       ` Damien Le Moal
@ 2020-03-10  5:54                         ` Shinichiro Kawasaki
  2020-03-10  6:00                           ` Damien Le Moal
  2020-03-10  8:07                           ` Ming Lei
  0 siblings, 2 replies; 21+ messages in thread
From: Shinichiro Kawasaki @ 2020-03-10  5:54 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Ming Lei, linux-block, Jens Axboe

Ming, thank you for sharing the log files and analysis.

On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> On 2020/03/10 1:14, Ming Lei wrote:
> > [...]
> > 
> > trace_block_rq_issue() being called more than once is because of AHCI or the
> > drive's implementation. It usually means that host->hostt->queuecommand()
> > fails several times when queuing one single request. For AHCI, I understand
> > it shouldn't fail normally, given we guarantee that the actual queue depth is
> > <= both the LUN's and the host's queue depth. Maybe it depends on your SMR
> > drive's handling of flush/write IO. Could you check why .queuecommand() fails
> > in block/004?

I put some debug prints and confirmed that the .queuecommand function is
ata_scsi_queuecmd(), and that it returns SCSI_MLQUEUE_DEVICE_BUSY because
ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
upper layer only knows the queue depth, we are responsible for maintaining
exclusion.  This function checks whether a new command @qc can be issued." So I
guess .queuecommand() fails because the non-NCQ flush command and the NCQ
write commands are waiting for each other's completion.
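
For reference, ata_std_qc_defer() itself is short; a sketch from the v5.6
libata source (trimmed, comments mine):

int ata_std_qc_defer(struct ata_queued_cmd *qc)
{
	struct ata_link *link = qc->dev->link;

	if (ata_is_ncq(qc->tf.protocol)) {
		/* an NCQ write may be issued as long as no non-NCQ command
		 * (such as flush) is active on the link */
		if (!ata_tag_valid(link->active_tag))
			return 0;
	} else {
		/* a non-NCQ command (such as FLUSH CACHE) needs the link
		 * fully idle: no active non-NCQ command and no outstanding
		 * NCQ tags */
		if (!ata_tag_valid(link->active_tag) && !link->sactive)
			return 0;
	}

	return ATA_DEFER_LINK;
}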

> 
> [...]
> 
> > 
> > Also, can you provide the queue flags via the following command?
> > 
> > 	cat /sys/kernel/debug/block/sdN/state

The state debugfs attribute was as follows:

SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26

It didn't change before and after the block/004 run.


> > 
> > I understand a flush request should be slower than a normal write, but that
> > looks not to be true on this hardware.
> 
> That is probably because, since the writes are sequential, a lot of
> drive-internal optimization can be done to minimize the cost of flush
> (internal data streaming from cache to media, media-cache use, etc.). That is
> true for regular disks too. And all of this is highly dependent on the
> hardware implementation.

This discussion tempted me to take a closer look at the traces, and I noticed
that the number of flush commands issued differs with and without the patch.

                        | without patch | with patch
------------------------+---------------+------------
block_getrq: rwbs=FWS   |      32640    |   32640
block_rq_issue: rwbs=FF |      32101    |    7593

Without the patch, a flush command is issued between every two write commands.
With the patch, some write commands are executed without a flush between them.

I wonder how the position of the flush command in the requeue list (head vs
tail) changes the number of flush commands issued.

Another weird thing is the number of block_getrq traces for flush (rwbs=FWS):
it is double the number of writes (256 * 64 = 16320). I will chase this further.


> Shinichiro,
> 
> It may be worth trying the same run with & without Ming's patch on a machine
> with a different chipset...

Thanks for the advice. So far, I have observed the long block/004 runtime on
two systems: one with an Intel C220 Series SATA controller, and the other with
an Intel 200 Series PCH SATA controller. I will try to find another SATA
controller.

-- 
Best Regards,
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10  5:54                         ` Shinichiro Kawasaki
@ 2020-03-10  6:00                           ` Damien Le Moal
  2020-03-10  8:07                           ` Ming Lei
  1 sibling, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2020-03-10  6:00 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Ming Lei, linux-block, Jens Axboe

On 2020/03/10 14:54, Shinichiro Kawasaki wrote:
[...]
> 
> I put some debug prints and confirmed that the .queuecommand function is
> ata_scsi_queuecmd(), and that it returns SCSI_MLQUEUE_DEVICE_BUSY because
> ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> upper layer only knows the queue depth, we are responsible for maintaining
> exclusion.  This function checks whether a new command @qc can be issued." So I
> guess .queuecommand() fails because the non-NCQ flush command and the NCQ
> write commands are waiting for each other's completion.
> 
> [...]
> 
>> Shinichiro,
>>
>> It may be worth trying the same run with & without Ming's patch on a machine
>> with a different chipset...
> 
> Thanks for the advice. So far, I have observed the long block/004 runtime on
> two systems: one with an Intel C220 Series SATA controller, and the other
> with an Intel 200 Series PCH SATA controller. I will try to find another
> SATA controller.

I do not think it is necessary. I forgot that flush cache is a non-NCQ command.
That is why queuecommand fails, as you found out: it is waiting for all NCQ
commands to complete first. It has nothing to do with the chipset type. This is
per the ATA specs.
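
For reference, libata translates SCSI SYNCHRONIZE CACHE into a plain non-NCQ
taskfile; a sketch of ata_scsi_flush_xlat() from the v5.6 source (trimmed,
extra comments mine):

static unsigned int ata_scsi_flush_xlat(struct ata_queued_cmd *qc)
{
	struct ata_taskfile *tf = &qc->tf;

	tf->flags |= ATA_TFLAG_DEVICE;
	tf->protocol = ATA_PROT_NODATA;		/* non-NCQ, non-data protocol */

	if (qc->dev->flags & ATA_DFLAG_FLUSH_EXT)
		tf->command = ATA_CMD_FLUSH_EXT;
	else
		tf->command = ATA_CMD_FLUSH;

	/* flush is critical for IO integrity, consider it an IO command */
	qc->flags |= ATA_QCFLAG_IO;

	return 0;
}

So on any AHCI chipset, every flush drains the whole NCQ queue before it can
be issued, and all queued writes then wait for the flush to complete.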


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10  5:54                         ` Shinichiro Kawasaki
  2020-03-10  6:00                           ` Damien Le Moal
@ 2020-03-10  8:07                           ` Ming Lei
  2020-03-10 11:07                             ` Shinichiro Kawasaki
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2020-03-10  8:07 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Tue, Mar 10, 2020 at 05:54:18AM +0000, Shinichiro Kawasaki wrote:
> Ming, thank you for sharing the log files and analysis.
> 
> On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> > [...]
> 
> I put some debug prints and confirmed that the .queuecommand function is
> ata_scsi_queuecmd(), and that it returns SCSI_MLQUEUE_DEVICE_BUSY because
> ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> upper layer only knows the queue depth, we are responsible for maintaining
> exclusion.  This function checks whether a new command @qc can be issued." So I
> guess .queuecommand() fails because the non-NCQ flush command and the NCQ
> write commands are waiting for each other's completion.

OK, got it.

> 
> [...]
> 
> This discussion tempted me to take a closer look at the traces, and I noticed
> that the number of flush commands issued differs with and without the patch.
> 
>                         | without patch | with patch
> ------------------------+---------------+------------
> block_getrq: rwbs=FWS   |      32640    |   32640
> block_rq_issue: rwbs=FF |      32101    |    7593

That looks like too many issued flush requests, given that the flush machinery
should avoid queuing duplicate flush requests.

I will investigate the flush code a bit.
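
The de-duplication I have in mind is the guard at the top of blk_kick_flush();
a sketch from the v5.6 source (trimmed, comments mine):

static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
			   unsigned int flags)
{
	struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];

	/* C1: only one flush may be in flight at a time; flushes arriving
	 * meanwhile wait on the pending list and are completed by that
	 * single flush instead of each being issued to the device */
	if (fq->flush_pending_idx != fq->flush_running_idx ||
	    list_empty(pending))
		return;

	/* ... set up fq->flush_rq and queue it ... */
}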

> 
> [...]
> 
> Another weird thing is the number of block_getrq traces for flush (rwbs=FWS):
> it is double the number of writes (256 * 64 = 16320). I will chase this further.

Indeed, I do not see such an issue when I run the test on KVM's AHCI.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10  8:07                           ` Ming Lei
@ 2020-03-10 11:07                             ` Shinichiro Kawasaki
  2020-03-10 13:37                               ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Shinichiro Kawasaki @ 2020-03-10 11:07 UTC (permalink / raw)
  To: Ming Lei; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Mar 10, 2020 / 16:07, Ming Lei wrote:
> On Tue, Mar 10, 2020 at 05:54:18AM +0000, Shinichiro Kawasaki wrote:
> > Ming, thank you for sharing the log files and analysis.
> > 
> > On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> > > On 2020/03/10 1:14, Ming Lei wrote:
> > > > On Mon, Mar 09, 2020 at 12:07:16AM +0000, Shinichiro Kawasaki wrote:
> > > >> On Mar 07, 2020 / 12:13, Ming Lei wrote:
> > > >>> On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> > > >>>> On Mar 06, 2020 / 16:13, Ming Lei wrote:
> > > >>>>> On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>> On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > >>>>>>> Hi Shinichiro,
> > > >>>>>>>
> > > >>>>>>> On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>> On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > >>>>>>>>> On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>>>> On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > >>>>>>>>>>> On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>>>>>> I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > >>>>>>>>>>>> 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > >>>>>>>>>>>> request into hctx->dispatch directly") triggers it.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > >>>>>>>>>>>> connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > >>>>>>>>>>>> connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > >>>>>>>>>>>> around 300 seconds. I need to dig in further details to understand why the
> > > >>>>>>>>>>>> commit makes the test case longer.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The test case block/004 does "flush intensive workload". Is this longer runtime
> > > >>>>>>>>>>>> expected?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> The following patch might address this issue:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Please test and provide us the result.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> thanks,
> > > >>>>>>>>>>> Ming
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hi Ming,
> > > >>>>>>>>>>
> > > >>>>>>>>>> I applied the patch to 5.6-rc4 but I observed the longer runtime of block/004.
> > > >>>>>>>>>> Still it takes around 300 seconds.
> > > >>>>>>>>>
> > > >>>>>>>>> Hello Shinichiro,
> > > >>>>>>>>>
> > > >>>>>>>>> block/004 only sends 1564 sync randwrite, and seems 130s has been slow
> > > >>>>>>>>> enough.
> > > >>>>>>>>>
> > > >>>>>>>>> There are two related effect in that commit for your issue:
> > > >>>>>>>>>
> > > >>>>>>>>> 1) 'at_head' is applied in blk_mq_sched_insert_request() for flush
> > > >>>>>>>>> request
> > > >>>>>>>>>
> > > >>>>>>>>> 2) all IO is added back to tail of hctx->dispatch after .queue_rq()
> > > >>>>>>>>> returns STS_RESOURCE
> > > >>>>>>>>>
> > > >>>>>>>>> Seems it is more related with 2) given you can't reproduce the issue on 
> > > >>>>>>>>> SAS.
> > > >>>>>>>>>
> > > >>>>>>>>> So please test the following two patches, and see which one makes a
> > > >>>>>>>>> difference for you.
> > > >>>>>>>>>
> > > >>>>>>>>> BTW, both two looks not reasonable, just for narrowing down the issue.
> > > >>>>>>>>>
> > > >>>>>>>>> 1) patch 1
> > > >>>>>>>>>
> > > >>>>>>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > >>>>>>>>> index 856356b1619e..86137c75283c 100644
> > > >>>>>>>>> --- a/block/blk-mq-sched.c
> > > >>>>>>>>> +++ b/block/blk-mq-sched.c
> > > >>>>>>>>> @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > >>>>>>>>>  	WARN_ON(e && (rq->tag != -1));
> > > >>>>>>>>>  
> > > >>>>>>>>>  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > >>>>>>>>> -		blk_mq_request_bypass_insert(rq, at_head, false);
> > > >>>>>>>>> +		blk_mq_request_bypass_insert(rq, true, false);
> > > >>>>>>>>>  		goto run;
> > > >>>>>>>>>  	}
> > > >>>>>>>>
> > > >>>>>>>> Ming, thank you for the trial patches.
> > > >>>>>>>> This "patch 1" reduced the runtime, as short as rc3.
> > > >>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> 2) patch 2
> > > >>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > >>>>>>>>> index d92088dec6c3..447d5cb39832 100644
> > > >>>>>>>>> --- a/block/blk-mq.c
> > > >>>>>>>>> +++ b/block/blk-mq.c
> > > >>>>>>>>> @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > >>>>>>>>>  			q->mq_ops->commit_rqs(hctx);
> > > >>>>>>>>>  
> > > >>>>>>>>>  		spin_lock(&hctx->lock);
> > > >>>>>>>>> -		list_splice_tail_init(list, &hctx->dispatch);
> > > >>>>>>>>> +		list_splice_init(list, &hctx->dispatch);
> > > >>>>>>>>>  		spin_unlock(&hctx->lock);
> > > >>>>>>>>>  
> > > >>>>>>>>>  		/*
> > > >>>>>>>>
> > > >>>>>>>> This patch 2 didn't reduce the runtime.
> > > >>>>>>>>
> > > >>>>>>>> Wish this report helps.
> > > >>>>>>>
> > > >>>>>>> Your feedback does help, then please test the following patch:
> > > >>>>>>
> > > >>>>>> Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed
> > > >>>>>> it reduces the runtime as short as rc3. Good.
> > > >>>>>
> > > >>>>> Hi Shinichiro,
> > > >>>>>
> > > >>>>> Thanks for your test!
> > > >>>>>
> > > >>>>> Then I think the following change should make the difference actually,
> > > >>>>> you may double check that and confirm if it is that.
> > > >>>>>
> > > >>>>>> @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> > > >>>>>>  	flush_rq->rq_disk = first_rq->rq_disk;
> > > >>>>>>  	flush_rq->end_io = flush_end_io;
> > > >>>>>>  
> > > >>>>>> -	blk_flush_queue_rq(flush_rq, false);
> > > >>>>>> +	blk_flush_queue_rq(flush_rq, true);
> > > >>>>
> > > >>>> Yes, with this the one line change above only, the runtime was reduced.
> > > >>>>
> > > >>>>>
> > > >>>>> However, the flush request is added to tail of dispatch queue[1] for long time.
> > > >>>>> 0cacba6cf825 ("blk-mq-sched: bypass the scheduler for flushes entirely")
> > > >>>>> and its predecessor(all mq scheduler start patches) changed to add flush request
> > > >>>>> to front of dispatch queue for blk-mq by ignoring 'add_queue' parameter of
> > > >>>>> blk_mq_sched_insert_flush(). That change may be by accident, and not sure it is
> > > >>>>> correct.
> > > >>>>>
> > > >>>>> I guess once flush rq is added to tail of dispatch queue in block/004,
> > > >>>>> in which lots of FS request may stay in hctx->dispatch because of low
> > > >>>>> AHCI queue depth, then we may take a bit long for flush rq to be
> > > >>>>> submitted to LLD.
> > > >>>>>
> > > >>>>> I'd suggest to root cause/understand the issue given it isn't obvious
> > > >>>>> correct to queue flush rq at front of dispatch queue, so could you collect
> > > >>>>> the following trace via the following script with/without the single line
> > > >>>>> patch?
> > > >>>>
> > > >>>> Thank you for the thoughts for the correct design. I have taken the two traces,
> > > >>>> with and without the one liner patch above. The gzip archived trace files have
> > > >>>> 1.6MB size. It looks too large to post to the list. Please let me know how you
> > > >>>> want the trace files shared.
> > > >>>
> > > >>> I didn't thought the trace can be so big given the ios should be just
> > > >>> 256 * 64(1564).
> > > >>>
> > > >>> You may put the log somewhere in Internet, cloud storage, web, or
> > > >>> whatever. Then just provides us the link.
> > > >>>
> > > >>> Or if you can't find a place to hold it, just send to me, and I will put
> > > >>> it in my RH people web link.
> > > >>
> > > >> I have sent another e-mail only to you attaching the log files gziped.
> > > >> Your confirmation will be appreciated.
> > > > 
> > > > Yeah, I got the log, and it has been put in the following link:
> > > > 
> > > > http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
> > > > 
> > > > Also I have run some analysis, and block/004 basically call pwrite() &
> > > > fsync() in each job.
> > > > 
> > > > 1) v5.6-rc kernel
> > > > - write rq average latency: 0.091s 
> > > > - flush rq average latency: 0.018s
> > > > - average issue times of write request: 1.978  //how many trace_block_rq_issue() is called for one rq
> > > > - average issue times of flush request: 1.052
> > > > 
> > > > 2) v5.6-rc patched kernel
> > > > - write rq average latency: 0.031
> > > > - flush rq average latency: 0.035
> > > > - average issue times of write request: 1.466
> > > > - average issue times of flush request: 1.610
> > > > 
> > > > 
> > > > block/004 starts 64 jobs and AHCI's queue can become saturated easily,
> > > > then BLK_MQ_S_SCHED_RESTART should be set, so write request in dispatch
> > > > queue can only move on after one request is completed.
> > > > 
> > > > Looks the flush request takes shorter time than normal write IO.
> > > > If flush request is added to front of dispatch queue, the next normal
> > > > write IO could be queued to lld quicker than adding to tail of dispatch
> > > > queue.
> > > > trace_block_rq_issue() is called more than one time is because of
> > > > AHCI or the drive's implementation. It usually means that
> > > > host->hostt->queuecommand() fails for several times when queuing one
> > > > single request. For AHCI, I understand it shouldn't fail normally given
> > > > we guarantee that the actual queue depth is <= either LUN or host's
> > > > queue depth. Maybe it depends on your SMR's implementation about handling
> > > > flush/write IO. Could you check why .queuecommand() fails in block/004?
> > 
> > I put some debug prints and confirmed that the .queuecommand function is
> > ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> > ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> > notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> > upper layer only knows the queue depth, we are responsible for maintaining
> > exclusion.  This function checks whether a new command @qc can be issued." Then
> > I guess .queuecommand() fails because is that Non-NCQ flush command and NCQ
> > write command are waiting the completion each other.
> 
> OK, got it.
> 
> > 
> > > 
> > > Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
> > > the AHCI code besides disk probe checks. So write & flush handling does not
> > > differ between SMR and regular disks. The same applies to the drive side: write
> > > and flush commands are the normal commands, no change at all. The only
> > > difference is the sequential write constraint, which the drive honors by not
> > > executing the queued write commands out of order. But there are no constraints for
> > > flush on SMR, so we get whatever the drive does, that is, totally drive dependent.
> > > 
> > > I wonder if the issue may be with the particular AHCI chipset used for this test.
> > > 
> > > > 
> > > > Also can you provide queue flags via the following command?
> > > > 
> > > > 	cat /sys/kernel/debug/block/sdN/state
> > 
> > The state sysfs attribute was as follows:
> > 
> > SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> > 
> > It didn't change before and after the block/004 run.
> > 
> > 
> > > > 
> > > > I understand a flush request should be slower than a normal write, however
> > > > that looks not to be true on this hardware.
> > > 
> > > Probably due to the fact that, since the writes are sequential, there is a lot of
> > > drive-internal optimization that can be done to minimize the cost of flush
> > > (internal data streaming from cache to media, media-cache use, etc.). That is true
> > > for regular disks too. And all of this is highly dependent on the hardware
> > > implementation.
> > 
> > This discussion tempted me to take a closer look at the traces. And I noticed that
> > the number of flush commands issued is different with and without the patch.
> > 
> >                         | without patch | with patch
> > ------------------------+---------------+------------
> > block_getrq: rwbs=FWS   |      32640    |   32640
> > block_rq_issue: rwbs=FF |      32101    |    7593
> 
> Looks like too many flush requests are issued, given the flush machinery
> should avoid queueing duplicated flush requests.
> 
> I will investigate the flush code a bit.
> 
> > 
> > Without the patch, a flush command is issued between two write commands. With the
> > patch, some write commands are executed without a flush between them.
> > 
> > I wonder how the requeue list position of the flush command (head vs tail) changes
> > the number of flush commands to issue.
> > 
> > Another weird thing is the number of block_getrq traces of flush (rwbs=FWS). It
> > doubles the number of writes (256 * 64 = 16320). I will chase this further.
> 
> Indeed, I do not see such an issue when I run the test on kvm ahci.

I found that the cause of the doubled flush commands is the dm-linear device I
set up. I mapped two areas of a single SMR drive to the dm-linear device, using
a dm table with two lines. My understanding is that when a flush is requested
to the dm-linear device, the flush request is duplicated and forwarded to the
destination devices listed in the dm table lines. In this case, a flush request
is duplicated for the two lines in the dm table, and both flush requests go to
the single SMR drive. This doubles the number of flush commands. I changed the
dm table to have a single line mapping a single area, and observed that the
number of flush commands (block_getrq with rwbs=FWS) was reduced to 16320, the
same number as the write commands.

This means that the workload with which I observed the longer block/004 runtime
on kernel 5.6-rc4 is really flush intensive: it issues two flush commands per
write command. It might be too sensitive to flush command behavior, so I'm not
sure this workload is worth optimizing for.
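
For reference, the dm table looked roughly like this (the device name and the
sizes here are made up for illustration, not the exact values I used):

	# two-line dm table: both linear targets map areas of the same SMR drive
	# <logical start> <num sectors> linear <backing dev> <backing offset>
	0        52428800 linear /dev/sdb 0
	52428800 52428800 linear /dev/sdb 52428800

Since each line is a separate dm target, device-mapper clones one flush per
target when a flush reaches the dm device, so both clones end up on /dev/sdb.
With a single-line table there is only one target, hence one flush.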

-- 
Best Regards,
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10 11:07                             ` Shinichiro Kawasaki
@ 2020-03-10 13:37                               ` Ming Lei
  2020-03-10 14:37                                 ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2020-03-10 13:37 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Tue, Mar 10, 2020 at 11:07:04AM +0000, Shinichiro Kawasaki wrote:
> On Mar 10, 2020 / 16:07, Ming Lei wrote:
> > On Tue, Mar 10, 2020 at 05:54:18AM +0000, Shinichiro Kawasaki wrote:
> > > Ming, thank you for sharing the log files and analysis.
> > > 
> > > On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> > > > On 2020/03/10 1:14, Ming Lei wrote:
> > > > > On Mon, Mar 09, 2020 at 12:07:16AM +0000, Shinichiro Kawasaki wrote:
> > > > >> On Mar 07, 2020 / 12:13, Ming Lei wrote:
> > > > >>> On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> > > > >>>> On Mar 06, 2020 / 16:13, Ming Lei wrote:
> > > > >>>>> On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > > > >>>>>> On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > > >>>>>>> Hi Shinichiro,
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > > >>>>>>>> On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > > >>>>>>>>> On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > > >>>>>>>>>> On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > > >>>>>>>>>>> On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > > >>>>>>>>>>>> I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > > >>>>>>>>>>>> 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > > >>>>>>>>>>>> request into hctx->dispatch directly") triggers it.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > > >>>>>>>>>>>> connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > > >>>>>>>>>>>> connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > > >>>>>>>>>>>> around 300 seconds. I need to dig in further details to understand why the
> > > > >>>>>>>>>>>> commit makes the test case longer.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> The test case block/004 does "flush intensive workload". Is this longer runtime
> > > > >>>>>>>>>>>> expected?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> The following patch might address this issue:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Please test and provide us the result.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> thanks,
> > > > >>>>>>>>>>> Ming
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Hi Ming,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I applied the patch to 5.6-rc4 but I observed the longer runtime of block/004.
> > > > >>>>>>>>>> Still it takes around 300 seconds.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Hello Shinichiro,
> > > > >>>>>>>>>
> > > > >>>>>>>>> block/004 only sends 1564 sync randwrites, and it seems 130s is already slow
> > > > >>>>>>>>> enough.
> > > > >>>>>>>>>
> > > > >>>>>>>>> There are two related effects in that commit for your issue:
> > > > >>>>>>>>>
> > > > >>>>>>>>> 1) 'at_head' is applied in blk_mq_sched_insert_request() for the flush
> > > > >>>>>>>>> request
> > > > >>>>>>>>>
> > > > >>>>>>>>> 2) all IO is added back to the tail of hctx->dispatch after .queue_rq()
> > > > >>>>>>>>> returns STS_RESOURCE
> > > > >>>>>>>>>
> > > > >>>>>>>>> Seems it is more related with 2) given you can't reproduce the issue on
> > > > >>>>>>>>> SAS.
> > > > >>>>>>>>>
> > > > >>>>>>>>> So please test the following two patches, and see which one makes a
> > > > >>>>>>>>> difference for you.
> > > > >>>>>>>>>
> > > > >>>>>>>>> BTW, neither of the two looks reasonable; they are just for narrowing down the issue.
> > > > >>>>>>>>>
> > > > >>>>>>>>> 1) patch 1
> > > > >>>>>>>>>
> > > > >>>>>>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > >>>>>>>>> index 856356b1619e..86137c75283c 100644
> > > > >>>>>>>>> --- a/block/blk-mq-sched.c
> > > > >>>>>>>>> +++ b/block/blk-mq-sched.c
> > > > >>>>>>>>> @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > > >>>>>>>>>  	WARN_ON(e && (rq->tag != -1));
> > > > >>>>>>>>>  
> > > > >>>>>>>>>  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > > >>>>>>>>> -		blk_mq_request_bypass_insert(rq, at_head, false);
> > > > >>>>>>>>> +		blk_mq_request_bypass_insert(rq, true, false);
> > > > >>>>>>>>>  		goto run;
> > > > >>>>>>>>>  	}
> > > > >>>>>>>>
> > > > >>>>>>>> Ming, thank you for the trial patches.
> > > > >>>>>>>> This "patch 1" reduced the runtime, as short as rc3.
> > > > >>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> 2) patch 2
> > > > >>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > >>>>>>>>> index d92088dec6c3..447d5cb39832 100644
> > > > >>>>>>>>> --- a/block/blk-mq.c
> > > > >>>>>>>>> +++ b/block/blk-mq.c
> > > > >>>>>>>>> @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > > >>>>>>>>>  			q->mq_ops->commit_rqs(hctx);
> > > > >>>>>>>>>  
> > > > >>>>>>>>>  		spin_lock(&hctx->lock);
> > > > >>>>>>>>> -		list_splice_tail_init(list, &hctx->dispatch);
> > > > >>>>>>>>> +		list_splice_init(list, &hctx->dispatch);
> > > > >>>>>>>>>  		spin_unlock(&hctx->lock);
> > > > >>>>>>>>>  
> > > > >>>>>>>>>  		/*
> > > > >>>>>>>>
> > > > >>>>>>>> This patch 2 didn't reduce the runtime.
> > > > >>>>>>>>
> > > > >>>>>>>> Wish this report helps.
> > > > >>>>>>>
> > > > >>>>>>> Your feedback does help, then please test the following patch:
> > > > >>>>>>
> > > > >>>>>> Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed
> > > > >>>>>> it reduces the runtime as short as rc3. Good.
> > > > >>>>>
> > > > >>>>> Hi Shinichiro,
> > > > >>>>>
> > > > >>>>> Thanks for your test!
> > > > >>>>>
> > > > >>>>> Then I think the following change should be what actually makes the difference;
> > > > >>>>> you may double check and confirm whether that is it.
> > > > >>>>>
> > > > >>>>>> @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> > > > >>>>>>  	flush_rq->rq_disk = first_rq->rq_disk;
> > > > >>>>>>  	flush_rq->end_io = flush_end_io;
> > > > >>>>>>  
> > > > >>>>>> -	blk_flush_queue_rq(flush_rq, false);
> > > > >>>>>> +	blk_flush_queue_rq(flush_rq, true);
> > > > >>>>
> > > > >>>> Yes, with this the one line change above only, the runtime was reduced.
> > > > >>>>
> > > > >>>>>
> > > > >>>>> However, the flush request had been added to the tail of the dispatch queue[1] for a long time.
> > > > >>>>> 0cacba6cf825 ("blk-mq-sched: bypass the scheduler for flushes entirely")
> > > > >>>>> and its predecessors (all the mq scheduler start patches) changed blk-mq to add the
> > > > >>>>> flush request to the front of the dispatch queue by ignoring the 'add_queue' parameter of
> > > > >>>>> blk_mq_sched_insert_flush(). That change may have been by accident, and I am not sure it is
> > > > >>>>> correct.
> > > > >>>>>
> > > > >>>>> I guess once the flush rq is added to the tail of the dispatch queue in block/004,
> > > > >>>>> in which lots of FS requests may stay in hctx->dispatch because of the low
> > > > >>>>> AHCI queue depth, it may take a bit long for the flush rq to be
> > > > >>>>> submitted to the LLD.
> > > > >>>>>
> > > > >>>>> I'd suggest to root cause/understand the issue given it isn't obviously
> > > > >>>>> correct to queue the flush rq at the front of the dispatch queue, so could you
> > > > >>>>> collect the trace via the following script with/without the single line
> > > > >>>>> patch?
> > > > >>>>
> > > > >>>> Thank you for the thoughts on the correct design. I have taken the two traces,
> > > > >>>> with and without the one-liner patch above. The gzip-archived trace files have
> > > > >>>> 1.6MB size. It looks too large to post to the list. Please let me know how you
> > > > >>>> want the trace files shared.
> > > > >>>
> > > > >>> I didn't think the trace could be so big given the ios should be just
> > > > >>> 256 * 64(1564).
> > > > >>>
> > > > >>> You may put the log somewhere on the Internet, cloud storage, web, or
> > > > >>> whatever. Then just provide us the link.
> > > > >>>
> > > > >>> Or if you can't find a place to hold it, just send it to me, and I will put
> > > > >>> it in my RH people web link.
> > > > >>
> > > > >> I have sent another e-mail only to you attaching the log files gzipped.
> > > > >> Your confirmation will be appreciated.
> > > > > 
> > > > > Yeah, I got the log, and it has been put in the following link:
> > > > > 
> > > > > http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
> > > > > 
> > > > > Also I have run some analysis; block/004 basically calls pwrite() &
> > > > > fsync() in each job.
> > > > > 
> > > > > 1) v5.6-rc kernel
> > > > > - write rq average latency: 0.091s 
> > > > > - flush rq average latency: 0.018s
> > > > > - average issue times of write request: 1.978  //how many trace_block_rq_issue() is called for one rq
> > > > > - average issue times of flush request: 1.052
> > > > > 
> > > > > 2) v5.6-rc patched kernel
> > > > > - write rq average latency: 0.031s
> > > > > - flush rq average latency: 0.035s
> > > > > - average issue times of write request: 1.466
> > > > > - average issue times of flush request: 1.610
> > > > > 
> > > > > 
> > > > > block/004 starts 64 jobs and AHCI's queue can become saturated easily,
> > > > > then BLK_MQ_S_SCHED_RESTART should be set, so write requests in the dispatch
> > > > > queue can only move on after one request is completed.
> > > > > 
> > > > > Looks like the flush request takes a shorter time than a normal write IO.
> > > > > If the flush request is added to the front of the dispatch queue, the next
> > > > > normal write IO can be queued to the lld quicker than when adding to the
> > > > > tail of the dispatch queue.
> > > > > trace_block_rq_issue() being called more than one time is because of
> > > > > AHCI or the drive's implementation. It usually means that
> > > > > host->hostt->queuecommand() fails several times when queuing one
> > > > > single request. For AHCI, I understand it shouldn't fail normally given
> > > > > we guarantee that the actual queue depth is <= either the LUN's or the host's
> > > > > queue depth. Maybe it depends on your SMR drive's handling of
> > > > > flush/write IO. Could you check why .queuecommand() fails in block/004?
> > > 
> > > I put some debug prints and confirmed that the .queuecommand function is
> > > ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> > > ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> > > notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> > > upper layer only knows the queue depth, we are responsible for maintaining
> > > exclusion.  This function checks whether a new command @qc can be issued." Then
> > > I guess .queuecommand() fails because the non-NCQ flush command and the NCQ
> > > write commands are waiting for each other's completion.
> > 
> > OK, got it.
> > 
> > > 
> > > > 
> > > > Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
> > > > the AHCI code besides disk probe checks. So write & flush handling does not
> > > > differ between SMR and regular disks. The same applies to the drive side: write
> > > > and flush commands are the normal commands, no change at all. The only
> > > > difference is the sequential write constraint, which the drive honors by not
> > > > executing the queued write commands out of order. But there are no constraints for
> > > > flush on SMR, so we get whatever the drive does, that is, totally drive dependent.
> > > > 
> > > > I wonder if the issue may be with the particular AHCI chipset used for this test.
> > > > 
> > > > > 
> > > > > Also can you provide queue flags via the following command?
> > > > > 
> > > > > 	cat /sys/kernel/debug/block/sdN/state
> > > 
> > > The state sysfs attribute was as follows:
> > > 
> > > SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> > > 
> > > It didn't change before and after the block/004 run.
> > > 
> > > 
> > > > > 
> > > > > I understand a flush request should be slower than a normal write, however
> > > > > that looks not to be true on this hardware.
> > > > 
> > > > Probably due to the fact that, since the writes are sequential, there is a lot of
> > > > drive-internal optimization that can be done to minimize the cost of flush
> > > > (internal data streaming from cache to media, media-cache use, etc.). That is true
> > > > for regular disks too. And all of this is highly dependent on the hardware
> > > > implementation.
> > > 
> > > This discussion tempted me to take a closer look at the traces. And I noticed that
> > > the number of flush commands issued is different with and without the patch.
> > > 
> > >                         | without patch | with patch
> > > ------------------------+---------------+------------
> > > block_getrq: rwbs=FWS   |      32640    |   32640
> > > block_rq_issue: rwbs=FF |      32101    |    7593
> > 
> > Looks like too many flush requests are issued, given the flush machinery
> > should avoid queueing duplicated flush requests.
> > 
> > I will investigate the flush code a bit.
> > 
> > > 
> > > Without the patch, a flush command is issued between two write commands. With the
> > > patch, some write commands are executed without a flush between them.
> > > 
> > > I wonder how the requeue list position of the flush command (head vs tail) changes
> > > the number of flush commands to issue.
> > > 
> > > Another weird thing is the number of block_getrq traces of flush (rwbs=FWS). It
> > > doubles the number of writes (256 * 64 = 16320). I will chase this further.
> > 
> > Indeed, I do not see such an issue when I run the test on kvm ahci.
> 
> > I found that the cause of the doubled flush commands is the dm-linear device I
> > set up. I mapped two areas of a single SMR drive to the dm-linear device, using
> > a dm table with two lines. My understanding is that when a flush is requested
> > to the dm-linear device, the flush request is duplicated and forwarded to the
> > destination devices listed in the dm table lines. In this case, a flush request
> > is duplicated for the two lines in the dm table, and both flush requests go to
> > the single SMR drive.

Got it, maybe dm-linear should avoid sending two flush requests to a
single drive.

Can you observe a similar issue when running block/004 on the AHCI/SMR
drive directly?
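
For context, dm core's flush submission looks roughly like the following
(paraphrasing __send_empty_flush() from drivers/md/dm.c, trimmed for brevity):

	static int __send_empty_flush(struct clone_info *ci)
	{
		unsigned target_nr = 0;
		struct dm_target *ti;

		/* one flush bio is cloned per target (per dm table line),
		 * even when several targets map the same underlying device
		 */
		while ((ti = dm_table_get_target(ci->map, target_nr++)))
			__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);

		return 0;
	}

so de-duplicating flushes that end up on the same backing device would need
extra state in dm core.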

> > This doubles the number of flush commands. I changed the dm table to have a
> > single line mapping a single area, and observed that the number of flush
> > commands (block_getrq with rwbs=FWS) was reduced to 16320, the same number as
> > the write commands.
> > 
> > This means that the workload with which I observed the longer block/004 runtime
> > on kernel 5.6-rc4 is really flush intensive: it issues two flush commands per
> > write command. It might be too sensitive to flush command behavior, so I'm not
> > sure this workload is worth optimizing for.

The story may be something like below:

1) at some point, suppose there are two write requests (w0, w1) in the
dispatch queue (hctx->dispatch), and one flush request has just completed,
so blk-mq starts to dispatch the requests in hctx->dispatch via the wq

2) a new flush request F is scheduled via the requeue wq, and the flush
request is added to the front of hctx->dispatch, which now holds (F, w0, w1)

3) meantime, given there are 64 fio jobs, one of the jobs happens to run the
queue and queues one rq from the scheduler to the lld; hctx->dispatch isn't
observed because of timing, so sata switches to NCQ mode

4) step 1 or 2 starts to run the queue and tries to dispatch the requests
(F, w0, w1). .queue_rq() returns STS_RESOURCE for F because F is a non-NCQ
command, so hctx->dispatch becomes (w0, w1, F) now. And the queue will only
be re-run after one write request is completed, so one write IO latency is
added to the flush request F.

5) when the flush request latency is increased, the chance of flush request
merge is increased, so far fewer flush requests are issued to sata in the
patched kernel, and then the issuing time for write requests is decreased.

If the flush request is added to the tail of hctx->dispatch in step 2), the
extra IO latency won't be added to the flush request, and the flush request
can be completed in less time; meantime the chance of flush request merge
is reduced a lot. When more flush requests are issued to sata, more time
is spent issuing the write IOs.

Anyway, this problem is NCQ specific. If it can be reproduced on
AHCI/SMR directly, we may need to fix it.
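
For reference, the libata deferral that produces the STS_RESOURCE in step 4)
is roughly the following (paraphrasing ata_std_qc_defer() from
drivers/ata/libata-core.c):

	int ata_std_qc_defer(struct ata_queued_cmd *qc)
	{
		struct ata_link *link = qc->dev->link;

		if (ata_is_ncq(qc->tf.protocol)) {
			/* NCQ command: ok unless a non-NCQ command is active */
			if (!ata_tag_valid(link->active_tag))
				return 0;
		} else {
			/* non-NCQ command (e.g. flush): needs the link idle,
			 * with no NCQ commands outstanding either
			 */
			if (!ata_tag_valid(link->active_tag) && !link->sactive)
				return 0;
		}

		return ATA_DEFER_LINK;
	}

so once any NCQ write is in flight the flush is deferred, and vice versa.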

Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10 13:37                               ` Ming Lei
@ 2020-03-10 14:37                                 ` Ming Lei
  2020-03-11  4:59                                   ` Shinichiro Kawasaki
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2020-03-10 14:37 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Tue, Mar 10, 2020 at 09:37:24PM +0800, Ming Lei wrote:
[...]
> 
> The story may be something like below:
> 
> 1) at some point, suppose there are two write requests (w0, w1) in the
> dispatch queue (hctx->dispatch), and one flush request has just completed,
> so blk-mq starts to dispatch the requests in hctx->dispatch via the wq
> 
> 2) a new flush request F is scheduled via the requeue wq, and the flush
> request is added to the front of hctx->dispatch, which now holds (F, w0, w1)
> 
> 3) meantime, given there are 64 fio jobs, one of the jobs happens to run the
> queue and queues one rq from the scheduler to the lld; hctx->dispatch isn't
> observed because of timing, so sata switches to NCQ mode
> 
> 4) step 1 or 2 starts to run the queue and tries to dispatch the requests
> (F, w0, w1). .queue_rq() returns STS_RESOURCE for F because F is a non-NCQ
> command, so hctx->dispatch becomes (w0, w1, F) now. And the queue will only
> be re-run after one write request is completed, so one write IO latency is
> added to the flush request F.
> 
> 5) when the flush request latency is increased, the chance of flush request
> merge is increased, so far fewer flush requests are issued to sata in the
> patched kernel, and then the issuing time for write requests is decreased.
> 
> If the flush request is added to the tail of hctx->dispatch in step 2), the
> extra IO latency won't be added to the flush request, and the flush request
> can be completed in less time; meantime the chance of flush request merge
> is reduced a lot. When more flush requests are issued to sata, more time
> is spent issuing the write IOs.
> 
> Anyway, this problem is NCQ specific. If it can be reproduced on
> AHCI/SMR directly, we may need to fix it.

In short, when the flush rq is added to the front of hctx->dispatch, it is easier
to add extra time to the flush rq's latency compared with adding it to the tail
of the dispatch queue; the chance of flush merge is then increased, fewer
flush requests are issued to the driver/controller, and finally this kind of
flush workload is improved.

So maybe we can add the flush rq to the tail of the dispatch queue just in the
NCQ case.
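
Just to illustrate the idea, one possible shape is the following (untested
sketch; the blk_queue_flush_tail() helper and the queue flag behind it are
hypothetical, and a driver such as AHCI would have to set the flag):

	/* in blk_kick_flush(): insert the flush rq at the tail of the
	 * dispatch queue when the driver asks for it, keep the current
	 * front insertion otherwise; blk_queue_flush_tail() is hypothetical
	 */
	blk_flush_queue_rq(flush_rq, !blk_queue_flush_tail(q));

The open question is where the NCQ-specific knowledge should live.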

Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-10 14:37                                 ` Ming Lei
@ 2020-03-11  4:59                                   ` Shinichiro Kawasaki
  2020-03-11  7:54                                     ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Shinichiro Kawasaki @ 2020-03-11  4:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Mar 10, 2020 / 22:37, Ming Lei wrote:
> On Tue, Mar 10, 2020 at 09:37:24PM +0800, Ming Lei wrote:
> > On Tue, Mar 10, 2020 at 11:07:04AM +0000, Shinichiro Kawasaki wrote:
[...]
> > > I found that the cause of the doubled flush commands is the dm-linear device I
> > > set up. I mapped two areas of a single SMR drive to the dm-linear device, using
> > > a dm table with two lines. My understanding is that when a flush is requested
> > > to the dm-linear device, the flush request is duplicated and forwarded to the
> > > destination devices listed in the dm table lines. In this case, a flush request
> > > is duplicated for the two lines in the dm table, and both flush requests go to
> > > the single SMR drive.
> > 
> > Got it, maybe dm-linear should avoid sending two flush requests to a
> > single drive.
> > 
> > Can you observe a similar issue when running block/004 on the AHCI/SMR
> > drive directly?

Yes. The runtime delta is smaller, so I had not noticed it. Without the patch,
a block/004 run on the AHCI/SMR drive takes 115+ seconds. With the patch,
it takes around 100 seconds.

I also took the bpf traces and counted the flush commands issued. The numbers
look similar to those with dm-linear.

                        | without patch | with patch
------------------------+---------------+------------
block_getrq: rwbs=FWS   |      16320    |   16320
block_rq_issue: rwbs=FF |      17051    |    5866

[...]
> 
> > In short, adding the flush rq to the front of hctx->dispatch tends to add
> > extra time to the flush rq's latency compared with adding it to the tail
> > of the dispatch queue; the chance of flush merging then increases, fewer
> > flush requests are issued to the driver/controller, and this kind of
> > flush workload ultimately improves.
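
For reference, the front-vs-tail knob discussed here is the at_head flag
of blk_mq_request_bypass_insert(), which commit 01e99aeca397 itself
introduced; in the v5.6-rc block/blk-mq.c it reads roughly as follows
(lightly trimmed):

    void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
                                      bool run_queue)
    {
            struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

            spin_lock(&hctx->lock);
            if (at_head)
                    list_add(&rq->queuelist, &hctx->dispatch);      /* front */
            else
                    list_add_tail(&rq->queuelist, &hctx->dispatch); /* tail */
            spin_unlock(&hctx->lock);

            if (run_queue)
                    blk_mq_run_hw_queue(hctx, false);
    }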

Ming, thank you very much for the description.

> 
> > So maybe we can add the flush rq to the tail of dispatch only in the NCQ case.

Reading the line above, I'm not yet sure what the fix will look like. Do you
mean to check whether the SATA driver is in NCQ mode within blk_kick_flush()?

-- 
Best Regards,
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: commit 01e99aeca397 causes longer runtime of block/004
  2020-03-11  4:59                                   ` Shinichiro Kawasaki
@ 2020-03-11  7:54                                     ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2020-03-11  7:54 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Damien Le Moal, linux-block, Jens Axboe

On Wed, Mar 11, 2020 at 04:59:23AM +0000, Shinichiro Kawasaki wrote:
> On Mar 10, 2020 / 22:37, Ming Lei wrote:
> > On Tue, Mar 10, 2020 at 09:37:24PM +0800, Ming Lei wrote:
> > > On Tue, Mar 10, 2020 at 11:07:04AM +0000, Shinichiro Kawasaki wrote:
> > > > On Mar 10, 2020 / 16:07, Ming Lei wrote:
> > > > > On Tue, Mar 10, 2020 at 05:54:18AM +0000, Shinichiro Kawasaki wrote:
> > > > > > Ming, thank you for sharing the log files and analysis.
> > > > > > 
> > > > > > On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> > > > > > > On 2020/03/10 1:14, Ming Lei wrote:
> > > > > > > > On Mon, Mar 09, 2020 at 12:07:16AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >> On Mar 07, 2020 / 12:13, Ming Lei wrote:
> > > > > > > >>> On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>>> On Mar 06, 2020 / 16:13, Ming Lei wrote:
> > > > > > > >>>>> On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>>>>> On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > > > > > >>>>>>> Hi Shinichiro,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>>>>>>> On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > > > > > >>>>>>>>> On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>>>>>>>>> On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > > > > > >>>>>>>>>>> On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>>>>>>>>>>> I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > > > > > >>>>>>>>>>>> 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > > > > > >>>>>>>>>>>> request into hctx->dispatch directly") triggers it.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > > > > > >>>>>>>>>>>> connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > > > > > >>>>>>>>>>>> connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > > > > > >>>>>>>>>>>> around 300 seconds. I need to dig in further details to understand why the
> > > > > > > >>>>>>>>>>>> commit makes the test case longer.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> The test case block/004 does "flush intensive workload". Is this longer runtime
> > > > > > > >>>>>>>>>>>> expected?
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> The following patch might address this issue:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Please test and provide us the result.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> thanks,
> > > > > > > >>>>>>>>>>> Ming
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Hi Ming,
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I applied the patch to 5.6-rc4, but I still observed the longer runtime
> > > > > > > >>>>>>>>>> of block/004; it takes around 300 seconds.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Hello Shinichiro,
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> block/004 only sends 1564 sync randwrites, and 130s already seems slow
> > > > > > > >>>>>>>>> enough.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> There are two effects in that commit related to your issue:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> 1) 'at_head' is applied in blk_mq_sched_insert_request() for flush
> > > > > > > >>>>>>>>> request
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> 2) all IO is added back to tail of hctx->dispatch after .queue_rq()
> > > > > > > >>>>>>>>> returns STS_RESOURCE
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> It seems more related to 2), given you can't reproduce the issue on
> > > > > > > >>>>>>>>> SAS.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> So please test the following two patches, and see which one makes a
> > > > > > > >>>>>>>>> difference for you.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> BTW, neither of the two looks reasonable; they are just for narrowing down the issue.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> 1) patch 1
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > > > >>>>>>>>> index 856356b1619e..86137c75283c 100644
> > > > > > > >>>>>>>>> --- a/block/blk-mq-sched.c
> > > > > > > >>>>>>>>> +++ b/block/blk-mq-sched.c
> > > > > > > >>>>>>>>> @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > > > > > >>>>>>>>>  	WARN_ON(e && (rq->tag != -1));
> > > > > > > >>>>>>>>>  
> > > > > > > >>>>>>>>>  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > > > > > >>>>>>>>> -		blk_mq_request_bypass_insert(rq, at_head, false);
> > > > > > > >>>>>>>>> +		blk_mq_request_bypass_insert(rq, true, false);
> > > > > > > >>>>>>>>>  		goto run;
> > > > > > > >>>>>>>>>  	}
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Ming, thank you for the trial patches.
> > > > > > > >>>>>>>> This "patch 1" reduced the runtime to be as short as rc3's.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> 2) patch 2
> > > > > > > >>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > >>>>>>>>> index d92088dec6c3..447d5cb39832 100644
> > > > > > > >>>>>>>>> --- a/block/blk-mq.c
> > > > > > > >>>>>>>>> +++ b/block/blk-mq.c
> > > > > > > >>>>>>>>> @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > > > > > >>>>>>>>>  			q->mq_ops->commit_rqs(hctx);
> > > > > > > >>>>>>>>>  
> > > > > > > >>>>>>>>>  		spin_lock(&hctx->lock);
> > > > > > > >>>>>>>>> -		list_splice_tail_init(list, &hctx->dispatch);
> > > > > > > >>>>>>>>> +		list_splice_init(list, &hctx->dispatch);
> > > > > > > >>>>>>>>>  		spin_unlock(&hctx->lock);
> > > > > > > >>>>>>>>>  
> > > > > > > >>>>>>>>>  		/*
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> This patch 2 didn't reduce the runtime.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I hope this report helps.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Your feedback does help; please test the following patch:
> > > > > > > >>>>>>
> > > > > > > >>>>>> Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed
> > > > > > > >>>>>> that it reduces the runtime to be as short as rc3's. Good.
> > > > > > > >>>>>
> > > > > > > >>>>> Hi Shinichiro,
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks for your test!
> > > > > > > >>>>>
> > > > > > > >>>>> Then I think the following change should be what actually makes the
> > > > > > > >>>>> difference; please double check and confirm whether that is the case.
> > > > > > > >>>>>
> > > > > > > >>>>>> @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> > > > > > > >>>>>>  	flush_rq->rq_disk = first_rq->rq_disk;
> > > > > > > >>>>>>  	flush_rq->end_io = flush_end_io;
> > > > > > > >>>>>>  
> > > > > > > >>>>>> -	blk_flush_queue_rq(flush_rq, false);
> > > > > > > >>>>>> +	blk_flush_queue_rq(flush_rq, true);
> > > > > > > >>>>
> > > > > > > >>>> Yes, with only the one-line change above, the runtime was reduced.
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> However, the flush request had been added to the tail of the dispatch
> > > > > > > >>>>> queue[1] for a long time. 0cacba6cf825 ("blk-mq-sched: bypass the scheduler
> > > > > > > >>>>> for flushes entirely") and its predecessors (the initial mq scheduler
> > > > > > > >>>>> patches) changed blk-mq to add the flush request to the front of the
> > > > > > > >>>>> dispatch queue by ignoring the 'add_queue' parameter of
> > > > > > > >>>>> blk_mq_sched_insert_flush(). That change may have been accidental, and
> > > > > > > >>>>> I'm not sure it is correct.
> > > > > > > >>>>>
> > > > > > > >>>>> I guess that once the flush rq is added to the tail of the dispatch queue
> > > > > > > >>>>> in block/004, where lots of FS requests may be sitting in hctx->dispatch
> > > > > > > >>>>> because of the low AHCI queue depth, it may take quite a while for the
> > > > > > > >>>>> flush rq to be submitted to the LLD.
> > > > > > > >>>>>
> > > > > > > >>>>> I'd suggest we root-cause and understand the issue, given it isn't
> > > > > > > >>>>> obviously correct to queue the flush rq at the front of the dispatch
> > > > > > > >>>>> queue. Could you collect the trace below via the following script, with
> > > > > > > >>>>> and without the single-line patch?
> > > > > > > >>>>
> > > > > > > >>>> Thank you for the thoughts on the correct design. I have taken the two
> > > > > > > >>>> traces, with and without the one-liner patch above. The gzip-archived
> > > > > > > >>>> trace files are 1.6MB, which looks too large to post to the list. Please
> > > > > > > >>>> let me know how you want the trace files shared.
> > > > > > > >>>
> > > > > > > >>> I didn't think the trace could be so big given the IOs should be just
> > > > > > > >>> 256 * 64 (1564).
> > > > > > > >>>
> > > > > > > >>> You may put the log somewhere on the Internet (cloud storage, web, or
> > > > > > > >>> whatever), then just provide us the link.
> > > > > > > >>>
> > > > > > > >>> Or if you can't find a place to host it, just send it to me, and I will
> > > > > > > >>> put it at my RH people web link.
> > > > > > > >>
> > > > > > > >> I have sent another e-mail only to you, attaching the gzipped log files.
> > > > > > > >> Your confirmation will be appreciated.
> > > > > > > > 
> > > > > > > > Yeah, I got the log, and it has been put in the following link:
> > > > > > > > 
> > > > > > > > http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
> > > > > > > > 
> > > > > > > > Also, I have run some analysis; block/004 basically calls pwrite() &
> > > > > > > > fsync() in each job.
> > > > > > > > 
> > > > > > > > 1) v5.6-rc kernel
> > > > > > > > - write rq average latency: 0.091s
> > > > > > > > - flush rq average latency: 0.018s
> > > > > > > > - average issue count per write request: 1.978  // how many times trace_block_rq_issue() is called for one rq
> > > > > > > > - average issue count per flush request: 1.052
> > > > > > > > 
> > > > > > > > 2) v5.6-rc patched kernel
> > > > > > > > - write rq average latency: 0.031s
> > > > > > > > - flush rq average latency: 0.035s
> > > > > > > > - average issue count per write request: 1.466
> > > > > > > > - average issue count per flush request: 1.610
> > > > > > > > 
> > > > > > > > 
> > > > > > > > block/004 starts 64 jobs, and AHCI's queue can become saturated easily;
> > > > > > > > BLK_MQ_S_SCHED_RESTART should then be set, so a write request in the
> > > > > > > > dispatch queue can only move on after another request completes.
> > > > > > > > 
> > > > > > > > It looks like the flush request takes a shorter time than a normal write
> > > > > > > > IO. If the flush request is added to the front of the dispatch queue,
> > > > > > > > the next normal write IO can be queued to the LLD quicker than if it
> > > > > > > > were added to the tail of the dispatch queue.
> > > > > > > > trace_block_rq_issue() being called more than once is due to AHCI or the
> > > > > > > > drive's implementation. It usually means that host->hostt->queuecommand()
> > > > > > > > fails several times when queuing one single request. For AHCI, I
> > > > > > > > understand it shouldn't normally fail, given we guarantee that the actual
> > > > > > > > queue depth is <= both the LUN's and the host's queue depth. Maybe it
> > > > > > > > depends on your SMR drive's handling of flush/write IO. Could you check
> > > > > > > > why .queuecommand() fails in block/004?
> > > > > > 
> > > > > > I put some debug prints and confirmed that the .queuecommand function is
> > > > > > ata_scsi_queuecmd() and that it returns SCSI_MLQUEUE_DEVICE_BUSY because
> > > > > > ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> > > > > > notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> > > > > > upper layer only knows the queue depth, we are responsible for maintaining
> > > > > > exclusion.  This function checks whether a new command @qc can be issued."
> > > > > > So I guess .queuecommand() fails because the non-NCQ flush command and the
> > > > > > NCQ write commands are waiting for each other's completion.
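
For reference, the deferral logic in question is short; ata_std_qc_defer()
reads roughly as follows around v5.6 in drivers/ata/libata-core.c: an NCQ
command may issue unless a non-NCQ command is active, while a non-NCQ
command (such as flush) may issue only when nothing at all is active:

    int ata_std_qc_defer(struct ata_queued_cmd *qc)
    {
            struct ata_link *link = qc->dev->link;

            if (ata_is_ncq(qc->tf.protocol)) {
                    /* NCQ command: defer only if a non-NCQ command is active */
                    if (!ata_tag_valid(link->active_tag))
                            return 0;
            } else {
                    /* non-NCQ command: defer if anything at all is active */
                    if (!ata_tag_valid(link->active_tag) && !link->sactive)
                            return 0;
            }

            return ATA_DEFER_LINK;
    }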
> > > > > 
> > > > > OK, got it.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
> > > > > > > the AHCI code besides disk probe checks, so write & flush handling does not
> > > > > > > differ between SMR and regular disks. The same applies to the drive side:
> > > > > > > write and flush commands are normal commands, no change at all. The only
> > > > > > > difference is the sequential write constraint, which the drive honors by not
> > > > > > > executing the queued write commands out of order. But there is no constraint
> > > > > > > for flush on SMR, so we get whatever the drive does; it is totally drive
> > > > > > > dependent.
> > > > > > > 
> > > > > > > I wonder if the issue may be with the particular AHCI chipset used for this test.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Also, can you provide the queue flags via the following command?
> > > > > > > > 
> > > > > > > > 	cat /sys/kernel/debug/block/sdN/state
> > > > > > 
> > > > > > The state debugfs attribute was as follows:
> > > > > > 
> > > > > > SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> > > > > > 
> > > > > > It was the same before and after the block/004 run.
> > > > > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > I understand that a flush request should be slower than a normal write;
> > > > > > > > however, that looks not to be true on this hardware.
> > > > > > > 
> > > > > > > Probably due to the fact that, since the writes are sequential, a lot of
> > > > > > > drive-internal optimization can be done to minimize the cost of flush
> > > > > > > (internal data streaming from cache to media, media-cache use, etc.). That
> > > > > > > is true for regular disks too, and all of this is highly dependent on the
> > > > > > > hardware implementation.
> > > > > > 
> > > > > > This discussion tempted me to take a closer look at the traces, and I noticed
> > > > > > that the number of flush commands issued differs with and without the patch.
> > > > > > 
> > > > > >                         | without patch | with patch
> > > > > > ------------------------+---------------+------------
> > > > > > block_getrq: rwbs=FWS   |      32640    |   32640
> > > > > > block_rq_issue: rwbs=FF |      32101    |    7593
> > > > > 
> > > > > It looks like too many flush requests are issued, given that the flush
> > > > > machinery should avoid queuing duplicated flush requests.
> > > > > 
> > > > > I will investigate the flush code a bit.
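
For reference, the hold-back logic that makes this merging possible sits
at the top of blk_kick_flush(); trimmed from the v5.6 block/blk-flush.c,
with comments paraphrased, it reads roughly as follows:

    static void blk_kick_flush(struct request_queue *q,
                               struct blk_flush_queue *fq, unsigned int flags)
    {
            struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
            struct request *flush_rq = fq->flush_rq;

            /* C1: a flush is already in flight -- let it finish first;
             * flushes arriving meanwhile pile up on the pending list */
            if (fq->flush_pending_idx != fq->flush_running_idx ||
                list_empty(pending))
                    return;

            /* C2/C3: data writes are in flight -- hold the new flush back
             * (bounded by FLUSH_PENDING_TIMEOUT) so more flushes can merge;
             * the real code has an extra scheduler-related condition here */
            if (!list_empty(&fq->flush_data_in_flight) &&
                time_before(jiffies,
                            fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
                    return;

            /* ... the single flush_rq is then set up and queued, serving
             * every request on the pending list at once ... */
    }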
> > > > > 
> > > > > > 
> > > > > > Without the patch, a flush command is issued between every two write
> > > > > > commands. With the patch, some write commands execute without a flush
> > > > > > between them.
> > > > > > 
> > > > > > I wonder how the requeue-list position of the flush command (head vs tail)
> > > > > > changes the number of flush commands issued.
> > > > > > 
> > > > > > Another weird thing is the number of block_getrq traces of flush (rwbs=FWS).
> > > > > > It is double the number of writes (256 * 64 = 16320). I will chase this
> > > > > > further.
> > > > > 
> > > > > Indeed, I do not see such an issue when I run the test on kvm ahci.
> > > > 
> > > > I found that the cause of the doubled flush commands is the dm-linear device I
> > > > set up. I mapped two areas of a single SMR drive to the dm-linear device, using
> > > > a dm table with two lines. My understanding is that when a flush is requested
> > > > to the dm-linear device, the flush request is duplicated and forwarded to the
> > > > destination devices listed in the dm table lines. In this case, a flush request
> > > > is duplicated for the two lines in the dm table, and both flush requests go to
> > > > the single SMR drive.
> > > 
> > > Got it; maybe dm-linear should avoid sending two flush requests to a
> > > single drive.
> > > 
> > > Can you observe a similar issue when running block/004 on the AHCI/SMR
> > > drive directly?
> 
> Yes. The runtime delta is smaller, so I had not noticed it. Without the patch,
> block/004 run on the AHCI/SMR drive takes 115+ seconds. With the patch,
> it takes around 100 seconds.
> 
> I also took the bpf traces and counted the flush commands issued. The numbers
> look similar to those with dm-linear.
> 
>                         | without patch | with patch
> ------------------------+---------------+------------
> block_getrq: rwbs=FWS   |      16320    |   16320
> block_rq_issue: rwbs=FF |      17051    |    5866

BTW, you should count 'block_rq_complete: rwbs=FF' instead of
block_rq_issue(), which may be called several times for each flush
request.

Anyway, the effect is visible.

> 
> > > 
> > > > This doubles the number of flush commands. I changed the dm table to have a
> > > > single line mapping a single area, and observed that the number of flush
> > > > commands (block_getrq with rwbs=FWS) dropped to 16320, the same number as
> > > > the write commands.
> > > > 
> > > > This means that the workload with which I observed the longer block/004
> > > > runtime on kernel 5.6-rc4 is really flush intensive: it issues two flush
> > > > commands per write command. It might be too sensitive to flush command
> > > > behavior, and I'm not sure this workload is worth performance tuning.
> > > 
> > > The story may go something like this:
> > > 
> > > 1) at some point, suppose there are two write requests (w0, w1) in the
> > > dispatch queue (hctx->dispatch), and one flush request has just completed,
> > > so blk-mq starts to dispatch the requests in hctx->dispatch via the wq
> > > 
> > > 2) a new flush request F is scheduled via the requeue wq, and the flush
> > > request is added to the front of hctx->dispatch, which now holds (F, w0, w1)
> > > 
> > > 3) meanwhile, given there are 64 fio jobs, one of the jobs runs the queue
> > > and queues one rq from the scheduler to the LLD; hctx->dispatch isn't
> > > observed because of timing, so SATA switches to NCQ mode
> > > 
> > > 4) step 1 or 2 starts to run the queue and tries to dispatch the requests
> > > (F, w0, w1). .queue_rq() returns STS_RESOURCE for F because F is a non-NCQ
> > > command, so hctx->dispatch becomes (w0, w1, F). The queue is re-run only
> > > after one write request completes, so one write IO latency is added to
> > > the flush request F.
> > > 
> > > 5) when the flush request latency increases, the chance of flush request
> > > merging increases, so far fewer flush requests are issued to SATA in the
> > > patched kernel, and the time spent issuing write requests decreases.
> > > 
> > > If the flush request is added to the tail of hctx->dispatch in step 2), no
> > > extra IO latency is added to the flush request, and the flush request
> > > completes in less time; meanwhile the chance of flush request merging is
> > > much reduced. When more flush requests are issued to SATA, more time goes
> > > into issuing write IOs.
> > > 
> > > Anyway, this problem is NCQ specific. If it can be reproduced on
> > > AHCI/SMR directly, we may need to fix it.
> > 
> > > In short, adding the flush rq to the front of hctx->dispatch tends to add
> > > extra time to the flush rq's latency compared with adding it to the tail
> > > of the dispatch queue; the chance of flush merging then increases, fewer
> > > flush requests are issued to the driver/controller, and this kind of
> > > flush workload ultimately improves.
> 
> Ming, thank you very much for the description.
> 
> > 
> > So maybe we can add the flush rq to the tail of dispatch only in the NCQ case.
> 
> Reading the line above, I'm not yet sure what the fix will look like. Do you
> mean to check whether the SATA driver is in NCQ mode within blk_kick_flush()?

We may switch to the blk_flush_queue_rq(flush_rq, true) change in
blk_kick_flush() directly, and I will figure out a patch.
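
That is, something along the lines of the one-liner already validated
above (shown against the v5.6-rc block/blk-flush.c; the final patch may
end up different):

    --- a/block/blk-flush.c
    +++ b/block/blk-flush.c
    @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
         flush_rq->rq_disk = first_rq->rq_disk;
         flush_rq->end_io = flush_end_io;
     
    -    blk_flush_queue_rq(flush_rq, false);
    +    blk_flush_queue_rq(flush_rq, true);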

Thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-03-11  7:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-04  2:38 commit 01e99aeca397 causes longer runtime of block/004 Shinichiro Kawasaki
2020-03-04  3:46 ` Ming Lei
2020-03-04  6:11   ` Shinichiro Kawasaki
2020-03-04  9:53     ` Ming Lei
2020-03-05  1:19       ` Shinichiro Kawasaki
2020-03-05  2:48         ` Ming Lei
2020-03-06  6:06           ` Shinichiro Kawasaki
2020-03-06  8:13             ` Ming Lei
2020-03-07  1:02               ` Shinichiro Kawasaki
2020-03-07  4:13                 ` Ming Lei
2020-03-09  0:07                   ` Shinichiro Kawasaki
2020-03-09 16:14                     ` Ming Lei
2020-03-10  3:07                       ` Damien Le Moal
2020-03-10  5:54                         ` Shinichiro Kawasaki
2020-03-10  6:00                           ` Damien Le Moal
2020-03-10  8:07                           ` Ming Lei
2020-03-10 11:07                             ` Shinichiro Kawasaki
2020-03-10 13:37                               ` Ming Lei
2020-03-10 14:37                                 ` Ming Lei
2020-03-11  4:59                                   ` Shinichiro Kawasaki
2020-03-11  7:54                                     ` Ming Lei
