All of lore.kernel.org
 help / color / mirror / Atom feed
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: commit 01e99aeca397 causes longer runtime of block/004
Date: Tue, 10 Mar 2020 06:00:06 +0000	[thread overview]
Message-ID: <BYAPR04MB5816B51DA078EBCECC08D019E7FF0@BYAPR04MB5816.namprd04.prod.outlook.com> (raw)
In-Reply-To: 20200310055417.o3jghx4sl5xtztci@shindev.dhcp.fujisawa.hgst.com

On 2020/03/10 14:54, Shinichiro Kawasaki wrote:
[...]
>>> Yeah, I got the log, and it has been put in the following link:
>>>
>>> http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
>>>
>>> Also I have run some analysis, and block/004 basically call pwrite() &
>>> fsync() in each job.
>>>
>>> 1) v5.6-rc kernel
>>> - write rq average latency: 0.091s 
>>> - flush rq average latency: 0.018s
>>> - average issue times of write request: 1.978  //how many trace_block_rq_issue() is called for one rq
>>> - average issue times of flush request: 1.052
>>>
>>> 2) v5.6-rc patched kernel
>>> - write rq average latency: 0.031
>>> - flush rq average latency: 0.035
>>> - average issue times of write request: 1.466
>>> - average issue times of flush request: 1.610
>>>
>>>
>>> block/004 starts 64 jobs and AHCI's queue can become saturated easily,
>>> then BLK_MQ_S_SCHED_RESTART should be set, so write request in dispatch
>>> queue can only move on after one request is completed.
>>>
>>> Looks the flush request takes shorter time than normal write IO.
>>> If flush request is added to front of dispatch queue, the next normal
>>> write IO could be queued to lld quicker than adding to tail of dispatch
>>> queue.
>>> trace_block_rq_issue() is called more than one time is because of
>>> AHCI or the drive's implementation. It usually means that
>>> host->hostt->queuecommand() fails for several times when queuing one
>>> single request. For AHCI, I understand it shouldn't fail normally given
>>> we guarantee that the actual queue depth is <= either LUN or host's
>>> queue depth. Maybe it depends on your SMR's implementation about handling
>>> flush/write IO. Could you check why .queuecommand() fails in block/004?
> 
> I put some debug prints and confirmed that the .queuecommand function is
> ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> upper layer only knows the queue depth, we are responsible for maintaining
> exclusion.  This function checks whether a new command @qc can be issued." Then
> I guess .queuecommand() fails because is that Non-NCQ flush command and NCQ
> write command are waiting the completion each other.
> 
>>
>> Indeed, that is weird that queuecommand fails. There is nothing SMR specific in
>> the AHCI code beside disk probe checks. So write & flush handling does not
>> differ between SMR and regular disks. The same applies to the drive side. write
>> and flush commands are the normal commands, no change at all. The only
>> difference being the sequential write constraint which the drive honors by not
>> executing the queued write command out of order. But there are no constraint for
>> flush on SMR, so we get whatever the drive does, that is, totally drive dependent.
>>
>> I wonder if the issue may be with the particular AHCI chipset used for this test.
>>
>>>
>>> Also can you provide queue flags via the following command?
>>>
>>> 	cat /sys/kernel/debug/block/sdN/state
> 
> The state sysfs attribute was as follows:
> 
> SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> 
> It didn't change before and after the block/004 run.
> 
> 
>>>
>>> I understand flush request should be slower than normal write, however
>>> looks not true in this hardware.
>>
>> Probably due to the fact that since the writes are sequential, there is a lot of
>> drive internal optimization that can be done to minimize the cost of flush
>> (internal data streaming from cache to media, media-cache use, etc) That is true
>> for regular disks too. And all of this is highly dependent on the hardware
>> implementation.
> 
> This discussion tempted me to take closer look in the traces. And I noticed that
> number of flush commands issued is different with and without the patch.
> 
>                         | without patch | with patch
> ------------------------+---------------+------------
> block_getrq: rwbs=FWS   |      32640    |   32640
> block_rq_issue: rwbs=FF |      32101    |    7593
> 
> Without the patch, flush command is issued between two write commands. With the
> patch, some write commands are executed without flush between them.
> 
> I wonder how the requeue list position of flush command (head vs tail) changes
> the number of flush commands to issue.
> 
> Another weird thing is number of block_getrq traces of flush (rwds=FWS). It
> doubles number of writes (256 * 64 = 16320). I will chase this further.
> 
> 
>> Shinichiro,
>>
>> It may be worth trying the same run with & without Ming's patch on a machine
>> with a different chipset...
> 
> Thanks for the advice. So far, I observer the long block/004 runtime on two
> systems. One with Intel C220 Series SATA controller, and the other with Intel
> 200 Series PCH SATA controller. I will try to find other SATA controller.

I do not think it is necessary. I forgot that flush cache is a non-ncq command.
That is why queuecommand fails as you found out: it is waiting for all NCQ
commands to complete first. Nothing to do with the chipset type. This is per ATA
specs.


-- 
Damien Le Moal
Western Digital Research

  reply	other threads:[~2020-03-10  6:00 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-04  2:38 commit 01e99aeca397 causes longer runtime of block/004 Shinichiro Kawasaki
2020-03-04  3:46 ` Ming Lei
2020-03-04  6:11   ` Shinichiro Kawasaki
2020-03-04  9:53     ` Ming Lei
2020-03-05  1:19       ` Shinichiro Kawasaki
2020-03-05  2:48         ` Ming Lei
2020-03-06  6:06           ` Shinichiro Kawasaki
2020-03-06  8:13             ` Ming Lei
2020-03-07  1:02               ` Shinichiro Kawasaki
2020-03-07  4:13                 ` Ming Lei
2020-03-09  0:07                   ` Shinichiro Kawasaki
2020-03-09 16:14                     ` Ming Lei
2020-03-10  3:07                       ` Damien Le Moal
2020-03-10  5:54                         ` Shinichiro Kawasaki
2020-03-10  6:00                           ` Damien Le Moal [this message]
2020-03-10  8:07                           ` Ming Lei
2020-03-10 11:07                             ` Shinichiro Kawasaki
2020-03-10 13:37                               ` Ming Lei
2020-03-10 14:37                                 ` Ming Lei
2020-03-11  4:59                                   ` Shinichiro Kawasaki
2020-03-11  7:54                                     ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BYAPR04MB5816B51DA078EBCECC08D019E7FF0@BYAPR04MB5816.namprd04.prod.outlook.com \
    --to=damien.lemoal@wdc.com \
    --cc=axboe@kernel.dk \
    --cc=linux-block@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.