Re: commit 01e99aeca397 causes longer runtime of block/004

From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: commit 01e99aeca397 causes longer runtime of block/004
Date: Tue, 10 Mar 2020 06:00:06 +0000	[thread overview]
Message-ID: <BYAPR04MB5816B51DA078EBCECC08D019E7FF0@BYAPR04MB5816.namprd04.prod.outlook.com> (raw)
In-Reply-To: 20200310055417.o3jghx4sl5xtztci@shindev.dhcp.fujisawa.hgst.com

On 2020/03/10 14:54, Shinichiro Kawasaki wrote:
[...]
>>> Yeah, I got the log, and it has been put in the following link:
>>>
>>> http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
>>>
>>> Also I have run some analysis, and block/004 basically call pwrite() &
>>> fsync() in each job.
>>>
>>> 1) v5.6-rc kernel
>>> - write rq average latency: 0.091s 
>>> - flush rq average latency: 0.018s
>>> - average issue times of write request: 1.978  //how many trace_block_rq_issue() is called for one rq
>>> - average issue times of flush request: 1.052
>>>
>>> 2) v5.6-rc patched kernel
>>> - write rq average latency: 0.031
>>> - flush rq average latency: 0.035
>>> - average issue times of write request: 1.466
>>> - average issue times of flush request: 1.610
>>>
>>>
>>> block/004 starts 64 jobs and AHCI's queue can become saturated easily,
>>> then BLK_MQ_S_SCHED_RESTART should be set, so write request in dispatch
>>> queue can only move on after one request is completed.
>>>
>>> Looks the flush request takes shorter time than normal write IO.
>>> If flush request is added to front of dispatch queue, the next normal
>>> write IO could be queued to lld quicker than adding to tail of dispatch
>>> queue.
>>> trace_block_rq_issue() is called more than one time is because of
>>> AHCI or the drive's implementation. It usually means that
>>> host->hostt->queuecommand() fails for several times when queuing one
>>> single request. For AHCI, I understand it shouldn't fail normally given
>>> we guarantee that the actual queue depth is <= either LUN or host's
>>> queue depth. Maybe it depends on your SMR's implementation about handling
>>> flush/write IO. Could you check why .queuecommand() fails in block/004?
> 
> I put some debug prints and confirmed that the .queuecommand function is
> ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> notes that "Non-NCQ commands cannot run with any other command, NCQ or not.  As
> upper layer only knows the queue depth, we are responsible for maintaining
> exclusion.  This function checks whether a new command @qc can be issued." Then
> I guess .queuecommand() fails because is that Non-NCQ flush command and NCQ
> write command are waiting the completion each other.
> 
>>
>> Indeed, that is weird that queuecommand fails. There is nothing SMR specific in
>> the AHCI code beside disk probe checks. So write & flush handling does not
>> differ between SMR and regular disks. The same applies to the drive side. write
>> and flush commands are the normal commands, no change at all. The only
>> difference being the sequential write constraint which the drive honors by not
>> executing the queued write command out of order. But there are no constraint for
>> flush on SMR, so we get whatever the drive does, that is, totally drive dependent.
>>
>> I wonder if the issue may be with the particular AHCI chipset used for this test.
>>
>>>
>>> Also can you provide queue flags via the following command?
>>>
>>> 	cat /sys/kernel/debug/block/sdN/state
> 
> The state sysfs attribute was as follows:
> 
> SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> 
> It didn't change before and after the block/004 run.
> 
> 
>>>
>>> I understand flush request should be slower than normal write, however
>>> looks not true in this hardware.
>>
>> Probably due to the fact that since the writes are sequential, there is a lot of
>> drive internal optimization that can be done to minimize the cost of flush
>> (internal data streaming from cache to media, media-cache use, etc) That is true
>> for regular disks too. And all of this is highly dependent on the hardware
>> implementation.
> 
> This discussion tempted me to take closer look in the traces. And I noticed that
> number of flush commands issued is different with and without the patch.
> 
>                         | without patch | with patch
> ------------------------+---------------+------------
> block_getrq: rwbs=FWS   |      32640    |   32640
> block_rq_issue: rwbs=FF |      32101    |    7593
> 
> Without the patch, flush command is issued between two write commands. With the
> patch, some write commands are executed without flush between them.
> 
> I wonder how the requeue list position of flush command (head vs tail) changes
> the number of flush commands to issue.
> 
> Another weird thing is number of block_getrq traces of flush (rwds=FWS). It
> doubles number of writes (256 * 64 = 16320). I will chase this further.
> 
> 
>> Shinichiro,
>>
>> It may be worth trying the same run with & without Ming's patch on a machine
>> with a different chipset...
> 
> Thanks for the advice. So far, I observer the long block/004 runtime on two
> systems. One with Intel C220 Series SATA controller, and the other with Intel
> 200 Series PCH SATA controller. I will try to find other SATA controller.

I do not think it is necessary. I forgot that flush cache is a non-ncq command.
That is why queuecommand fails as you found out: it is waiting for all NCQ
commands to complete first. Nothing to do with the chipset type. This is per ATA
specs.

-- 
Damien Le Moal
Western Digital Research