From: John Garry <john.garry@huawei.com>
To: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block <linux-block@vger.kernel.org>,
	Bart Van Assche <bvanassche@acm.org>,
	"Hannes Reinecke" <hare@suse.com>, Christoph Hellwig <hch@lst.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Keith Busch <keith.busch@intel.com>
Subject: Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
Date: Fri, 11 Oct 2019 15:10:03 +0100	[thread overview]
Message-ID: <b1a561c1-9594-cc25-dcab-bad5c342264f@huawei.com> (raw)
In-Reply-To: <CACVXFVN2K-GYTdSwXZ2fZ9=Kgq+jXa3RCkqw+v_DcvaFBvgpew@mail.gmail.com>

On 11/10/2019 12:55, Ming Lei wrote:
> On Fri, Oct 11, 2019 at 4:54 PM John Garry <john.garry@huawei.com> wrote:
>>
>> On 10/10/2019 12:21, John Garry wrote:
>>>
>>>>
>>>> As discussed before, the tags of hisilicon V3 are HBA-wide. If you switch
>>>> to real hw queues, each hw queue has to own its own independent tags.
>>>> However, that isn't supported by the V3 hardware.
>>>
>>> I am generating the tags internally in the driver now, so the host-wide
>>> tags limitation should not be an issue.
>>>
>>> And, to be clear, I am not paying too much attention to performance, but
>>> rather just hotplugging while running IO.
>>>
>>> An update on testing:
>>> I did some scripted overnight testing. The script essentially loops like
>>> this:
>>> - online all CPUs
>>> - run fio bound to a limited set of CPUs covering one hctx's CPU mask
>>> for 1 minute
>>> - offline those CPUs
>>> - wait 1 minute (> SCSI or NVMe timeout)
>>> - and repeat
>>>
>>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
>>> that some fio processes never exit, with IOPS stuck at 0. I don't see
>>> any NVMe timeout reported. Did you do any NVMe testing of this sort?
>>>
>>
>> Yeah, so for NVMe, I see some sort of regression, like this:
>> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> 1158037877d:17h:18m:22s]
>
> I can reproduce this issue, and it looks like there are requests in ->dispatch.

OK, that may match with what I see:
- the problem occurring coincides with this call path being taken with
BLK_MQ_S_INTERNAL_STOPPED set:

blk_mq_request_bypass_insert
(__)blk_mq_try_issue_list_directly
blk_mq_sched_insert_requests
blk_mq_flush_plug_list
blk_flush_plug_list
blk_finish_plug
blkdev_direct_IO
generic_file_read_iter
blkdev_read_iter
aio_read
io_submit_one
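
For reference, the innermost frame above - blk_mq_request_bypass_insert() -
boils down to something like this (simplified from my reading of the current
tree, so treat it as a sketch rather than the exact code):

void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

	spin_lock(&hctx->lock);
	/* Park the request on the hctx dispatch list... */
	list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	/*
	 * ...and only a queue run ever takes it off again. If all of the
	 * hctx's mapped CPUs have just gone offline, that run may never
	 * happen.
	 */
	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}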

So blk_mq_request_bypass_insert() adds the request to the hctx dispatch
list, and looking at debugfs, could this be that request still sitting there:
root@(none)$ more /sys/kernel/debug/block/nvme0n1/hctx18/dispatch
00000000ac28511d {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, 
.tag=56, .internal_tag=-1}

So could there be some race here?
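
One way I could try to confirm that (purely a throwaway local debug hack on
my side, not something for the series) is a trace_printk() right before the
request is parked on ->dispatch, flagging any bypass insert that happens
while the hctx is already marked internally stopped:

/*
 * Hypothetical debug helper, called just before the request is added
 * to hctx->dispatch: log any bypass insert racing with the hctx being
 * marked BLK_MQ_S_INTERNAL_STOPPED.
 */
static void debug_bypass_insert_on_stopped_hctx(struct blk_mq_hw_ctx *hctx,
						struct request *rq)
{
	if (test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
		trace_printk("bypass insert on stopped hctx%u: tag=%d internal_tag=%d\n",
			     hctx->queue_num, rq->tag, rq->internal_tag);
}

If that fires right around the time a fio job gets stuck, it would at least
confirm the ordering between the insert and the hctx being stopped.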

> I am a bit busy this week; please feel free to investigate it, and debugfs
> can help you a lot. I may have time next week to look into this issue.
>

OK, appreciated

John

> Thanks,
> Ming Lei
>
>




Thread overview: 18+ messages
2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
2019-10-08  4:18 ` [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
2019-10-08  4:18 ` [PATCH V3 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2019-10-08  4:18 ` [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
2019-10-08 17:03   ` John Garry
2019-10-08  4:18 ` [PATCH V3 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
2019-10-08  4:18 ` [PATCH V3 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
2019-10-08  9:06 ` [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
2019-10-08 17:15   ` John Garry
2019-10-09  8:39     ` Ming Lei
2019-10-09  8:49       ` John Garry
2019-10-10 10:30         ` Ming Lei
2019-10-10 11:21           ` John Garry
2019-10-11  8:51             ` John Garry
2019-10-11 11:55               ` Ming Lei
2019-10-11 14:10                 ` John Garry [this message]
2019-10-14  1:25                   ` Ming Lei
2019-10-14  8:29                     ` John Garry
