From: Hannes Reinecke <hare@suse.de>
To: Ming Lei <ming.lei@redhat.com>, Long Li <longli@microsoft.com>
Cc: Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@fb.com>,
	Christoph Hellwig <hch@lst.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Sagi Grimberg <sagi@grimberg.me>
Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
Date: Tue, 12 Nov 2019 17:25:59 +0100	[thread overview]
Message-ID: <8198fd99-6b47-7594-ba1c-4a15ffe25269@suse.de> (raw)
In-Reply-To: <20191112023920.GD15079@ming.t460p>

On 11/12/19 3:39 AM, Ming Lei wrote:
> On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
>>> From: Christoph Hellwig <hch@lst.de>
>>> Sent: Monday, November 11, 2019 12:45 PM
>>> To: Ming Lei <ming.lei@redhat.com>
>>> Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>; Jens
>>> Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi Grimberg
>>> <sagi@grimberg.me>; Long Li <longli@microsoft.com>
>>> Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>>> mapping queue
>>>
>>> On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>>>> f9dde187fa92("nvme-pci: remove cq check after submission") removes cq
>>>> check after submission, this change actually causes performance
>>>> regression on some NVMe drive in which single nvmeq handles requests
>>>> originated from more than one blk-mq sw queues(call it multi-mapping
>>>> queue).
>>>
>>>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>>>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10 numa
>>>> nodes, and each NVMe drive supports 8 hw queues.
>>>
>>> Have you actually seen this on a real nvme drive as well?
>>>
>>> Note that it is kinda silly to limit queues like that in VMs, so I really don't think
>>> we should optimize the driver for this particular case.
>>
>> I tested on an Azure L80s_v2 VM with a newer Samsung P983 NVMe SSD (with 32 hardware queues). Tests also showed a soft lockup when the 32 queues are shared by 80 CPUs.
>>
> 
> BTW, do you see if this simple change makes a difference?
> 
>> The issue will likely show up if the number of NVMe hardware queues is less than the number of CPUs. I think this is a likely configuration on a very large system (e.g. the largest VM on Azure has 416 cores).
>>
> 
> 'The number of NVMe hardware queues' above should be the number of hardware queues of a single NVMe drive.
> I believe 32 hw queues is common, and poll queues may take several of the total 32.
> When interrupt handling on a single CPU core can't catch up with the drive's IO completions,
> a soft lockup can be triggered. Of course, Linux supports lots of different kinds of processors.
> 
But then we should rather work on eliminating the soft lockup itself.
Switching to polling for completions on the same CPU isn't going to
help; you just stall all other NVMe devices which might be waiting for
interrupts arriving on this CPU.
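
(To make sure we are talking about the same thing: my reading of the
approach under discussion is roughly the sketch below. This is purely
illustrative and not the actual patch; nvme_process_cq() is the real
completion-reaping helper in pci.c, though its exact signature differs
between kernel versions, and the wrapper name here is made up.)

	/*
	 * Sketch: after ringing the SQ doorbell for a batch of requests,
	 * opportunistically reap whatever CQEs are already available,
	 * right here on the submitting CPU, instead of waiting for the
	 * completion interrupt.  The trylock avoids spinning against the
	 * interrupt handler when it is already reaping the same CQ.
	 */
	static void nvme_poll_after_batch_submit(struct nvme_queue *nvmeq)
	{
		if (!spin_trylock(&nvmeq->cq_poll_lock))
			return;
		nvme_process_cq(nvmeq);
		spin_unlock(&nvmeq->cq_poll_lock);
	}

The point being: this runs in submission context, on whatever CPU
happens to issue the IO.
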
(Nitpick: what does happen with the interrupt if we have a mask of
several CPUs? Will the interrupt be delivered to just one CPU?
To all in the mask? And if the latter, how do the other CPU cores
notice that one is already working on that interrupt? Questions ...)

Can't we implement blk_poll? Or maybe even threaded interrupts?
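
(For the threaded-interrupt variant: nvme-pci already has the
use_threaded_interrupts module parameter, and the sketch below is
roughly that pattern, with simplified handler bodies and a made-up
registration snippet, so the exact signatures may not match any
particular kernel version.)

	/*
	 * Hard handler: runs with interrupts disabled, only checks
	 * whether there is work pending and wakes the irq thread.
	 */
	static irqreturn_t nvme_irq_check(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;

		if (nvme_cqe_pending(nvmeq))
			return IRQ_WAKE_THREAD;
		return IRQ_NONE;
	}

	/*
	 * Threaded handler: runs in process context with interrupts
	 * enabled, so it can be scheduled and preempted; a long burst of
	 * completions no longer pins the CPU in hard-irq context and
	 * should not trip the soft-lockup watchdog.
	 */
	static irqreturn_t nvme_irq_thread(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;

		nvme_process_cq(nvmeq);
		return IRQ_HANDLED;
	}

	/* registered with request_threaded_irq() instead of request_irq() */
	ret = request_threaded_irq(pci_irq_vector(pdev, nvmeq->cq_vector),
				   nvme_irq_check, nvme_irq_thread,
				   IRQF_SHARED, "nvme", nvmeq);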

> Also, when (nr_nvme_drives * nr_nvme_hw_queues) > nr_cpu_cores, the same CPU
> can be assigned to handle more than one nvme IO queue interrupt from different
> NVMe drives, and the situation becomes worse.
> 
That is arguably bad; especially so as we're doing automatic interrupt
affinity.
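
(To put rough numbers on it, using figures from this thread: with 80
CPU cores and drives exposing 32 IO queues each, managed affinity
spreads 32 vectors per drive across the 80 cores, so three such drives
already give 96 > 80 queue interrupts and some cores must serve queues
of different drives.  And with only 8 IO queues per drive, 80 / 8 = 10
CPUs feed each hw queue, i.e. a single core's interrupt handler ends
up reaping completions for IO submitted by ten cores.)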

-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer


Thread overview: 28+ messages
2019-11-08  3:55 [PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission Ming Lei
2019-11-08  3:55 ` [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
2019-11-08  4:12   ` Keith Busch
2019-11-08  7:09     ` Ming Lei
2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
2019-11-11 20:44   ` Christoph Hellwig
2019-11-12  0:33     ` Long Li
2019-11-12  1:35       ` Sagi Grimberg
2019-11-12  2:39       ` Ming Lei
2019-11-12 16:25         ` Hannes Reinecke [this message]
2019-11-12 16:49           ` Keith Busch
2019-11-12 17:29             ` Hannes Reinecke
2019-11-13  3:05               ` Ming Lei
2019-11-13  3:17                 ` Keith Busch
2019-11-13  3:57                   ` Ming Lei
2019-11-12 21:20         ` Long Li
2019-11-12 21:36           ` Keith Busch
2019-11-13  0:50             ` Long Li
2019-11-13  2:24           ` Ming Lei
2019-11-12  2:07     ` Ming Lei
2019-11-12  1:44   ` Sagi Grimberg
2019-11-12  9:56     ` Ming Lei
2019-11-12 17:35       ` Sagi Grimberg
2019-11-12 21:17         ` Long Li
2019-11-12 23:44         ` Jens Axboe
2019-11-13  2:47         ` Ming Lei
2019-11-12 18:11   ` Nadolski, Edmund
2019-11-13 13:46     ` Ming Lei
