From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>, Long Li <longli@microsoft.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Jens Axboe <axboe@fb.com>, Keith Busch <kbusch@kernel.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
Date: Wed, 13 Nov 2019 11:05:20 +0800
Message-ID: <20191113030520.GC28701@ming.t460p>
In-Reply-To: <f69d4e4c-3d6e-74c0-ed97-cac3c6b230c2@suse.de>

On Tue, Nov 12, 2019 at 06:29:34PM +0100, Hannes Reinecke wrote:
> On 11/12/19 5:49 PM, Keith Busch wrote:
> > On Tue, Nov 12, 2019 at 05:25:59PM +0100, Hannes Reinecke wrote:
> > > (Nitpick: what happens with the interrupt if we have a mask of
> > > several CPUs? Will the interrupt be delivered to one CPU?
> > > To all in the mask?
> > 
> > The hard-interrupt will be delivered to effectively one of the CPUs in the
> > mask. The one that is selected is determined when the IRQ is allocated,
> > and it should try to select one from the mask that is least used (see
> > matrix_find_best_cpu_managed()).
> > 
> Yeah, just as I thought.
> Which also means that we need to redirect the irq to a non-busy cpu to avoid
> stalls under high load.
> Especially if we have several NVMes to deal with.

The IRQ matrix tries its best to assign a different effective CPU to each
vector for handling interrupts.

In theory, if (nr_nvme_drives * nr_nvme_hw_queues) < nr_cpu_cores, each
hw queue may be assigned its own effective CPU for handling that queue's
interrupt. Otherwise, one CPU may be responsible for handling interrupts
from more than one drive's queues. But that is just in theory; for
example, the irq matrix also has to account for each drive's admin queue
interrupt alongside the managed IRQs.
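
For reference, the effective CPU actually chosen for each vector can be
observed from procfs. Below is a small hypothetical sketch (not part of
this series), assuming the kernel exposes effective_affinity, i.e.
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK:

/*
 * Hypothetical userspace helper: print the spread mask and the
 * effective (single) CPU the irq matrix picked for one IRQ number,
 * e.g. an nvmeXqY vector taken from /proc/interrupts.
 */
#include <stdio.h>

static void show(const char *irq, const char *file)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%s/%s", irq, file);
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("%-24s %s", file, buf);
	else
		printf("%-24s <unavailable>\n", file);
	if (f)
		fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <irq-number>\n", argv[0]);
		return 1;
	}
	show(argv[1], "smp_affinity_list");		/* full managed mask */
	show(argv[1], "effective_affinity_list");	/* CPU taking the hard irq */
	return 0;
}

Running it for every nvme vector listed in /proc/interrupts shows
directly whether two hw queues ended up sharing one effective CPU.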

On Azure there are such cases, yet the soft lockup still can't be
triggered after applying the CQ check in the submission path. That means
one CPU is enough to handle two hw queues' interrupts in this case.
Again, it depends on both the CPU and the NVMe drive.
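
For clarity, the "CQ check in the submission path" above means roughly
the following (a simplified, hypothetical sketch, not the actual patch;
cq_poll_lock is the lock set up in patch 1/2, and nvme_reap_cq() merely
stands in for the driver's CQ processing helper, whose signature has
changed across kernel versions):

/* Called after queueing a batch of requests on a multi-mapping queue. */
static void nvme_poll_cq_after_submit(struct nvme_queue *nvmeq)
{
	/* Only one context should reap the CQ at a time; never spin here. */
	if (!spin_trylock(&nvmeq->cq_poll_lock))
		return;

	/*
	 * Consume whatever completions have already arrived, so fewer
	 * entries are left for the hard interrupt to handle on the
	 * queue's single effective CPU.
	 */
	nvme_reap_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);
}

The point is that completion processing is spread over the submitting
CPUs instead of all landing on the one effective CPU of the queue's
interrupt.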

For networking, a packet flood can arrive at any time and is unbounded,
while the number of in-flight storage requests is always limited, so the
situation should be much better for storage IO than for networking, where
NAPI can be used to avoid this kind of issue.

> 
> > > Can't we implement blk_poll? Or maybe even threaded interrupts?
> > 
> > Threaded interrupts sound good. Currently, though, threaded interrupts
> > execute only on the same cpu as the hard irq. There was a proposal here to
> > change that to use any CPU in the mask, and I still think it makes sense
> > 
> >    http://lists.infradead.org/pipermail/linux-nvme/2019-August/026628.html
> > 
> That looks like just the ticket.
> In combination with threaded irqs and possibly blk_poll to avoid irq storms
> we should be good.

Threaded irqs can't help Azure's performance, because Azure's nvme
implementation applies aggressive interrupt coalescing.
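
"Interrupt coalescing" here refers to the standard NVMe feature
(Feature Identifier 08h). A hypothetical userspace sketch such as the
one below reads the current coalescing setting through the admin
passthrough ioctl; whether Azure's virtualized NVMe actually exposes or
tunes it this way is not something claimed here:

/*
 * Hypothetical sketch: issue Get Features (opcode 0x0a) for FID 08h,
 * Interrupt Coalescing, via NVME_IOCTL_ADMIN_CMD.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(int argc, char **argv)
{
	struct nvme_admin_cmd cmd;
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/nvme0\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x0a;	/* Get Features */
	cmd.cdw10  = 0x08;	/* FID 08h: Interrupt Coalescing, current value */

	ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (ret) {
		fprintf(stderr, "Get Features failed: %d\n", ret);
		return 1;
	}

	/* Per the NVMe spec, completion dword 0 holds: bits 7:0 aggregation
	 * threshold (0's based), bits 15:8 aggregation time (100us units). */
	printf("aggregation threshold (0's based): %u\n", cmd.result & 0xff);
	printf("aggregation time (100us units):    %u\n", (cmd.result >> 8) & 0xff);
	return 0;
}

With aggressive coalescing on the device side, the hard interrupt fires
late and batched no matter how it is handled on the host, which is why
moving the handler into a thread alone doesn't change the latency picture.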

Thanks, 
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
