From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@fb.com>
Cc: Sagi Grimberg <sagi@grimberg.me>, Long Li <longli@microsoft.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Nadolski Edmund <edmund.nadolski@intel.com>,
	Keith Busch <kbusch@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
Date: Sat, 23 Nov 2019 06:30:19 +0800
Message-ID: <20191122223019.GE8700@ming.t460p>
In-Reply-To: <9ef6c1da-99c5-14f8-edb7-af50c935ce76@fb.com>

On Fri, Nov 22, 2019 at 09:58:36PM +0000, Jens Axboe wrote:
> On 11/22/19 2:49 PM, Ming Lei wrote:
> > On Fri, Nov 22, 2019 at 02:04:52PM +0000, Jens Axboe wrote:
> >> On 11/22/19 3:25 AM, Ming Lei wrote:
> >>>> as that will still overload the one cpu that the interrupt handler was
> >>>> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
> >>>
> >>> Actually one CPU is fast enough to handle several drives' interrupt handling.
> >>> Also there is a per-queue depth limit, and the interrupt flood issue seen in
> >>> networking can't be as serious on storage.
> >>
> >> This is true today, but it won't be true in the future. Let's aim for a
> >> solution that's a little more future-proof than just "enough today", if
> >> we're going to make changes in this area.
> > 
> > That would be a new feature of future hardware, and we don't know any of its
> > performance details, so it is hard to prepare for it now. Maybe such
> > hardware or such a case never comes:
> 
> Oh it'll surely come, and maybe sooner than you think. My point is that
> using "one CPU is fast enough to handle several drive interrupts" is
> very shortsighted, and probably not even true today.

A single CPU being responsible for handling more than one drive's interrupts
should only happen when the following condition is true:

	nr_drives * nr_io_hw_queue > nr_cpus
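
For illustration, with made-up numbers: 8 drives, each exposing 32 I/O hw
queues, need

	8 drives * 32 hw queues = 256 interrupt vectors > 128 CPUs

so on such a box some CPUs necessarily end up servicing hw queues from more
than one drive.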

> 
> > - a storage device has a queue depth, which limits the max number of in-flight
> > requests to be handled in each queue's interrupt handler.
> 
> Only if new requests aren't also coming in and completing while you are
> doing that work.
> > 
> > - Suppose such fast hardware comes; it isn't reasonable for it to support an
> > N:1 mapping (with a big N).
> 
> Very true, in fact that's already pretty damn dumb today...

OK, I guess that is because lots of NVMe drives only support a limited number
of hw queues (e.g. 32).

> 
> > - Also, the IRQ matrix has balanced the interrupt handling load already; that
> > said, most of the time one CPU is only responsible for handling one hw queue's
> > interrupts. Even in Azure's case, 8 CPUs are mapped to one hw queue, but
> > only a few CPUs are responsible for at most 2 hw queues.

It also depends on how many drives are used in a single machine. The issue
is only possible when the number of drives is big enough, and I guess that
isn't unusual.

> > 
> > So could we focus on now and fix the regression first?
> 
> As far as I could tell from the other message, sounds like they both
> have broken interrupt coalescing? Makes it harder to care, honestly...

Yeah, I found two reports on two different drives, and both can be fixed by
this patch. I haven't seen other reports caused by too much interrupt load
on a single CPU. That is why I tried to avoid a generic approach...
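
BTW, for anyone skimming the thread, the rough shape of "check CQ after
batch submission" is something like the sketch below. This is NOT the
actual patch; nvme_write_sq_db() and nvme_process_cq() are used here as
stand-ins for the driver's helpers, and the locking is simplified:

/*
 * After ringing the SQ doorbell for a batch of requests, opportunistically
 * reap completions so the queue still makes progress even if the device
 * coalesces (or loses) interrupts.  Assumes <linux/blk-mq.h> and the
 * driver's struct nvme_queue (sq_lock, cq_poll_lock).
 */
static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct nvme_queue *nvmeq = hctx->driver_data;

	spin_lock(&nvmeq->sq_lock);
	nvme_write_sq_db(nvmeq);		/* submit the batch */
	spin_unlock(&nvmeq->sq_lock);

	/* Only one context needs to poll the CQ at a time. */
	if (spin_trylock(&nvmeq->cq_poll_lock)) {
		nvme_process_cq(nvmeq);
		spin_unlock(&nvmeq->cq_poll_lock);
	}
}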

> 
> But yes, I think we should do something about this. This really isn't a
> new issue: if a core gets overloaded just doing completions from
> interrupts, we should punt the work. NAPI has been doing that for ages,
> and the block layer also used to have support for it, but nobody used it.
> It would be a great idea to make a blk-mq-friendly version of that, with
> the kinds of IOPS and latencies in mind that we see today and in the
> coming years. I don't think hacking around this in the nvme driver is a
> very good way to go about it.

OK, I will look at this approach, and Sagi has posted one such patch.
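
Just to make sure we mean the same thing by "punt the work": a minimal
sketch using a threaded interrupt handler is below. This is not Sagi's
patch; NVME_IRQ_BUDGET, nvme_cqe_pending() and nvme_process_cq_budget()
are illustrative stand-ins for the driver's real per-queue helpers:

#include <linux/interrupt.h>

#define NVME_IRQ_BUDGET	64	/* arbitrary per-hard-IRQ completion budget */

struct nvme_queue;

/* Assumed helpers: reap up to @budget CQEs / check whether CQEs are pending. */
int nvme_process_cq_budget(struct nvme_queue *nvmeq, int budget);
bool nvme_cqe_pending(struct nvme_queue *nvmeq);

/* Hard handler: stay bounded, punt anything beyond the budget to the thread. */
static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	if (nvme_process_cq_budget(nvmeq, NVME_IRQ_BUDGET) < NVME_IRQ_BUDGET)
		return IRQ_HANDLED;

	return IRQ_WAKE_THREAD;
}

/* Threaded handler: runs in process context, so it can be preempted/migrated. */
static irqreturn_t nvme_irq_thread(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	while (nvme_cqe_pending(nvmeq))
		nvme_process_cq_budget(nvmeq, NVME_IRQ_BUDGET);

	return IRQ_HANDLED;
}

/* Registration from queue setup code. */
static int nvme_setup_cq_irq(struct nvme_queue *nvmeq, int irq)
{
	return request_threaded_irq(irq, nvme_irq, nvme_irq_thread,
				    IRQF_SHARED, "nvme", nvmeq);
}

The point is just that the hard handler stays bounded, and anything beyond
the budget runs in a schedulable context instead of hogging one CPU in
hard-IRQ context.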

thanks,
Ming

