From: Ming Lei <ming.lei@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@fb.com>,
	Long Li <longli@microsoft.com>, Christoph Hellwig <hch@lst.de>,
	linux-nvme@lists.infradead.org
Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
Date: Wed, 13 Nov 2019 10:47:53 +0800	[thread overview]
Message-ID: <20191113024753.GB28701@ming.t460p> (raw)
In-Reply-To: <4664ca6f-2ebb-c69c-5b7f-226a86394adf@grimberg.me>

On Tue, Nov 12, 2019 at 09:35:08AM -0800, Sagi Grimberg wrote:
> 
> > > > f9dde187fa92 ("nvme-pci: remove cq check after submission") removes
> > > > the cq check after submission. This change actually causes a
> > > > performance regression on some NVMe drives where a single nvmeq
> > > > handles requests originating from more than one blk-mq sw queue
> > > > (call it a multi-mapping queue).
> > > > 
> > > > Actually, polling for IO completions after submission can handle IO
> > > > more efficiently, especially for a multi-mapping queue:
> > > > 
> > > > 1) the poll itself is very cheap, and a lockless check on the cq is
> > > > enough, see nvme_cqe_pending(). In particular, the check can be done
> > > > after batch submission completes (see the sketch after this list).
> > > > 
> > > > 2) when an IO completion is observed via this poll during submission,
> > > > the request may be completed without any interrupt involved, or the
> > > > interrupt handler's load can be reduced.
> > > > 
> > > > 3) when a single sw queue is submitting IOs to this hw queue, if the
> > > > IO completion is observed via this poll, the IO can be completed
> > > > locally, which is cheaper than remote completion.
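
A minimal sketch of the idea in points 1)-3), with simplified stand-in
names (the real lockless check is nvme_cqe_pending() in
drivers/nvme/host/pci.c; the locking details of the actual patch may
differ, and endianness handling is elided here):

#include <linux/spinlock.h>
#include <linux/types.h>

/* Simplified stand-ins for the driver's per-queue CQ state. */
struct example_cqe {
	u16 status;			/* bit 0 carries the phase tag */
};

struct example_queue {
	spinlock_t		cq_poll_lock;
	struct example_cqe	*cqes;	/* completion queue ring */
	u16			cq_head;
	u8			cq_phase;
};

/* Point 1: a single lockless read tells whether anything is pending. */
static bool example_cqe_pending(struct example_queue *q)
{
	return (q->cqes[q->cq_head].status & 1) == q->cq_phase;
}

/*
 * Placeholder for reaping completions: advance cq_head, flip cq_phase on
 * wrap, complete requests locally (points 2 and 3), ring the CQ doorbell.
 */
static void example_reap_cq(struct example_queue *q)
{
}

/* Called once after a batch of submissions has been written to the SQ. */
static void example_check_cq_after_batch(struct example_queue *q)
{
	unsigned long flags;

	if (!example_cqe_pending(q))	/* cheap, lockless check */
		return;

	/* If the irq handler already owns the CQ, back off and let it reap. */
	if (!spin_trylock_irqsave(&q->cq_poll_lock, flags))
		return;

	example_reap_cq(q);
	spin_unlock_irqrestore(&q->cq_poll_lock, flags);
}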
> > > > 
> > > > The following are test results from an Azure L80sv2 guest with an
> > > > NVMe drive (Microsoft Corporation Device b111). This guest has 80
> > > > CPUs and 10 NUMA nodes, and each NVMe drive supports 8 hw queues.
> > > 
> > > I think that the cpu lockup is a different problem, and we should
> > > separate this patch from that problem..
> > 
> > Why?
> > 
> > Most CPU lockups are performance issues in essence. In theory, an
> > improvement in the IO path could alleviate the soft lockup.
> 
> I don't think it's a performance issue; being exposed to a stall in hard
> irq context is a fundamental issue. I don't see how this patch solves it.

As I mentioned, it is usually because the CPU's interrupt handling can't
keep up with the interrupt events from the hardware: either the device
generates interrupts too quickly, or the CPU isn't fast enough.

> 
> > > > 1) test script:
> > > > fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> > > > 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> > > > 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> > > > 	--name=test --group_reporting --gtod_reduce=1
> > > > 
> > > > 2) test result:
> > > >      | v5.3 | v5.3 with this patchset
> > > > -----+------+------------------------
> > > > IOPS | 130K | 424K
> > > > 
> > > > Given that IO is handled more efficiently this way, the originally
> > > > reported CPU lockup[1] on Hyper-V can no longer be observed after
> > > > this patch is applied. That issue is usually triggered when running
> > > > IO from all CPUs concurrently.
> > > > 
> > > 
> > > This is just adding code that we already removed but in a more
> > > convoluted way...
> > 
> > The commit that removed the code actually causes a regression for
> > Azure NVMe.
> 
> This issue has been observed long before we removed the polling from
> the submission path and the cq_lock split.
> 
> > > The correct place that should optimize the polling is aio/io_uring and
> > > not the driver locally IMO. Adding blk_poll to aio_getevents like
> > > io_uring would be a lot better I think..
> > 
> > This poll is actually a one-shot poll; I shouldn't have called it a
> > poll, it should have been called 'check cq'.
> > 
> > I believe supporting aio poll has been tried before, but it doesn't
> > seem to have been successful.
> 
> Is there a fundamental reason why it can work for io_uring and cannot
> work for aio?

It looks like Jens has answered you.

> 
> > > > I also ran the test on Optane (32 hw queues) on a big machine (96
> > > > cores, 2 NUMA nodes); a small improvement is observed when running
> > > > the above fio over two NVMe drives with batch 1.
> > > 
> > > Given that you add a shared lock and atomic ops in the data path, you
> > > are bound to hurt some latency-oriented workloads in some way.
> > 
> > spin_trylock_irqsave() is only called when nvme_cqe_pending() is true.
> > My test on Optane doesn't show that latency is hurt.
> 
> It is also conditional on the multi-mapping bit.
> 
> Can we know for a fact that this doesn't hurt whatsoever? If so, we
> should always do it, not conditionally. I would test this with
> io_uring test applications that are doing heavy polling. I think

io_uring uses dedicated poll queues, which don't generate irqs, so this
approach isn't necessary there since polling is already done.

> Jens had some benchmarks he used for how fast io_uring can go in
> a single cpu core...

So far I plan to implement it as a quirk for Azure's hardware, since the
problem there is that the NVMe implementation applies aggressive interrupt
coalescing.

That aggressive interrupt coalescing has already introduced high IO latency.
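
To make it concrete, here is a hypothetical sketch of what such a quirk
could look like (the quirk name is made up for illustration; the driver's
real quirk flags live in enum nvme_quirks in drivers/nvme/host/nvme.h, and
example_check_cq_after_batch() is the helper from the earlier sketch):

/* Hypothetical quirk bit, named for illustration only. */
#define EXAMPLE_QUIRK_POLL_AFTER_SUBMIT	(1 << 0)

struct example_ctrl {
	unsigned long quirks;		/* would be set from the PCI id table */
};

/*
 * Called at the end of queueing a batch of requests, after the SQ
 * doorbell write: only quirked devices get the extra CQ check.
 */
static void example_commit_batch(struct example_ctrl *ctrl,
				 struct example_queue *q)
{
	/* ... write the SQ tail doorbell for the whole batch ... */

	if (ctrl->quirks & EXAMPLE_QUIRK_POLL_AFTER_SUBMIT)
		example_check_cq_after_batch(q);
}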

> 
> > However, I just found that Azure's NVMe is a bit special: its
> > 'Interrupt Coalescing' feature register shows zero, but an IO interrupt
> > is often only triggered once many commands have been completed by the
> > drive.
> > 
> > For example, in a fio test (4k, randread aio, single job), when IOPS is
> > 110K, interrupts per second are just 13~14K. When running heavy IO, the
> > interrupts per second reach 40~50K at most. And for a normal nvme
> > drive, if 'Interrupt Coalescing' isn't used, one interrupt completes
> > just one request most of the time in the random IO test.
> > 
> > That said, Azure's implementation must apply aggressive interrupt
> > coalescing even though the register doesn't claim it.
> 
> Did you check how many completions are reaped per interrupt?

In the single-job test, on average about 8 completions per interrupt can
be observed, since IOPS is ~110K while interrupts are only 13~14K per
second (110K / 13.5K is roughly 8).
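
By the way, the 'Interrupt Coalescing' feature (FID 08h) mentioned above
can be queried from user space with nvme-cli ('nvme get-feature
/dev/nvme0 -f 0x08') or with a small program along these lines (a minimal
sketch, error handling kept short; field layout per the NVMe spec):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(int argc, char **argv)
{
	struct nvme_admin_cmd cmd;
	int fd, err;

	fd = open(argc > 1 ? argv[1] : "/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x0a;	/* Get Features */
	cmd.cdw10  = 0x08;	/* FID 08h: Interrupt Coalescing, current value */

	err = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (err) {
		fprintf(stderr, "get-feature failed: %d\n", err);
		return 1;
	}

	/*
	 * Completion dword 0: bits 7:0 aggregation threshold (entries),
	 * bits 15:8 aggregation time (100us units); on the Azure drive
	 * above both read back as zero despite the observed coalescing.
	 */
	printf("aggregation threshold=%u, aggregation time=%u\n",
	       cmd.result & 0xff, (cmd.result >> 8) & 0xff);
	return 0;
}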

> 
> > That seems to be the root cause of the soft lockup on Azure, since
> > lots of requests may be handled in one interrupt event, especially
> > when the interrupt event is handled late by the CPU. It also explains
> > why this patch improves Azure NVMe so much in single-job fio.
> > 
> > But for other drives with N:1 mapping, the soft lockup risk still exists.
> 
> As I said, we can discuss this as an optimization, but we should not
> consider this as a solution to the irq-stall issue reported on Azure as
> we agree that it doesn't solve the fundamental problem.

Azure's soft lockup is special: it is really caused by aggressive
interrupt coalescing, and it has been verified that the patch fixes it
while also improving single-job IOPS considerably.

We still need to understand the real reason behind the other soft lockup
reports. I saw two such RH reports on real hardware, but haven't had a
chance to investigate them yet.


Thanks 
Ming



