[PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission
@ 2019-11-08  3:55 Ming Lei
  2019-11-08  3:55 ` [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
  2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
  0 siblings, 2 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-08  3:55 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Keith Busch,
	Christoph Hellwig

Hi,

In case that single nvme queue is mapped from multiple blk-mq sw queue,
this patchset improves IO handling by polling after batch submission.

Turns out that IOPS is improved much in single job aio test.

Also it is observed that the CPU lockup issue in Hyper V guest can't
be triggered any more after this patch is applied.

Thanks,

Ming Lei (2):
  nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  nvme-pci: poll IO after batch submission for multi-mapping queue

 drivers/nvme/host/pci.c | 93 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 82 insertions(+), 11 deletions(-)

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>

-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  2019-11-08  3:55 [PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission Ming Lei
@ 2019-11-08  3:55 ` Ming Lei
  2019-11-08  4:12   ` Keith Busch
  2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-11-08  3:55 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Keith Busch,
	Christoph Hellwig

Prepare for adding new cq lock for improving IO performance in case
of multi-mapping queue, and the new added cq_lock can share space with
cq_poll_lock.

Move sq/cq_poll lock initialization into nvme_init_queue so that
we may initialize the added cq_lock correctly and lockdep warning
can be avoided.

This way is safe because no IO can be started until nvme_init_queue
is returned.

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bb88681f4dc3..5b20ab4d21d3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1480,8 +1480,6 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 		goto free_cqdma;
 
 	nvmeq->dev = dev;
-	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_poll_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
@@ -1515,6 +1513,9 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 
+	spin_lock_init(&nvmeq->sq_lock);
+	spin_lock_init(&nvmeq->cq_poll_lock);
+
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
 	nvmeq->cq_head = 0;
-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-08  3:55 [PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission Ming Lei
  2019-11-08  3:55 ` [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
@ 2019-11-08  3:55 ` Ming Lei
  2019-11-11 20:44   ` Christoph Hellwig
                     ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-08  3:55 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Keith Busch,
	Christoph Hellwig

f9dde187fa92("nvme-pci: remove cq check after submission") removes
cq check after submission, this change actually causes performance
regression on some NVMe drive in which single nvmeq handles requests
originated from more than one blk-mq sw queues(call it multi-mapping
queue).

Actually polling IO after submission can handle IO more efficiently,
especially for multi-mapping queue:

1) the poll itself is very cheap, and lockless check on cq is enough,
see nvme_cqe_pending(). Especially the check can be done after batch
submission is done.

2) when IO completion is observed via the poll in submission, the requst
may be completed without interrupt involved, or the interrupt handler
overload can be decreased.

3) when single sw queue is submiting IOs to this hw queue, if IOs completion
is observed via this poll, the IO can be completed locally, which is
cheaper than remote completion.

Follows test result done on Azure L80sv2 guest with NVMe drive(
Microsoft Corporation Device b111). This guest has 80 CPUs and 10
numa nodes, and each NVMe drive supports 8 hw queues.

1) test script:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
	--name=test --group_reporting --gtod_reduce=1

2) test result:
     | v5.3       | v5.3 with this patchset
-------------------------------------------
IOPS | 130K       | 424K

Given IO is handled more efficiently in this way, the original report
of CPU lockup[1] on Hyper-V can't be observed any more after this patch
is applied. This issue is usually triggered when running IO from all
CPUs concurrently.

I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
nodes), small improvement is observed on running the above fio over two
NVMe drive with batch 1.

[1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Fixes: f9dde187fa92("nvme-pci: remove cq check after submission")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 90 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 80 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b20ab4d21d3..880376f43897 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -83,6 +83,7 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
+static void nvme_poll_in_submission(struct nvme_queue *nvmeq);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -165,7 +166,10 @@ struct nvme_queue {
 	spinlock_t sq_lock;
 	void *sq_cmds;
 	 /* only used for poll queues: */
-	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
+	union {
+		spinlock_t cq_poll_lock;
+		spinlock_t cq_lock;
+	}____cacheline_aligned_in_smp;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -185,6 +189,7 @@ struct nvme_queue {
 #define NVMEQ_SQ_CMB		1
 #define NVMEQ_DELETE_ERROR	2
 #define NVMEQ_POLLED		3
+#define NVMEQ_MULTI_MAPPING	4
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -911,6 +916,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(req);
 	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+
+	if (bd->last)
+		nvme_poll_in_submission(nvmeq);
+
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -1016,6 +1025,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static inline irqreturn_t
+nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
+{
+	irqreturn_t ret = IRQ_NONE;
+
+	if (nvmeq->cq_head != nvmeq->last_cq_head)
+		ret = IRQ_HANDLED;
+	nvme_process_cq(nvmeq, start, end, -1);
+	nvmeq->last_cq_head = nvmeq->cq_head;
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1027,10 +1049,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	 * the irq handler, even if that was on another CPU.
 	 */
 	rmb();
-	if (nvmeq->cq_head != nvmeq->last_cq_head)
-		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
-	nvmeq->last_cq_head = nvmeq->cq_head;
+	ret = nvme_update_cq(nvmeq, &start, &end);
 	wmb();
 
 	if (start != end) {
@@ -1041,6 +1060,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_multi_mapping(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	irqreturn_t ret = IRQ_NONE;
+	u16 start, end;
+
+	spin_lock(&nvmeq->cq_lock);
+	ret = nvme_update_cq(nvmeq, &start, &end);
+	spin_unlock(&nvmeq->cq_lock);
+
+	if (start != end) {
+		nvme_complete_cqes(nvmeq, start, end);
+		return IRQ_HANDLED;
+	}
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1049,6 +1086,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
+static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
+{
+	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) &&
+			nvme_cqe_pending(nvmeq)) {
+		unsigned long flags;
+
+		if (spin_trylock_irqsave(&nvmeq->cq_lock, flags)) {
+			u16 start, end;
+
+			nvme_update_cq(nvmeq, &start, &end);
+			spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
+
+			if (start != end)
+				nvme_complete_cqes(nvmeq, start, end);
+		}
+	}
+}
+
 /*
  * Poll for completions any queue, including those not dedicated to polling.
  * Can be called from any context.
@@ -1499,12 +1554,14 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 {
 	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
 	int nr = nvmeq->dev->ctrl.instance;
+	irq_handler_t handler = test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) ?
+		nvme_irq_multi_mapping : nvme_irq;
 
 	if (use_threaded_interrupts) {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
+		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	}
 }
@@ -1514,7 +1571,13 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	struct nvme_dev *dev = nvmeq->dev;
 
 	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_poll_lock);
+
+	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags)) {
+		WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
+		spin_lock_init(&nvmeq->cq_lock);
+	} else {
+		spin_lock_init(&nvmeq->cq_poll_lock);
+	}
 
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
@@ -1534,15 +1597,22 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	u16 vector = 0;
 
 	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
+	clear_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
 
 	/*
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	if (!polled)
+	if (!polled) {
+		struct pci_dev *pdev = to_pci_dev(dev->dev);
+
 		vector = dev->num_vecs == 1 ? 0 : qid;
-	else
+		if (vector && cpumask_weight(pci_irq_get_affinity(pdev,
+						vector)) > 1)
+			set_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
+	} else {
 		set_bit(NVMEQ_POLLED, &nvmeq->flags);
+	}
 
 	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
 	if (result)
-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  2019-11-08  3:55 ` [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
@ 2019-11-08  4:12   ` Keith Busch
  2019-11-08  7:09     ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-11-08  4:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Long Li, Christoph Hellwig, linux-nvme, Sagi Grimberg

On Fri, Nov 08, 2019 at 11:55:07AM +0800, Ming Lei wrote:
> Prepare for adding new cq lock for improving IO performance in case
> of multi-mapping queue, and the new added cq_lock can share space with
> cq_poll_lock.
> 
> Move sq/cq_poll lock initialization into nvme_init_queue so that
> we may initialize the added cq_lock correctly and lockdep warning
> can be avoided.
> 
> This way is safe because no IO can be started until nvme_init_queue
> is returned.

Yes, but we call nvme_init_queue() on every reset. I think it looks it's
okay to reinit a lock as long as it's not held, but it really shouldn't
be necessary, right?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  2019-11-08  4:12   ` Keith Busch
@ 2019-11-08  7:09     ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-08  7:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Long Li, Christoph Hellwig, linux-nvme, Sagi Grimberg

On Fri, Nov 08, 2019 at 01:12:51PM +0900, Keith Busch wrote:
> On Fri, Nov 08, 2019 at 11:55:07AM +0800, Ming Lei wrote:
> > Prepare for adding new cq lock for improving IO performance in case
> > of multi-mapping queue, and the new added cq_lock can share space with
> > cq_poll_lock.
> > 
> > Move sq/cq_poll lock initialization into nvme_init_queue so that
> > we may initialize the added cq_lock correctly and lockdep warning
> > can be avoided.
> > 
> > This way is safe because no IO can be started until nvme_init_queue
> > is returned.
> 
> Yes, but we call nvme_init_queue() on every reset. I think it looks it's
> okay to reinit a lock as long as it's not held, but it really shouldn't
> be necessary, right?

Yeah, however, this patch doesn't change the current behavior.

Looks not a big deal given the queue is really quiesced during
nvme_init_queue().

Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
@ 2019-11-11 20:44   ` Christoph Hellwig
  2019-11-12  0:33     ` Long Li
  2019-11-12  2:07     ` Ming Lei
  2019-11-12  1:44   ` Sagi Grimberg
  2019-11-12 18:11   ` Nadolski, Edmund
  2 siblings, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2019-11-11 20:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Keith Busch,
	Christoph Hellwig

On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
> f9dde187fa92("nvme-pci: remove cq check after submission") removes
> cq check after submission, this change actually causes performance
> regression on some NVMe drive in which single nvmeq handles requests
> originated from more than one blk-mq sw queues(call it multi-mapping
> queue).

> Follows test result done on Azure L80sv2 guest with NVMe drive(
> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> numa nodes, and each NVMe drive supports 8 hw queues.

Have you actually seen this on a real nvme drive as well?

Note that it is kinda silly to limit queues like that in VMs, so I
really don't think we should optimize the driver for this particular
case.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-11 20:44   ` Christoph Hellwig
@ 2019-11-12  0:33     ` Long Li
  2019-11-12  1:35       ` Sagi Grimberg
  2019-11-12  2:39       ` Ming Lei
  2019-11-12  2:07     ` Ming Lei
  1 sibling, 2 replies; 28+ messages in thread
From: Long Li @ 2019-11-12  0:33 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme

>From: Christoph Hellwig <hch@lst.de>
>Sent: Monday, November 11, 2019 12:45 PM
>To: Ming Lei <ming.lei@redhat.com>
>Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>; Jens
>Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi Grimberg
><sagi@grimberg.me>; Long Li <longli@microsoft.com>
>Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>mapping queue
>
>On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>> f9dde187fa92("nvme-pci: remove cq check after submission") removes cq
>> check after submission, this change actually causes performance
>> regression on some NVMe drive in which single nvmeq handles requests
>> originated from more than one blk-mq sw queues(call it multi-mapping
>> queue).
>
>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10 numa
>> nodes, and each NVMe drive supports 8 hw queues.
>
>Have you actually seen this on a real nvme drive as well?
>
>Note that it is kinda silly to limit queues like that in VMs, so I really don't think
>we should optimize the driver for this particular case.

I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD (with 32 hardware queues). Tests also showed soft lockup when 32 queues are shared by 80 CPUs. 

The issue will likely show up if the number of NVMe hardware queues is less than the number of CPUs. I think this may be a likely configuration on a very large system. (e.g. the largest VM on Azure has 416 cores)

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  0:33     ` Long Li
@ 2019-11-12  1:35       ` Sagi Grimberg
  2019-11-12  2:39       ` Ming Lei
  1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2019-11-12  1:35 UTC (permalink / raw)
  To: Long Li, Christoph Hellwig, Ming Lei; +Cc: Keith Busch, Jens Axboe, linux-nvme


>> Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>> mapping queue
>>
>> On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>>> f9dde187fa92("nvme-pci: remove cq check after submission") removes cq
>>> check after submission, this change actually causes performance
>>> regression on some NVMe drive in which single nvmeq handles requests
>>> originated from more than one blk-mq sw queues(call it multi-mapping
>>> queue).
>>
>>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10 numa
>>> nodes, and each NVMe drive supports 8 hw queues.
>>
>> Have you actually seen this on a real nvme drive as well?
>>
>> Note that it is kinda silly to limit queues like that in VMs, so I really don't think
>> we should optimize the driver for this particular case.
> 
> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD (with 32 hardware queues). Tests also showed soft lockup when 32 queues are shared by 80 CPUs.
> 
> The issue will likely show up if the number of NVMe hardware queues is less than the number of CPUs. I think this may be a likely configuration on a very large system. (e.g. the largest VM on Azure has 416 cores)

This makes sense. As long as there is a cpu core that keeps feeding the
sq nothing prevents from the irq handler to run forever...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
  2019-11-11 20:44   ` Christoph Hellwig
@ 2019-11-12  1:44   ` Sagi Grimberg
  2019-11-12  9:56     ` Ming Lei
  2019-11-12 18:11   ` Nadolski, Edmund
  2 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2019-11-12  1:44 UTC (permalink / raw)
  To: Ming Lei, linux-nvme; +Cc: Keith Busch, Jens Axboe, Long Li, Christoph Hellwig


> f9dde187fa92("nvme-pci: remove cq check after submission") removes
> cq check after submission, this change actually causes performance
> regression on some NVMe drive in which single nvmeq handles requests
> originated from more than one blk-mq sw queues(call it multi-mapping
> queue).
> 
> Actually polling IO after submission can handle IO more efficiently,
> especially for multi-mapping queue:
> 
> 1) the poll itself is very cheap, and lockless check on cq is enough,
> see nvme_cqe_pending(). Especially the check can be done after batch
> submission is done.
> 
> 2) when IO completion is observed via the poll in submission, the requst
> may be completed without interrupt involved, or the interrupt handler
> overload can be decreased.
> 
> 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
> is observed via this poll, the IO can be completed locally, which is
> cheaper than remote completion.
> 
> Follows test result done on Azure L80sv2 guest with NVMe drive(
> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> numa nodes, and each NVMe drive supports 8 hw queues.

I think that the cpu lockup is a different problem, and we should
separate this patch from that problem..

> 
> 1) test script:
> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> 	--name=test --group_reporting --gtod_reduce=1
> 
> 2) test result:
>       | v5.3       | v5.3 with this patchset
> -------------------------------------------
> IOPS | 130K       | 424K
> 
> Given IO is handled more efficiently in this way, the original report
> of CPU lockup[1] on Hyper-V can't be observed any more after this patch
> is applied. This issue is usually triggered when running IO from all
> CPUs concurrently.
> 

This is just adding code that we already removed but in a more
convoluted way...

The correct place that should optimize the polling is aio/io_uring and
not the driver locally IMO. Adding blk_poll to aio_getevents like
io_uring would be a lot better I think..

> I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
> nodes), small improvement is observed on running the above fio over two
> NVMe drive with batch 1.

Given that you add shared lock and atomic ops in the data path you are
bound to hurt some latency oriented workloads in some way..

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-11 20:44   ` Christoph Hellwig
  2019-11-12  0:33     ` Long Li
@ 2019-11-12  2:07     ` Ming Lei
  1 sibling, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-12  2:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Long Li, Sagi Grimberg, linux-nvme

On Mon, Nov 11, 2019 at 09:44:46PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
> > f9dde187fa92("nvme-pci: remove cq check after submission") removes
> > cq check after submission, this change actually causes performance
> > regression on some NVMe drive in which single nvmeq handles requests
> > originated from more than one blk-mq sw queues(call it multi-mapping
> > queue).
> 
> > Follows test result done on Azure L80sv2 guest with NVMe drive(
> > Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> > numa nodes, and each NVMe drive supports 8 hw queues.
> 
> Have you actually seen this on a real nvme drive as well?
> 
> Note that it is kinda silly to limit queues like that in VMs, so I
> really don't think we should optimize the driver for this particular
> case.
> 

When I saw the report at first glance, I had same idea with you,
however recently I got 3 such report, in which two of them only have 8
hw queues, one is Azure, another is on real server. Both are deployed
massively in production environment.

Azure's NVMe drive should be real, and I guess it is PCI pass-through,
given its IOPS can reach >400K in single fio job, which is actually good
enough compared with real nvme drive.

Wrt. limit queues, actually it is a bit common, since I saw at least two
Intel NVMes(P3700, and Optane) limits queue count as 32. When these
NVMe are used in big machines, the soft lockup issue could be triggered,
especially more nvmes are installed and the system uses a bit slow
processor, or memory.

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  0:33     ` Long Li
  2019-11-12  1:35       ` Sagi Grimberg
@ 2019-11-12  2:39       ` Ming Lei
  2019-11-12 16:25         ` Hannes Reinecke
  2019-11-12 21:20         ` Long Li
  1 sibling, 2 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-12  2:39 UTC (permalink / raw)
  To: Long Li
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
> >From: Christoph Hellwig <hch@lst.de>
> >Sent: Monday, November 11, 2019 12:45 PM
> >To: Ming Lei <ming.lei@redhat.com>
> >Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>; Jens
> >Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi Grimberg
> ><sagi@grimberg.me>; Long Li <longli@microsoft.com>
> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
> >mapping queue
> >
> >On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
> >> f9dde187fa92("nvme-pci: remove cq check after submission") removes cq
> >> check after submission, this change actually causes performance
> >> regression on some NVMe drive in which single nvmeq handles requests
> >> originated from more than one blk-mq sw queues(call it multi-mapping
> >> queue).
> >
> >> Follows test result done on Azure L80sv2 guest with NVMe drive(
> >> Microsoft Corporation Device b111). This guest has 80 CPUs and 10 numa
> >> nodes, and each NVMe drive supports 8 hw queues.
> >
> >Have you actually seen this on a real nvme drive as well?
> >
> >Note that it is kinda silly to limit queues like that in VMs, so I really don't think
> >we should optimize the driver for this particular case.
> 
> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD (with 32 hardware queues). Tests also showed soft lockup when 32 queues are shared by 80 CPUs. 
> 

BTW, do you see if this simple change makes a difference?

> The issue will likely show up if the number of NVMe hardware queues is less than the number of CPUs. I think this may be a likely configuration on a very large system. (e.g. the largest VM on Azure has 416 cores)
> 

'the number of NVMe hardware queues' above should be the number of single NVMe drive.
I believe 32 hw queues is common, also poll queues may take several from the total 32.
When interrupt handling on single CPU core can't catch up with NVMe's IO handling,
soft lockup could be triggered. Of course, there are lot kinds of supported processors
by Linux.

Also when (nr_nvme_drives * nr_nvme_hw_queues) > nr_cpu_cores, one same CPU
can be assigned to handle more than 1 nvme IO queue interrupt from different
NVMe drive, the situation becomes worse.


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  1:44   ` Sagi Grimberg
@ 2019-11-12  9:56     ` Ming Lei
  2019-11-12 17:35       ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-11-12  9:56 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Jens Axboe, Long Li, Christoph Hellwig, linux-nvme

On Mon, Nov 11, 2019 at 05:44:10PM -0800, Sagi Grimberg wrote:
> 
> > f9dde187fa92("nvme-pci: remove cq check after submission") removes
> > cq check after submission, this change actually causes performance
> > regression on some NVMe drive in which single nvmeq handles requests
> > originated from more than one blk-mq sw queues(call it multi-mapping
> > queue).
> > 
> > Actually polling IO after submission can handle IO more efficiently,
> > especially for multi-mapping queue:
> > 
> > 1) the poll itself is very cheap, and lockless check on cq is enough,
> > see nvme_cqe_pending(). Especially the check can be done after batch
> > submission is done.
> > 
> > 2) when IO completion is observed via the poll in submission, the requst
> > may be completed without interrupt involved, or the interrupt handler
> > overload can be decreased.
> > 
> > 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
> > is observed via this poll, the IO can be completed locally, which is
> > cheaper than remote completion.
> > 
> > Follows test result done on Azure L80sv2 guest with NVMe drive(
> > Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> > numa nodes, and each NVMe drive supports 8 hw queues.
> 
> I think that the cpu lockup is a different problem, and we should
> separate this patch from that problem..

Why?

Most of CPU lockup is a performance issue in essence. In theory,
improvement in IO path could alleviate the soft lockup. 

> 
> > 
> > 1) test script:
> > fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> > 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> > 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> > 	--name=test --group_reporting --gtod_reduce=1
> > 
> > 2) test result:
> >       | v5.3       | v5.3 with this patchset
> > -------------------------------------------
> > IOPS | 130K       | 424K
> > 
> > Given IO is handled more efficiently in this way, the original report
> > of CPU lockup[1] on Hyper-V can't be observed any more after this patch
> > is applied. This issue is usually triggered when running IO from all
> > CPUs concurrently.
> > 
> 
> This is just adding code that we already removed but in a more
> convoluted way...

That commit removing the code actually causes regression for Azure
NVMe.

> 
> The correct place that should optimize the polling is aio/io_uring and
> not the driver locally IMO. Adding blk_poll to aio_getevents like
> io_uring would be a lot better I think..

This poll is actually one-shot poll, and I shouldn't call it poll, and
it should have been called as 'check cq'.

I believe it has been tried for supporting aio poll before, seems not
successful.

> 
> > I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
> > nodes), small improvement is observed on running the above fio over two
> > NVMe drive with batch 1.
> 
> Given that you add shared lock and atomic ops in the data path you are
> bound to hurt some latency oriented workloads in some way..

The spin_trylock_irqsave() is just called in case that nvme_cqe_pending() is
true. My test on Optane doesn't show that latency is hurt.

However, I just found the Azure's NVMe is a bit special, in which
the 'Interrupt Coalescing' Feature register shows zero. But IO interrupt is
often triggered when there are many commands completed by drive.

For example in fio test(4k, randread aio, single job), when IOPS is
110K, interrupts per second is just 13~14K. When running heavy IO, the
interrupts per second can just reach 40~50K at most. And for normal nvme
drive, if 'Interrupt Coalescing' isn't used, most of times one interrupt
just complete one request in the rand IO test.

That said Azure's implementation must apply aggressive interrupt coalescing
even though the register doesn't claim it.

That seems the root cause of soft lockup for Azure since lots of requests
may be handled in one interrupt event, especially when interrupt event is
delay-handled by CPU. Then it can explain why this patch improves Azure
NVNe so much in single job fio.

But for other drives with N:1 mapping, the soft lockup risk still exists.

thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  2:39       ` Ming Lei
@ 2019-11-12 16:25         ` Hannes Reinecke
  2019-11-12 16:49           ` Keith Busch
  2019-11-12 21:20         ` Long Li
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-11-12 16:25 UTC (permalink / raw)
  To: Ming Lei, Long Li
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

On 11/12/19 3:39 AM, Ming Lei wrote:
> On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
>>> From: Christoph Hellwig <hch@lst.de>
>>> Sent: Monday, November 11, 2019 12:45 PM
>>> To: Ming Lei <ming.lei@redhat.com>
>>> Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>; Jens
>>> Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi Grimberg
>>> <sagi@grimberg.me>; Long Li <longli@microsoft.com>
>>> Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>>> mapping queue
>>>
>>> On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>>>> f9dde187fa92("nvme-pci: remove cq check after submission") removes cq
>>>> check after submission, this change actually causes performance
>>>> regression on some NVMe drive in which single nvmeq handles requests
>>>> originated from more than one blk-mq sw queues(call it multi-mapping
>>>> queue).
>>>
>>>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>>>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10 numa
>>>> nodes, and each NVMe drive supports 8 hw queues.
>>>
>>> Have you actually seen this on a real nvme drive as well?
>>>
>>> Note that it is kinda silly to limit queues like that in VMs, so I really don't think
>>> we should optimize the driver for this particular case.
>>
>> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD (with 32 hardware queues). Tests also showed soft lockup when 32 queues are shared by 80 CPUs. 
>>
> 
> BTW, do you see if this simple change makes a difference?
> 
>> The issue will likely show up if the number of NVMe hardware queues is less than the number of CPUs. I think this may be a likely configuration on a very large system. (e.g. the largest VM on Azure has 416 cores)
>>
> 
> 'the number of NVMe hardware queues' above should be the number of single NVMe drive.
> I believe 32 hw queues is common, also poll queues may take several from the total 32.
> When interrupt handling on single CPU core can't catch up with NVMe's IO handling,
> soft lockup could be triggered. Of course, there are lot kinds of supported processors
> by Linux.
> 
But then we should rather work on eliminating the soft lockup itself.
Switching to polling for completions on the same CPU isn't going to
help; you just stall all other NVMe's which might be waiting for
interrupts arriving on this CPU.
(Nitpick: what does happen with the interrupt if we have a mask of
several CPUs? Will the interrupt delivered to one CPU?
To all in the mask? And if that, how do the other CPU cores notice that
one is working on that interrupt? Questions ...)

Can't we implement blk_poll? Or maybe even threaded interrupts?

> Also when (nr_nvme_drives * nr_nvme_hw_queues) > nr_cpu_cores, one same CPU
> can be assigned to handle more than 1 nvme IO queue interrupt from different
> NVMe drive, the situation becomes worse.
> 
That is arguably bad; especially so as we're doing automatic interrupt
affinity.

-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 16:25         ` Hannes Reinecke
@ 2019-11-12 16:49           ` Keith Busch
  2019-11-12 17:29             ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-11-12 16:49 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Long Li, linux-nvme, Ming Lei, Jens Axboe,
	Christoph Hellwig

On Tue, Nov 12, 2019 at 05:25:59PM +0100, Hannes Reinecke wrote:
> (Nitpick: what does happen with the interrupt if we have a mask of
> several CPUs? Will the interrupt delivered to one CPU?
> To all in the mask?

The hard-interrupt will be delivered to effectively one of the CPUs in the
mask. The one that is selected is determined when the IRQ is allocated,
and it should try to select one form the mask that is least used (see
matrix_find_best_cpu_managed()).

> Can't we implement blk_poll? Or maybe even threaded interrupts?

Threaded interrupts sound good. Currently, though, threaded interrupts
execute only on the same cpu as the hard irq. There was a proposal here to
change that to use any CPU in the mask, and I still think it makes sense

  http://lists.infradead.org/pipermail/linux-nvme/2019-August/026628.html

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 16:49           ` Keith Busch
@ 2019-11-12 17:29             ` Hannes Reinecke
  2019-11-13  3:05               ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-11-12 17:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Long Li, linux-nvme, Ming Lei, Jens Axboe,
	Christoph Hellwig

On 11/12/19 5:49 PM, Keith Busch wrote:
> On Tue, Nov 12, 2019 at 05:25:59PM +0100, Hannes Reinecke wrote:
>> (Nitpick: what does happen with the interrupt if we have a mask of
>> several CPUs? Will the interrupt delivered to one CPU?
>> To all in the mask?
> 
> The hard-interrupt will be delivered to effectively one of the CPUs in the
> mask. The one that is selected is determined when the IRQ is allocated,
> and it should try to select one form the mask that is least used (see
> matrix_find_best_cpu_managed()).
> 
Yeah, just as I thought.
Which also means that we need to redirect the irq to a non-busy cpu to 
avoid stalls under high load.
Expecially if we have several NVMes to deal with.

>> Can't we implement blk_poll? Or maybe even threaded interrupts?
> 
> Threaded interrupts sound good. Currently, though, threaded interrupts
> execute only on the same cpu as the hard irq. There was a proposal here to
> change that to use any CPU in the mask, and I still think it makes sense
> 
>    http://lists.infradead.org/pipermail/linux-nvme/2019-August/026628.html
> 
That looks like just the ticket.
In combination with threaded irqs and possibly blk_poll to avoid irq 
storms we should be good.

Let's see if I can come up with something...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  9:56     ` Ming Lei
@ 2019-11-12 17:35       ` Sagi Grimberg
  2019-11-12 21:17         ` Long Li
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Sagi Grimberg @ 2019-11-12 17:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, Jens Axboe, Long Li, Christoph Hellwig, linux-nvme


>>> f9dde187fa92("nvme-pci: remove cq check after submission") removes
>>> cq check after submission, this change actually causes performance
>>> regression on some NVMe drive in which single nvmeq handles requests
>>> originated from more than one blk-mq sw queues(call it multi-mapping
>>> queue).
>>>
>>> Actually polling IO after submission can handle IO more efficiently,
>>> especially for multi-mapping queue:
>>>
>>> 1) the poll itself is very cheap, and lockless check on cq is enough,
>>> see nvme_cqe_pending(). Especially the check can be done after batch
>>> submission is done.
>>>
>>> 2) when IO completion is observed via the poll in submission, the requst
>>> may be completed without interrupt involved, or the interrupt handler
>>> overload can be decreased.
>>>
>>> 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
>>> is observed via this poll, the IO can be completed locally, which is
>>> cheaper than remote completion.
>>>
>>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
>>> numa nodes, and each NVMe drive supports 8 hw queues.
>>
>> I think that the cpu lockup is a different problem, and we should
>> separate this patch from that problem..
> 
> Why?
> 
> Most of CPU lockup is a performance issue in essence. In theory,
> improvement in IO path could alleviate the soft lockup.

I don't think its a performance issue, being exposed to stall in hard
irq is a fundamental issue. I don't see how this patch solves it.

>>> 1) test script:
>>> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
>>> 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
>>> 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
>>> 	--name=test --group_reporting --gtod_reduce=1
>>>
>>> 2) test result:
>>>        | v5.3       | v5.3 with this patchset
>>> -------------------------------------------
>>> IOPS | 130K       | 424K
>>>
>>> Given IO is handled more efficiently in this way, the original report
>>> of CPU lockup[1] on Hyper-V can't be observed any more after this patch
>>> is applied. This issue is usually triggered when running IO from all
>>> CPUs concurrently.
>>>
>>
>> This is just adding code that we already removed but in a more
>> convoluted way...
> 
> That commit removing the code actually causes regression for Azure
> NVMe.

This issue has been observed long before we removed the polling from
the submission path and the cq_lock split.

>> The correct place that should optimize the polling is aio/io_uring and
>> not the driver locally IMO. Adding blk_poll to aio_getevents like
>> io_uring would be a lot better I think..
> 
> This poll is actually one-shot poll, and I shouldn't call it poll, and
> it should have been called as 'check cq'.
> 
> I believe it has been tried for supporting aio poll before, seems not
> successful.

Is there a fundamental reason why it can work for io_uring and cannot
work for aio?

>>> I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
>>> nodes), small improvement is observed on running the above fio over two
>>> NVMe drive with batch 1.
>>
>> Given that you add shared lock and atomic ops in the data path you are
>> bound to hurt some latency oriented workloads in some way..
> 
> The spin_trylock_irqsave() is just called in case that nvme_cqe_pending() is
> true. My test on Optane doesn't show that latency is hurt.

It also condition on the multi-mapping bit..

Can we know for a fact that this doesn't hurt what-so-ever? If so, we
should always do it, not conditionally do it. I would test this for
io_uring test applications that are doing heavy polling. I think
Jens had some benchmarks he used for how fast io_uring can go in
a single cpu core...

> However, I just found the Azure's NVMe is a bit special, in which
> the 'Interrupt Coalescing' Feature register shows zero. But IO interrupt is
> often triggered when there are many commands completed by drive.
> 
> For example in fio test(4k, randread aio, single job), when IOPS is
> 110K, interrupts per second is just 13~14K. When running heavy IO, the
> interrupts per second can just reach 40~50K at most. And for normal nvme
> drive, if 'Interrupt Coalescing' isn't used, most of times one interrupt
> just complete one request in the rand IO test.
> 
> That said Azure's implementation must apply aggressive interrupt coalescing
> even though the register doesn't claim it.

Did you check how many completions a reaped per interrupt?

> That seems the root cause of soft lockup for Azure since lots of requests
> may be handled in one interrupt event, especially when interrupt event is
> delay-handled by CPU. Then it can explain why this patch improves Azure
> NVNe so much in single job fio.
> 
> But for other drives with N:1 mapping, the soft lockup risk still exists.

As I said, we can discuss this as an optimization, but we should not
consider this as a solution to the irq-stall issue reported on Azure as
we agree that it doesn't solve the fundamental problem.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
  2019-11-11 20:44   ` Christoph Hellwig
  2019-11-12  1:44   ` Sagi Grimberg
@ 2019-11-12 18:11   ` Nadolski, Edmund
  2019-11-13 13:46     ` Ming Lei
  2 siblings, 1 reply; 28+ messages in thread
From: Nadolski, Edmund @ 2019-11-12 18:11 UTC (permalink / raw)
  To: Ming Lei, linux-nvme
  Cc: Jens Axboe, Keith Busch, Long Li, Sagi Grimberg, Christoph Hellwig

On 11/7/2019 8:55 PM, Ming Lei wrote:
> f9dde187fa92("nvme-pci: remove cq check after submission") removes
> cq check after submission, this change actually causes performance
> regression on some NVMe drive in which single nvmeq handles requests
> originated from more than one blk-mq sw queues(call it multi-mapping
> queue).
> 
> Actually polling IO after submission can handle IO more efficiently,
> especially for multi-mapping queue:
> 
> 1) the poll itself is very cheap, and lockless check on cq is enough,
> see nvme_cqe_pending(). Especially the check can be done after batch
> submission is done.
> 
> 2) when IO completion is observed via the poll in submission, the requst
> may be completed without interrupt involved, or the interrupt handler
> overload can be decreased.
> 
> 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
> is observed via this poll, the IO can be completed locally, which is
> cheaper than remote completion.
> 
> Follows test result done on Azure L80sv2 guest with NVMe drive(
> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> numa nodes, and each NVMe drive supports 8 hw queues.
> 
> 1) test script:
> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> 	--name=test --group_reporting --gtod_reduce=1
> 
> 2) test result:
>       | v5.3       | v5.3 with this patchset
> -------------------------------------------
> IOPS | 130K       | 424K
> 
> Given IO is handled more efficiently in this way, the original report
> of CPU lockup[1] on Hyper-V can't be observed any more after this patch
> is applied. This issue is usually triggered when running IO from all
> CPUs concurrently.
> 
> I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
> nodes), small improvement is observed on running the above fio over two
> NVMe drive with batch 1.
> 
> [1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com
> 
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Long Li <longli@microsoft.com>
> Fixes: f9dde187fa92("nvme-pci: remove cq check after submission")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/nvme/host/pci.c | 90 ++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 80 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5b20ab4d21d3..880376f43897 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -83,6 +83,7 @@ struct nvme_queue;
>   
>   static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
>   static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
> +static void nvme_poll_in_submission(struct nvme_queue *nvmeq);
>   
>   /*
>    * Represents an NVM Express device.  Each nvme_dev is a PCI function.
> @@ -165,7 +166,10 @@ struct nvme_queue {
>   	spinlock_t sq_lock;
>   	void *sq_cmds;
>   	 /* only used for poll queues: */
> -	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
> +	union {
> +		spinlock_t cq_poll_lock;
> +		spinlock_t cq_lock;
> +	}____cacheline_aligned_in_smp;

Is the new lock intended to protect anything differently than the old lock?

>   	volatile struct nvme_completion *cqes;
>   	struct blk_mq_tags **tags;
>   	dma_addr_t sq_dma_addr;
> @@ -185,6 +189,7 @@ struct nvme_queue {
>   #define NVMEQ_SQ_CMB		1
>   #define NVMEQ_DELETE_ERROR	2
>   #define NVMEQ_POLLED		3
> +#define NVMEQ_MULTI_MAPPING	4
>   	u32 *dbbuf_sq_db;
>   	u32 *dbbuf_cq_db;
>   	u32 *dbbuf_sq_ei;
> @@ -911,6 +916,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>   
>   	blk_mq_start_request(req);
>   	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> +
> +	if (bd->last)
> +		nvme_poll_in_submission(nvmeq);
> +
>   	return BLK_STS_OK;
>   out_unmap_data:
>   	nvme_unmap_data(dev, req);
> @@ -1016,6 +1025,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
>   	return found;
>   }
>   
> +static inline irqreturn_t
> +nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
> +{
> +	irqreturn_t ret = IRQ_NONE;
> +
> +	if (nvmeq->cq_head != nvmeq->last_cq_head)
> +		ret = IRQ_HANDLED;
> +	nvme_process_cq(nvmeq, start, end, -1);
> +	nvmeq->last_cq_head = nvmeq->cq_head;
> +
> +	return ret;
> +}
> +
>   static irqreturn_t nvme_irq(int irq, void *data)
>   {
>   	struct nvme_queue *nvmeq = data;
> @@ -1027,10 +1049,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
>   	 * the irq handler, even if that was on another CPU.
>   	 */
>   	rmb();
> -	if (nvmeq->cq_head != nvmeq->last_cq_head)
> -		ret = IRQ_HANDLED;
> -	nvme_process_cq(nvmeq, &start, &end, -1);
> -	nvmeq->last_cq_head = nvmeq->cq_head;
> +	ret = nvme_update_cq(nvmeq, &start, &end);
>   	wmb();
>   
>   	if (start != end) {
> @@ -1041,6 +1060,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
>   	return ret;
>   }
>   
> +static irqreturn_t nvme_irq_multi_mapping(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +	irqreturn_t ret = IRQ_NONE;
> +	u16 start, end;
> +
> +	spin_lock(&nvmeq->cq_lock);
> +	ret = nvme_update_cq(nvmeq, &start, &end);
> +	spin_unlock(&nvmeq->cq_lock);
> +
> +	if (start != end) {
> +		nvme_complete_cqes(nvmeq, start, end);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return ret;
> +}
> +
>   static irqreturn_t nvme_irq_check(int irq, void *data)
>   {
>   	struct nvme_queue *nvmeq = data;
> @@ -1049,6 +1086,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>   	return IRQ_NONE;
>   }
>   
> +static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
> +{
> +	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) &&
> +			nvme_cqe_pending(nvmeq)) {
> +		unsigned long flags;
> +
> +		if (spin_trylock_irqsave(&nvmeq->cq_lock, flags)) {
> +			u16 start, end;
> +
> +			nvme_update_cq(nvmeq, &start, &end);
> +			spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
> +
> +			if (start != end)
> +				nvme_complete_cqes(nvmeq, start, end);
> +		}
> +	}
> +}

Just a nit, to me it reads simpler to return right away when the first test is 
false, rather than enclose the true path in an additional nesting level.

Thanks,
Ed

>   /*
>    * Poll for completions any queue, including those not dedicated to polling.
>    * Can be called from any context.
> @@ -1499,12 +1554,14 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>   {
>   	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>   	int nr = nvmeq->dev->ctrl.instance;
> +	irq_handler_t handler = test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) ?
> +		nvme_irq_multi_mapping : nvme_irq;
>   
>   	if (use_threaded_interrupts) {
>   		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>   	} else {
> -		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>   				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>   	}
>   }
> @@ -1514,7 +1571,13 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
>   	struct nvme_dev *dev = nvmeq->dev;
>   
>   	spin_lock_init(&nvmeq->sq_lock);
> -	spin_lock_init(&nvmeq->cq_poll_lock);
> +
> +	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags)) {
> +		WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
> +		spin_lock_init(&nvmeq->cq_lock);
> +	} else {
> +		spin_lock_init(&nvmeq->cq_poll_lock);
> +	}
>   
>   	nvmeq->sq_tail = 0;
>   	nvmeq->last_sq_tail = 0;
> @@ -1534,15 +1597,22 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>   	u16 vector = 0;
>   
>   	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
> +	clear_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
>   
>   	/*
>   	 * A queue's vector matches the queue identifier unless the controller
>   	 * has only one vector available.
>   	 */
> -	if (!polled)
> +	if (!polled) {
> +		struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
>   		vector = dev->num_vecs == 1 ? 0 : qid;
> -	else
> +		if (vector && cpumask_weight(pci_irq_get_affinity(pdev,
> +						vector)) > 1)
> +			set_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags);
> +	} else {
>   		set_bit(NVMEQ_POLLED, &nvmeq->flags);
> +	}
>   
>   	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
>   	if (result)
> 


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 17:35       ` Sagi Grimberg
@ 2019-11-12 21:17         ` Long Li
  2019-11-12 23:44         ` Jens Axboe
  2019-11-13  2:47         ` Ming Lei
  2 siblings, 0 replies; 28+ messages in thread
From: Long Li @ 2019-11-12 21:17 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme

>Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>mapping queue
>
>
>>>> f9dde187fa92("nvme-pci: remove cq check after submission") removes
>>>> cq check after submission, this change actually causes performance
>>>> regression on some NVMe drive in which single nvmeq handles requests
>>>> originated from more than one blk-mq sw queues(call it multi-mapping
>>>> queue).
>>>>
>>>> Actually polling IO after submission can handle IO more efficiently,
>>>> especially for multi-mapping queue:
>>>>
>>>> 1) the poll itself is very cheap, and lockless check on cq is
>>>> enough, see nvme_cqe_pending(). Especially the check can be done
>>>> after batch submission is done.
>>>>
>>>> 2) when IO completion is observed via the poll in submission, the
>>>> requst may be completed without interrupt involved, or the interrupt
>>>> handler overload can be decreased.
>>>>
>>>> 3) when single sw queue is submiting IOs to this hw queue, if IOs
>>>> completion is observed via this poll, the IO can be completed
>>>> locally, which is cheaper than remote completion.
>>>>
>>>> Follows test result done on Azure L80sv2 guest with NVMe drive(
>>>> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
>>>> numa nodes, and each NVMe drive supports 8 hw queues.
>>>
>>> I think that the cpu lockup is a different problem, and we should
>>> separate this patch from that problem..
>>
>> Why?
>>
>> Most of CPU lockup is a performance issue in essence. In theory,
>> improvement in IO path could alleviate the soft lockup.
>
>I don't think its a performance issue, being exposed to stall in hard irq is a
>fundamental issue. I don't see how this patch solves it.

With the patch, it's possible to process CQ on the CPU issuing I/O, effectively distributing the work to process CQ across multiple CPUs.

The original condition to trigger lockup is that multiple CPUs issuing I/O, and one CPU (where the HW queue will interrupt) processes CQ for all of them. With this patch and when I/O load is heavy, those issuing CPUs always have something to poll after I/O is issued. It's very difficult to overload the CPU that takes interrupts.

I have run tests for several days and couldn't repro the original lockup with the patch.

>
>>>> 1) test script:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
>>>> 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
>>>> 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
>>>> 	--name=test --group_reporting --gtod_reduce=1
>>>>
>>>> 2) test result:
>>>>        | v5.3       | v5.3 with this patchset
>>>> -------------------------------------------
>>>> IOPS | 130K       | 424K
>>>>
>>>> Given IO is handled more efficiently in this way, the original
>>>> report of CPU lockup[1] on Hyper-V can't be observed any more after
>>>> this patch is applied. This issue is usually triggered when running
>>>> IO from all CPUs concurrently.
>>>>
>>>
>>> This is just adding code that we already removed but in a more
>>> convoluted way...
>>
>> That commit removing the code actually causes regression for Azure
>> NVMe.
>
>This issue has been observed long before we removed the polling from the
>submission path and the cq_lock split.
>
>>> The correct place that should optimize the polling is aio/io_uring
>>> and not the driver locally IMO. Adding blk_poll to aio_getevents like
>>> io_uring would be a lot better I think..
>>
>> This poll is actually one-shot poll, and I shouldn't call it poll, and
>> it should have been called as 'check cq'.
>>
>> I believe it has been tried for supporting aio poll before, seems not
>> successful.
>
>Is there a fundamental reason why it can work for io_uring and cannot work
>for aio?
>
>>>> I also run test on Optane(32 hw queues) in big machine(96 cores, 2
>>>> numa nodes), small improvement is observed on running the above fio
>>>> over two NVMe drive with batch 1.
>>>
>>> Given that you add shared lock and atomic ops in the data path you
>>> are bound to hurt some latency oriented workloads in some way..
>>
>> The spin_trylock_irqsave() is just called in case that
>> nvme_cqe_pending() is true. My test on Optane doesn't show that latency
>is hurt.
>
>It also condition on the multi-mapping bit..
>
>Can we know for a fact that this doesn't hurt what-so-ever? If so, we should
>always do it, not conditionally do it. I would test this for io_uring test
>applications that are doing heavy polling. I think Jens had some benchmarks
>he used for how fast io_uring can go in a single cpu core...
>
>> However, I just found the Azure's NVMe is a bit special, in which the
>> 'Interrupt Coalescing' Feature register shows zero. But IO interrupt
>> is often triggered when there are many commands completed by drive.
>>
>> For example in fio test(4k, randread aio, single job), when IOPS is
>> 110K, interrupts per second is just 13~14K. When running heavy IO, the
>> interrupts per second can just reach 40~50K at most. And for normal
>> nvme drive, if 'Interrupt Coalescing' isn't used, most of times one
>> interrupt just complete one request in the rand IO test.
>>
>> That said Azure's implementation must apply aggressive interrupt
>> coalescing even though the register doesn't claim it.
>
>Did you check how many completions a reaped per interrupt?
>
>> That seems the root cause of soft lockup for Azure since lots of
>> requests may be handled in one interrupt event, especially when
>> interrupt event is delay-handled by CPU. Then it can explain why this
>> patch improves Azure NVNe so much in single job fio.
>>
>> But for other drives with N:1 mapping, the soft lockup risk still exists.
>
>As I said, we can discuss this as an optimization, but we should not consider
>this as a solution to the irq-stall issue reported on Azure as we agree that it
>doesn't solve the fundamental problem.
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12  2:39       ` Ming Lei
  2019-11-12 16:25         ` Hannes Reinecke
@ 2019-11-12 21:20         ` Long Li
  2019-11-12 21:36           ` Keith Busch
  2019-11-13  2:24           ` Ming Lei
  1 sibling, 2 replies; 28+ messages in thread
From: Long Li @ 2019-11-12 21:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

>Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>mapping queue
>
>On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
>> >From: Christoph Hellwig <hch@lst.de>
>> >Sent: Monday, November 11, 2019 12:45 PM
>> >To: Ming Lei <ming.lei@redhat.com>
>> >Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>;
>> >Jens Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi
>> >Grimberg <sagi@grimberg.me>; Long Li <longli@microsoft.com>
>> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for
>> >multi- mapping queue
>> >
>> >On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>> >> f9dde187fa92("nvme-pci: remove cq check after submission") removes
>> >> cq check after submission, this change actually causes performance
>> >> regression on some NVMe drive in which single nvmeq handles
>> >> requests originated from more than one blk-mq sw queues(call it
>> >> multi-mapping queue).
>> >
>> >> Follows test result done on Azure L80sv2 guest with NVMe drive(
>> >> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
>> >> numa nodes, and each NVMe drive supports 8 hw queues.
>> >
>> >Have you actually seen this on a real nvme drive as well?
>> >
>> >Note that it is kinda silly to limit queues like that in VMs, so I
>> >really don't think we should optimize the driver for this particular case.
>>
>> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD
>(with 32 hardware queues). Tests also showed soft lockup when 32 queues
>are shared by 80 CPUs.
>>
>
>BTW, do you see if this simple change makes a difference?

Yes, I can confirm the patch fixed lockup on this VM configuration.  There is also no performance regression.

>
>> The issue will likely show up if the number of NVMe hardware queues is
>> less than the number of CPUs. I think this may be a likely
>> configuration on a very large system. (e.g. the largest VM on Azure
>> has 416 cores)
>>
>
>'the number of NVMe hardware queues' above should be the number of
>single NVMe drive.
>I believe 32 hw queues is common, also poll queues may take several from
>the total 32.
>When interrupt handling on single CPU core can't catch up with NVMe's IO
>handling, soft lockup could be triggered. Of course, there are lot kinds of
>supported processors by Linux.
>
>Also when (nr_nvme_drives * nr_nvme_hw_queues) > nr_cpu_cores, one
>same CPU can be assigned to handle more than 1 nvme IO queue interrupt
>from different NVMe drive, the situation becomes worse.
>
>
>Thanks,
>Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 21:20         ` Long Li
@ 2019-11-12 21:36           ` Keith Busch
  2019-11-13  0:50             ` Long Li
  2019-11-13  2:24           ` Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-11-12 21:36 UTC (permalink / raw)
  To: Long Li
  Cc: Jens Axboe, Sagi Grimberg, Christoph Hellwig, linux-nvme, Ming Lei

On Tue, Nov 12, 2019 at 09:20:27PM +0000, Long Li wrote:
> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
> >mapping queue
> >
> >On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
> >> >From: Christoph Hellwig <hch@lst.de>
> >> >Sent: Monday, November 11, 2019 12:45 PM
> >> >To: Ming Lei <ming.lei@redhat.com>
> >> >Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>;
> >> >Jens Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi
> >> >Grimberg <sagi@grimberg.me>; Long Li <longli@microsoft.com>
> >> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for
> >> >multi- mapping queue
> >> >
> >> >On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
> >> >> f9dde187fa92("nvme-pci: remove cq check after submission") removes
> >> >> cq check after submission, this change actually causes performance
> >> >> regression on some NVMe drive in which single nvmeq handles
> >> >> requests originated from more than one blk-mq sw queues(call it
> >> >> multi-mapping queue).
> >> >
> >> >> Follows test result done on Azure L80sv2 guest with NVMe drive(
> >> >> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> >> >> numa nodes, and each NVMe drive supports 8 hw queues.
> >> >
> >> >Have you actually seen this on a real nvme drive as well?
> >> >
> >> >Note that it is kinda silly to limit queues like that in VMs, so I
> >> >really don't think we should optimize the driver for this particular case.
> >>
> >> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD
> >(with 32 hardware queues). Tests also showed soft lockup when 32 queues
> >are shared by 80 CPUs.
> >>
> >
> >BTW, do you see if this simple change makes a difference?
> 
> Yes, I can confirm the patch fixed lockup on this VM configuration.  There is also no performance regression.

What if you just use threaded interrupts with the path that scheduels
the bottom-half on any CPU in the mask? Does that resolve lockup?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 17:35       ` Sagi Grimberg
  2019-11-12 21:17         ` Long Li
@ 2019-11-12 23:44         ` Jens Axboe
  2019-11-13  2:47         ` Ming Lei
  2 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-11-12 23:44 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Keith Busch, Long Li, Christoph Hellwig, linux-nvme

On 11/12/19 9:35 AM, Sagi Grimberg wrote:
>>> The correct place that should optimize the polling is aio/io_uring and
>>> not the driver locally IMO. Adding blk_poll to aio_getevents like
>>> io_uring would be a lot better I think..
>>
>> This poll is actually one-shot poll, and I shouldn't call it poll, and
>> it should have been called as 'check cq'.
>>
>> I believe it has been tried for supporting aio poll before, seems not
>> successful.
> 
> Is there a fundamental reason why it can work for io_uring and cannot
> work for aio?

I did that work (called it aio-ring), you need at least a new system
call to support setting up an aio io_context to do polling. Since we now
have better alternatives, it'd be a waste of time to try and support it
with aio.

-- 
Jens Axboe

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 21:36           ` Keith Busch
@ 2019-11-13  0:50             ` Long Li
  0 siblings, 0 replies; 28+ messages in thread
From: Long Li @ 2019-11-13  0:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, Christoph Hellwig, linux-nvme, Ming Lei

>Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
>mapping queue
>
>On Tue, Nov 12, 2019 at 09:20:27PM +0000, Long Li wrote:
>> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for
>> >multi- mapping queue
>> >
>> >On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
>> >> >From: Christoph Hellwig <hch@lst.de>
>> >> >Sent: Monday, November 11, 2019 12:45 PM
>> >> >To: Ming Lei <ming.lei@redhat.com>
>> >> >Cc: linux-nvme@lists.infradead.org; Keith Busch
>> >> ><kbusch@kernel.org>; Jens Axboe <axboe@fb.com>; Christoph Hellwig
>> >> ><hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>; Long Li
>> >> ><longli@microsoft.com>
>> >> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission
>> >> >for
>> >> >multi- mapping queue
>> >> >
>> >> >On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
>> >> >> f9dde187fa92("nvme-pci: remove cq check after submission")
>> >> >> removes cq check after submission, this change actually causes
>> >> >> performance regression on some NVMe drive in which single nvmeq
>> >> >> handles requests originated from more than one blk-mq sw
>> >> >> queues(call it multi-mapping queue).
>> >> >
>> >> >> Follows test result done on Azure L80sv2 guest with NVMe drive(
>> >> >> Microsoft Corporation Device b111). This guest has 80 CPUs and
>> >> >> 10 numa nodes, and each NVMe drive supports 8 hw queues.
>> >> >
>> >> >Have you actually seen this on a real nvme drive as well?
>> >> >
>> >> >Note that it is kinda silly to limit queues like that in VMs, so I
>> >> >really don't think we should optimize the driver for this particular case.
>> >>
>> >> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD
>> >(with 32 hardware queues). Tests also showed soft lockup when 32
>> >queues are shared by 80 CPUs.
>> >>
>> >
>> >BTW, do you see if this simple change makes a difference?
>>
>> Yes, I can confirm the patch fixed lockup on this VM configuration.  There is
>also no performance regression.
>
>What if you just use threaded interrupts with the path that scheduels the
>bottom-half on any CPU in the mask? Does that resolve lockup?

Yes, that patch also fixed the soft lockup problem. But it also introduced a performance regression, the peak IOPS dropped 40%.

The reason is that I/O issuing processes (FIO in the tests) get more involuntary schedules.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 21:20         ` Long Li
  2019-11-12 21:36           ` Keith Busch
@ 2019-11-13  2:24           ` Ming Lei
  1 sibling, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-13  2:24 UTC (permalink / raw)
  To: Long Li
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, Sagi Grimberg

On Tue, Nov 12, 2019 at 09:20:27PM +0000, Long Li wrote:
> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-
> >mapping queue
> >
> >On Tue, Nov 12, 2019 at 12:33:50AM +0000, Long Li wrote:
> >> >From: Christoph Hellwig <hch@lst.de>
> >> >Sent: Monday, November 11, 2019 12:45 PM
> >> >To: Ming Lei <ming.lei@redhat.com>
> >> >Cc: linux-nvme@lists.infradead.org; Keith Busch <kbusch@kernel.org>;
> >> >Jens Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi
> >> >Grimberg <sagi@grimberg.me>; Long Li <longli@microsoft.com>
> >> >Subject: Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for
> >> >multi- mapping queue
> >> >
> >> >On Fri, Nov 08, 2019 at 11:55:08AM +0800, Ming Lei wrote:
> >> >> f9dde187fa92("nvme-pci: remove cq check after submission") removes
> >> >> cq check after submission, this change actually causes performance
> >> >> regression on some NVMe drive in which single nvmeq handles
> >> >> requests originated from more than one blk-mq sw queues(call it
> >> >> multi-mapping queue).
> >> >
> >> >> Follows test result done on Azure L80sv2 guest with NVMe drive(
> >> >> Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> >> >> numa nodes, and each NVMe drive supports 8 hw queues.
> >> >
> >> >Have you actually seen this on a real nvme drive as well?
> >> >
> >> >Note that it is kinda silly to limit queues like that in VMs, so I
> >> >really don't think we should optimize the driver for this particular case.
> >>
> >> I tested on an Azure L80s_v2 VM with newer Samsung P983 NVMe SSD
> >(with 32 hardware queues). Tests also showed soft lockup when 32 queues
> >are shared by 80 CPUs.
> >>
> >
> >BTW, do you see if this simple change makes a difference?
> 
> Yes, I can confirm the patch fixed lockup on this VM configuration.  There is also no performance regression.

Long, thanks for your update.

As I explained in last email[1], Azure's single job IO performance issue
& soft lockup is very specific to Hyper-V's NVMe implementation, which
actually applies aggressive interrupt coalescing.

Guys, I'd suggest to fix it via checking cq after submission for Azure first,
which can be implemented as a quirk.

We still need to understand the real reason for other NVMe soft lockup
reports, so far I just saw such report, not get chance to investigate it.


[1] http://lists.infradead.org/pipermail/linux-nvme/2019-November/027948.html


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 17:35       ` Sagi Grimberg
  2019-11-12 21:17         ` Long Li
  2019-11-12 23:44         ` Jens Axboe
@ 2019-11-13  2:47         ` Ming Lei
  2 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-13  2:47 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Jens Axboe, Long Li, Christoph Hellwig, linux-nvme

On Tue, Nov 12, 2019 at 09:35:08AM -0800, Sagi Grimberg wrote:
> 
> > > > f9dde187fa92("nvme-pci: remove cq check after submission") removes
> > > > cq check after submission, this change actually causes performance
> > > > regression on some NVMe drive in which single nvmeq handles requests
> > > > originated from more than one blk-mq sw queues(call it multi-mapping
> > > > queue).
> > > > 
> > > > Actually polling IO after submission can handle IO more efficiently,
> > > > especially for multi-mapping queue:
> > > > 
> > > > 1) the poll itself is very cheap, and lockless check on cq is enough,
> > > > see nvme_cqe_pending(). Especially the check can be done after batch
> > > > submission is done.
> > > > 
> > > > 2) when IO completion is observed via the poll in submission, the requst
> > > > may be completed without interrupt involved, or the interrupt handler
> > > > overload can be decreased.
> > > > 
> > > > 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
> > > > is observed via this poll, the IO can be completed locally, which is
> > > > cheaper than remote completion.
> > > > 
> > > > Follows test result done on Azure L80sv2 guest with NVMe drive(
> > > > Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> > > > numa nodes, and each NVMe drive supports 8 hw queues.
> > > 
> > > I think that the cpu lockup is a different problem, and we should
> > > separate this patch from that problem..
> > 
> > Why?
> > 
> > Most of CPU lockup is a performance issue in essence. In theory,
> > improvement in IO path could alleviate the soft lockup.
> 
> I don't think its a performance issue, being exposed to stall in hard
> irq is a fundamental issue. I don't see how this patch solves it.

As I mentioned, it is usually because CPU's interrupt handling can't catch up
with the interrupt events from hardware. Either the device generates
interrupt too fast, or the CPU isn't fast enough.

> 
> > > > 1) test script:
> > > > fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> > > > 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> > > > 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> > > > 	--name=test --group_reporting --gtod_reduce=1
> > > > 
> > > > 2) test result:
> > > >        | v5.3       | v5.3 with this patchset
> > > > -------------------------------------------
> > > > IOPS | 130K       | 424K
> > > > 
> > > > Given IO is handled more efficiently in this way, the original report
> > > > of CPU lockup[1] on Hyper-V can't be observed any more after this patch
> > > > is applied. This issue is usually triggered when running IO from all
> > > > CPUs concurrently.
> > > > 
> > > 
> > > This is just adding code that we already removed but in a more
> > > convoluted way...
> > 
> > That commit removing the code actually causes regression for Azure
> > NVMe.
> 
> This issue has been observed long before we removed the polling from
> the submission path and the cq_lock split.
> 
> > > The correct place that should optimize the polling is aio/io_uring and
> > > not the driver locally IMO. Adding blk_poll to aio_getevents like
> > > io_uring would be a lot better I think..
> > 
> > This poll is actually one-shot poll, and I shouldn't call it poll, and
> > it should have been called as 'check cq'.
> > 
> > I believe it has been tried for supporting aio poll before, seems not
> > successful.
> 
> Is there a fundamental reason why it can work for io_uring and cannot
> work for aio?

Looks Jens has answered you.

> 
> > > > I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
> > > > nodes), small improvement is observed on running the above fio over two
> > > > NVMe drive with batch 1.
> > > 
> > > Given that you add shared lock and atomic ops in the data path you are
> > > bound to hurt some latency oriented workloads in some way..
> > 
> > The spin_trylock_irqsave() is just called in case that nvme_cqe_pending() is
> > true. My test on Optane doesn't show that latency is hurt.
> 
> It also condition on the multi-mapping bit..
> 
> Can we know for a fact that this doesn't hurt what-so-ever? If so, we
> should always do it, not conditionally do it. I would test this for
> io_uring test applications that are doing heavy polling. I think

io_uring uses dedicated poll queue, which doesn't generate irq, so
not necessary to use this approach since there is poll already.

> Jens had some benchmarks he used for how fast io_uring can go in
> a single cpu core...

So far I plan to implement it as a quirk for azure's hardware since it is
because the nvme implementation applies aggressive interrupt coalescing.

The aggressive interrupt coalescing has already introduced big IO latency.

> 
> > However, I just found the Azure's NVMe is a bit special, in which
> > the 'Interrupt Coalescing' Feature register shows zero. But IO interrupt is
> > often triggered when there are many commands completed by drive.
> > 
> > For example in fio test(4k, randread aio, single job), when IOPS is
> > 110K, interrupts per second is just 13~14K. When running heavy IO, the
> > interrupts per second can just reach 40~50K at most. And for normal nvme
> > drive, if 'Interrupt Coalescing' isn't used, most of times one interrupt
> > just complete one request in the rand IO test.
> > 
> > That said Azure's implementation must apply aggressive interrupt coalescing
> > even though the register doesn't claim it.
> 
> Did you check how many completions a reaped per interrupt?

In the single job test, at average 8 compeletions per interrupt can be observed
since its IOPS is ~110K.

> 
> > That seems the root cause of soft lockup for Azure since lots of requests
> > may be handled in one interrupt event, especially when interrupt event is
> > delay-handled by CPU. Then it can explain why this patch improves Azure
> > NVNe so much in single job fio.
> > 
> > But for other drives with N:1 mapping, the soft lockup risk still exists.
> 
> As I said, we can discuss this as an optimization, but we should not
> consider this as a solution to the irq-stall issue reported on Azure as
> we agree that it doesn't solve the fundamental problem.

Azure's soft lockup is special, and really caused by aggressive interrupt
coalescing, and it has been verified that the patch can fix it, meantime
single job IOPS is improved much.

We'd still need to understand real reason of other soft lockup reports,
I saw two such RH reports on real hardware, but not get chance to investigate
them yet.


Thanks 
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 17:29             ` Hannes Reinecke
@ 2019-11-13  3:05               ` Ming Lei
  2019-11-13  3:17                 ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-11-13  3:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Keith Busch,
	Christoph Hellwig

On Tue, Nov 12, 2019 at 06:29:34PM +0100, Hannes Reinecke wrote:
> On 11/12/19 5:49 PM, Keith Busch wrote:
> > On Tue, Nov 12, 2019 at 05:25:59PM +0100, Hannes Reinecke wrote:
> > > (Nitpick: what does happen with the interrupt if we have a mask of
> > > several CPUs? Will the interrupt delivered to one CPU?
> > > To all in the mask?
> > 
> > The hard-interrupt will be delivered to effectively one of the CPUs in the
> > mask. The one that is selected is determined when the IRQ is allocated,
> > and it should try to select one form the mask that is least used (see
> > matrix_find_best_cpu_managed()).
> > 
> Yeah, just as I thought.
> Which also means that we need to redirect the irq to a non-busy cpu to avoid
> stalls under high load.
> Expecially if we have several NVMes to deal with.

IRQ matrix tries best to assign different effective CPU to each vector for
handling interrupt.

In theory, if (nr_nvme_drives * nr_nvme_hw_queues) < nr_cpu_cores, each
hw queue may be assigned to one single effective CPU for handling the
queue's interrupt. Otherwise, one CPU may be responsible for handling
interrupts from more than 1 drive's queues. But that is just in theory,
for example, irq matrix takes admin queues into account of managed IRQ.

On Azure, there are such cases, however soft lockup still can't be
triggered after applying checking cq in submission. That means one CPU
is enough to handle two hw queue's interrupt in this case. Again,
it depends both CPU and NVMe's drive.

For network, packets flood can come any time unlimitedly, however number
of in-flight storage requests is always limited, so the situation could
be much better for storage IO than network in which NAPI may avoid the issue.

> 
> > > Can't we implement blk_poll? Or maybe even threaded interrupts?
> > 
> > Threaded interrupts sound good. Currently, though, threaded interrupts
> > execute only on the same cpu as the hard irq. There was a proposal here to
> > change that to use any CPU in the mask, and I still think it makes sense
> > 
> >    http://lists.infradead.org/pipermail/linux-nvme/2019-August/026628.html
> > 
> That looks like just the ticket.
> In combination with threaded irqs and possibly blk_poll to avoid irq storms
> we should be good.

Threaded irq can't help Azure's performance, because Azure's nvme implementation
applies aggressive interrupt coalescing.

Thanks, 
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-13  3:05               ` Ming Lei
@ 2019-11-13  3:17                 ` Keith Busch
  2019-11-13  3:57                   ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-11-13  3:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Hannes Reinecke,
	Christoph Hellwig

On Wed, Nov 13, 2019 at 11:05:20AM +0800, Ming Lei wrote:
> Threaded irq can't help Azure's performance, because Azure's nvme implementation
> applies aggressive interrupt coalescing.

This sounds like QD1 latency is really awful if it's coalescing this way.

Let me restate to ensure I understand how this patch addresses the
high-depth case: by polling during submission, a multi-threaded high
depth workload gets to carve up the number of CQEs handled by spreading
across more CPUs. Real hardware should see fewer completions per
interrupt, but we should be able to recreate this issue by enabling
interrupt coalescing on a real NVMe.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-13  3:17                 ` Keith Busch
@ 2019-11-13  3:57                   ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-13  3:57 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Hannes Reinecke,
	Christoph Hellwig

On Wed, Nov 13, 2019 at 12:17:02PM +0900, Keith Busch wrote:
> On Wed, Nov 13, 2019 at 11:05:20AM +0800, Ming Lei wrote:
> > Threaded irq can't help Azure's performance, because Azure's nvme implementation
> > applies aggressive interrupt coalescing.
> 
> This sounds like QD1 latency is really awful if it's coalescing this way.
> 
> Let me restate to ensure I understand how this patch addresses the
> high-depth case: by polling during submission, a multi-threaded high
> depth workload gets to carve up the number of CQEs handled by spreading
> across more CPUs. Real hardware should see fewer completions per
> interrupt,

Right, cause submission's frequency is high enough.

> but we should be able to recreate this issue by enabling
> interrupt coalescing on a real NVMe.

Yeah, I just tried it, and similar intx/s and throughput can be re-created
on one real NVMe.


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue
  2019-11-12 18:11   ` Nadolski, Edmund
@ 2019-11-13 13:46     ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2019-11-13 13:46 UTC (permalink / raw)
  To: Nadolski, Edmund
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Keith Busch,
	Christoph Hellwig

On Tue, Nov 12, 2019 at 11:11:40AM -0700, Nadolski, Edmund wrote:
> On 11/7/2019 8:55 PM, Ming Lei wrote:
> > f9dde187fa92("nvme-pci: remove cq check after submission") removes
> > cq check after submission, this change actually causes performance
> > regression on some NVMe drive in which single nvmeq handles requests
> > originated from more than one blk-mq sw queues(call it multi-mapping
> > queue).
> > 
> > Actually polling IO after submission can handle IO more efficiently,
> > especially for multi-mapping queue:
> > 
> > 1) the poll itself is very cheap, and lockless check on cq is enough,
> > see nvme_cqe_pending(). Especially the check can be done after batch
> > submission is done.
> > 
> > 2) when IO completion is observed via the poll in submission, the requst
> > may be completed without interrupt involved, or the interrupt handler
> > overload can be decreased.
> > 
> > 3) when single sw queue is submiting IOs to this hw queue, if IOs completion
> > is observed via this poll, the IO can be completed locally, which is
> > cheaper than remote completion.
> > 
> > Follows test result done on Azure L80sv2 guest with NVMe drive(
> > Microsoft Corporation Device b111). This guest has 80 CPUs and 10
> > numa nodes, and each NVMe drive supports 8 hw queues.
> > 
> > 1) test script:
> > fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
> > 	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> > 	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
> > 	--name=test --group_reporting --gtod_reduce=1
> > 
> > 2) test result:
> >       | v5.3       | v5.3 with this patchset
> > -------------------------------------------
> > IOPS | 130K       | 424K
> > 
> > Given IO is handled more efficiently in this way, the original report
> > of CPU lockup[1] on Hyper-V can't be observed any more after this patch
> > is applied. This issue is usually triggered when running IO from all
> > CPUs concurrently.
> > 
> > I also run test on Optane(32 hw queues) in big machine(96 cores, 2 numa
> > nodes), small improvement is observed on running the above fio over two
> > NVMe drive with batch 1.
> > 
> > [1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com
> > 
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: Jens Axboe <axboe@fb.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Long Li <longli@microsoft.com>
> > Fixes: f9dde187fa92("nvme-pci: remove cq check after submission")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   drivers/nvme/host/pci.c | 90 ++++++++++++++++++++++++++++++++++++-----
> >   1 file changed, 80 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 5b20ab4d21d3..880376f43897 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -83,6 +83,7 @@ struct nvme_queue;
> >   static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
> >   static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
> > +static void nvme_poll_in_submission(struct nvme_queue *nvmeq);
> >   /*
> >    * Represents an NVM Express device.  Each nvme_dev is a PCI function.
> > @@ -165,7 +166,10 @@ struct nvme_queue {
> >   	spinlock_t sq_lock;
> >   	void *sq_cmds;
> >   	 /* only used for poll queues: */
> > -	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
> > +	union {
> > +		spinlock_t cq_poll_lock;
> > +		spinlock_t cq_lock;
> > +	}____cacheline_aligned_in_smp;
> 
> Is the new lock intended to protect anything differently than the old lock?

Yeah, cq_poll_lock is only used for poll queue, and cq_lock is only for
irq queue. And one queue can either be poll or irq queue.

> 
> >   	volatile struct nvme_completion *cqes;
> >   	struct blk_mq_tags **tags;
> >   	dma_addr_t sq_dma_addr;
> > @@ -185,6 +189,7 @@ struct nvme_queue {
> >   #define NVMEQ_SQ_CMB		1
> >   #define NVMEQ_DELETE_ERROR	2
> >   #define NVMEQ_POLLED		3
> > +#define NVMEQ_MULTI_MAPPING	4
> >   	u32 *dbbuf_sq_db;
> >   	u32 *dbbuf_cq_db;
> >   	u32 *dbbuf_sq_ei;
> > @@ -911,6 +916,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >   	blk_mq_start_request(req);
> >   	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> > +
> > +	if (bd->last)
> > +		nvme_poll_in_submission(nvmeq);
> > +
> >   	return BLK_STS_OK;
> >   out_unmap_data:
> >   	nvme_unmap_data(dev, req);
> > @@ -1016,6 +1025,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
> >   	return found;
> >   }
> > +static inline irqreturn_t
> > +nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
> > +{
> > +	irqreturn_t ret = IRQ_NONE;
> > +
> > +	if (nvmeq->cq_head != nvmeq->last_cq_head)
> > +		ret = IRQ_HANDLED;
> > +	nvme_process_cq(nvmeq, start, end, -1);
> > +	nvmeq->last_cq_head = nvmeq->cq_head;
> > +
> > +	return ret;
> > +}
> > +
> >   static irqreturn_t nvme_irq(int irq, void *data)
> >   {
> >   	struct nvme_queue *nvmeq = data;
> > @@ -1027,10 +1049,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
> >   	 * the irq handler, even if that was on another CPU.
> >   	 */
> >   	rmb();
> > -	if (nvmeq->cq_head != nvmeq->last_cq_head)
> > -		ret = IRQ_HANDLED;
> > -	nvme_process_cq(nvmeq, &start, &end, -1);
> > -	nvmeq->last_cq_head = nvmeq->cq_head;
> > +	ret = nvme_update_cq(nvmeq, &start, &end);
> >   	wmb();
> >   	if (start != end) {
> > @@ -1041,6 +1060,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
> >   	return ret;
> >   }
> > +static irqreturn_t nvme_irq_multi_mapping(int irq, void *data)
> > +{
> > +	struct nvme_queue *nvmeq = data;
> > +	irqreturn_t ret = IRQ_NONE;
> > +	u16 start, end;
> > +
> > +	spin_lock(&nvmeq->cq_lock);
> > +	ret = nvme_update_cq(nvmeq, &start, &end);
> > +	spin_unlock(&nvmeq->cq_lock);
> > +
> > +	if (start != end) {
> > +		nvme_complete_cqes(nvmeq, start, end);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >   static irqreturn_t nvme_irq_check(int irq, void *data)
> >   {
> >   	struct nvme_queue *nvmeq = data;
> > @@ -1049,6 +1086,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
> >   	return IRQ_NONE;
> >   }
> > +static void nvme_poll_in_submission(struct nvme_queue *nvmeq)
> > +{
> > +	if (test_bit(NVMEQ_MULTI_MAPPING, &nvmeq->flags) &&
> > +			nvme_cqe_pending(nvmeq)) {
> > +		unsigned long flags;
> > +
> > +		if (spin_trylock_irqsave(&nvmeq->cq_lock, flags)) {
> > +			u16 start, end;
> > +
> > +			nvme_update_cq(nvmeq, &start, &end);
> > +			spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
> > +
> > +			if (start != end)
> > +				nvme_complete_cqes(nvmeq, start, end);
> > +		}
> > +	}
> > +}
> 
> Just a nit, to me it reads simpler to return right away when the first test
> is false, rather than enclose the true path in an additional nesting level.

This function will be inline, and the above may help to avoid
unnecessary stack variable allocation. 


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-11-13 13:47 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08  3:55 [PATCH 0/2] nvme-pci: improve IO performance via poll after batch submission Ming Lei
2019-11-08  3:55 ` [PATCH 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
2019-11-08  4:12   ` Keith Busch
2019-11-08  7:09     ` Ming Lei
2019-11-08  3:55 ` [PATCH 2/2] nvme-pci: poll IO after batch submission for multi-mapping queue Ming Lei
2019-11-11 20:44   ` Christoph Hellwig
2019-11-12  0:33     ` Long Li
2019-11-12  1:35       ` Sagi Grimberg
2019-11-12  2:39       ` Ming Lei
2019-11-12 16:25         ` Hannes Reinecke
2019-11-12 16:49           ` Keith Busch
2019-11-12 17:29             ` Hannes Reinecke
2019-11-13  3:05               ` Ming Lei
2019-11-13  3:17                 ` Keith Busch
2019-11-13  3:57                   ` Ming Lei
2019-11-12 21:20         ` Long Li
2019-11-12 21:36           ` Keith Busch
2019-11-13  0:50             ` Long Li
2019-11-13  2:24           ` Ming Lei
2019-11-12  2:07     ` Ming Lei
2019-11-12  1:44   ` Sagi Grimberg
2019-11-12  9:56     ` Ming Lei
2019-11-12 17:35       ` Sagi Grimberg
2019-11-12 21:17         ` Long Li
2019-11-12 23:44         ` Jens Axboe
2019-11-13  2:47         ` Ming Lei
2019-11-12 18:11   ` Nadolski, Edmund
2019-11-13 13:46     ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).