Linux-NVME Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
@ 2019-11-14  2:59 Ming Lei
  2019-11-14  2:59 ` [PATCH V3 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Ming Lei @ 2019-11-14  2:59 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Nadolski Edmund,
	Keith Busch, Christoph Hellwig

Hi,

The two patches fix one performance regression on Microsoft Corporation device. The
root cause is that Microsoft device applies aggressive interrupt coalescing, so
single job fio performance drops much after we removes checking cq in
f9dde187fa92("nvme-pci: remove cq check after submission").

Turns out this issue is very specific on Microsoft device, so add
a quirk for checking CQ on this device.

V3:
	- replace spin_trylock_irqsave with spin_trylock_irq
	- fix comment on cq_lock & cq_poll_lock

V2:
	- only check CQ for Microsoft device.



Ming Lei (2):
  nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  nvme-pci: check CQ after batch submission for Microsoft device

 drivers/nvme/host/nvme.h |  6 +++
 drivers/nvme/host/pci.c  | 98 ++++++++++++++++++++++++++++++++++------
 2 files changed, 90 insertions(+), 14 deletions(-)

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V3 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
  2019-11-14  2:59 [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
@ 2019-11-14  2:59 ` Ming Lei
  2019-11-14  2:59 ` [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
  2019-11-21  3:11 ` [PATCH V3 0/2] " Ming Lei
  2 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-11-14  2:59 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Nadolski Edmund,
	Keith Busch, Christoph Hellwig

Prepare for adding new cq lock for improving IO performance in case
that device applies aggressive interrupt coalescing, and the new added
cq_lock can share space with cq_poll_lock becasue both are mutually
exclusive.

Move sq/cq_poll lock initialization into nvme_init_queue so that
we may initialize the added cq_lock correctly and lockdep warning
can be avoided.

This way is safe because no IO can be started until nvme_init_queue
is returned.

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 869f462e6b6e..01728409a7ea 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1481,8 +1481,6 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 		goto free_cqdma;
 
 	nvmeq->dev = dev;
-	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_poll_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
@@ -1516,6 +1514,9 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 
+	spin_lock_init(&nvmeq->sq_lock);
+	spin_lock_init(&nvmeq->cq_poll_lock);
+
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
 	nvmeq->cq_head = 0;
-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-14  2:59 [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
  2019-11-14  2:59 ` [PATCH V3 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
@ 2019-11-14  2:59 ` Ming Lei
  2019-11-14  4:56   ` Keith Busch
  2019-11-21  3:11 ` [PATCH V3 0/2] " Ming Lei
  2 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-14  2:59 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Ming Lei, Jens Axboe, Nadolski Edmund,
	Keith Busch, Christoph Hellwig

f9dde187fa92("nvme-pci: remove cq check after submission") removes
cq check after submission, this change actually causes performance
regression on Microsoft Corporation device in which aggressive interrupt
coalescing is used, for example, the max single interrupt per second
can reach 40~50K at most.

Checking CQ after submission can improve IO completion for Microsoft
Corporation device:

1) the check is very cheap, and lockless check on cq is enough,
see nvme_cqe_pending(). Especially the check can be done after batch
submission is done. Compared with the big latency added by aggressive
interrupt coalescing, cost of the check can be ignored.

2) when IO completion is observed via checking CQ in submission, the requst
may be completed without interrupt involved, or the interrupt handler
overload can be decreased.

3) when single sw queue is submiting IOs to this hw queue, if IOs completion
is observed via this check, the IO can be completed locally, which is
cheaper than remote completion.

Follows test result done on Azure L80sv2 guest with NVMe drive(
Microsoft Corporation Device b111). This guest has 80 CPUs and 10
numa nodes, and each NVMe drive supports 8 hw queues.

1) test script:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1 \
	--iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
	--direct=1 --runtime=30 --numjobs=1 --rw=randread \
	--name=test --group_reporting --gtod_reduce=1

2) test result:
     | v5.3       | v5.3 with this patchset
-------------------------------------------
IOPS | 130K       | 424K

Given IO is handled more efficiently in this way, the original report
of CPU lockup[1] on Hyper-V can't be observed any more after this patch
is applied, since this patch reduces IO completions much in interrupt
handler.

[1] https://lore.kernel.org/lkml/1566281669-48212-1-git-send-email-longli@linuxonhyperv.com

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Fixes: f9dde187fa92("nvme-pci: remove cq check after submission")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/nvme.h |  6 +++
 drivers/nvme/host/pci.c  | 95 ++++++++++++++++++++++++++++++++++------
 2 files changed, 88 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 22e8401352c2..5575827016d1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -115,6 +115,12 @@ enum nvme_quirks {
 	 * Prevent tag overlap between queues
 	 */
 	NVME_QUIRK_SHARED_TAGS                  = (1 << 13),
+
+	/*
+	 * Check CQ in submission for improving performance since
+	 * the device applies aggressive interrupt coalescing
+	 */
+	NVME_QUIRK_CHECK_CQ			= (1 << 14),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 01728409a7ea..15fd8574ee68 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -83,6 +83,7 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
+static void nvme_check_cq(struct nvme_queue *nvmeq);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -164,8 +165,11 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t sq_lock;
 	void *sq_cmds;
-	 /* only used for poll queues: */
-	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
+
+	union {
+		spinlock_t cq_poll_lock; /* only used for poll queues */
+		spinlock_t cq_lock;	 /* only used for irq queues */
+	}____cacheline_aligned_in_smp;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -185,6 +189,7 @@ struct nvme_queue {
 #define NVMEQ_SQ_CMB		1
 #define NVMEQ_DELETE_ERROR	2
 #define NVMEQ_POLLED		3
+#define NVMEQ_CHECK_CQ		4
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -912,6 +917,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(req);
 	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
+
+	if (bd->last && test_bit(NVMEQ_CHECK_CQ, &nvmeq->flags))
+		nvme_check_cq(nvmeq);
+
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -1017,6 +1026,19 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static inline irqreturn_t
+nvme_update_cq(struct nvme_queue *nvmeq, u16 *start, u16 *end)
+{
+	irqreturn_t ret = IRQ_NONE;
+
+	if (nvmeq->cq_head != nvmeq->last_cq_head)
+		ret = IRQ_HANDLED;
+	nvme_process_cq(nvmeq, start, end, -1);
+	nvmeq->last_cq_head = nvmeq->cq_head;
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1028,10 +1050,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	 * the irq handler, even if that was on another CPU.
 	 */
 	rmb();
-	if (nvmeq->cq_head != nvmeq->last_cq_head)
-		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
-	nvmeq->last_cq_head = nvmeq->cq_head;
+	ret = nvme_update_cq(nvmeq, &start, &end);
 	wmb();
 
 	if (start != end) {
@@ -1042,6 +1061,24 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_for_check_cq(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	irqreturn_t ret = IRQ_NONE;
+	u16 start, end;
+
+	spin_lock(&nvmeq->cq_lock);
+	ret = nvme_update_cq(nvmeq, &start, &end);
+	spin_unlock(&nvmeq->cq_lock);
+
+	if (start != end) {
+		nvme_complete_cqes(nvmeq, start, end);
+		return IRQ_HANDLED;
+	}
+
+	return ret;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1050,6 +1087,22 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
+static void nvme_check_cq(struct nvme_queue *nvmeq)
+{
+	if (!nvme_cqe_pending(nvmeq))
+		return;
+
+	if (spin_trylock_irq(&nvmeq->cq_lock)) {
+		u16 start, end;
+
+		nvme_update_cq(nvmeq, &start, &end);
+		spin_unlock_irq(&nvmeq->cq_lock);
+
+		if (start != end)
+			nvme_complete_cqes(nvmeq, start, end);
+	}
+}
+
 /*
  * Poll for completions any queue, including those not dedicated to polling.
  * Can be called from any context.
@@ -1500,12 +1553,14 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 {
 	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
 	int nr = nvmeq->dev->ctrl.instance;
+	irq_handler_t handler = test_bit(NVMEQ_CHECK_CQ, &nvmeq->flags) ?
+		nvme_irq_for_check_cq : nvme_irq;
 
 	if (use_threaded_interrupts) {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
+		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	}
 }
@@ -1515,7 +1570,13 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	struct nvme_dev *dev = nvmeq->dev;
 
 	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_poll_lock);
+
+	if (test_bit(NVMEQ_CHECK_CQ, &nvmeq->flags)) {
+		WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
+		spin_lock_init(&nvmeq->cq_lock);
+	} else {
+		spin_lock_init(&nvmeq->cq_poll_lock);
+	}
 
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
@@ -1528,22 +1589,27 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	wmb(); /* ensure the first interrupt sees the initialization */
 }
 
-static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
+static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled,
+		bool check_cq)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
 	u16 vector = 0;
 
 	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
+	clear_bit(NVMEQ_CHECK_CQ, &nvmeq->flags);
 
 	/*
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	if (!polled)
+	if (!polled) {
 		vector = dev->num_vecs == 1 ? 0 : qid;
-	else
+		if (vector && check_cq)
+			set_bit(NVMEQ_CHECK_CQ, &nvmeq->flags);
+	} else {
 		set_bit(NVMEQ_POLLED, &nvmeq->flags);
+	}
 
 	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
 	if (result)
@@ -1740,7 +1806,8 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	for (i = dev->online_queues; i <= max; i++) {
 		bool polled = i > rw_queues;
 
-		ret = nvme_create_queue(&dev->queues[i], i, polled);
+		ret = nvme_create_queue(&dev->queues[i], i, polled,
+				dev->ctrl.quirks & NVME_QUIRK_CHECK_CQ);
 		if (ret)
 			break;
 	}
@@ -3112,6 +3179,8 @@ static const struct pci_device_id nvme_id_table[] = {
 	{ PCI_DEVICE(0x1cc1, 0x8201),   /* ADATA SX8200PNP 512GB */
 		.driver_data = NVME_QUIRK_NO_DEEPEST_PS |
 				NVME_QUIRK_IGNORE_DEV_SUBNQN, },
+	{ PCI_DEVICE(0x1414, 0xb111),   /* Microsoft Corporation device */
+		.driver_data = NVME_QUIRK_CHECK_CQ, },
 	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.20.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-14  2:59 ` [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
@ 2019-11-14  4:56   ` Keith Busch
  2019-11-14  8:56     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2019-11-14  4:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Christoph Hellwig

On Thu, Nov 14, 2019 at 10:59:17AM +0800, Ming Lei wrote:
> @@ -912,6 +917,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  
>  	blk_mq_start_request(req);
>  	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> +
> +	if (bd->last && test_bit(NVMEQ_CHECK_CQ, &nvmeq->flags))
> +		nvme_check_cq(nvmeq);
> +

You've given this quirk a special completion handler, how about giving
it a special queue_rq as well. Then you can check the completion queue
during submission while holding the same lock as before.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-14  4:56   ` Keith Busch
@ 2019-11-14  8:56     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-11-14  8:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Christoph Hellwig

On Wed, Nov 13, 2019 at 09:56:43PM -0700, Keith Busch wrote:
> On Thu, Nov 14, 2019 at 10:59:17AM +0800, Ming Lei wrote:
> > @@ -912,6 +917,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  
> >  	blk_mq_start_request(req);
> >  	nvme_submit_cmd(nvmeq, &cmnd, bd->last);
> > +
> > +	if (bd->last && test_bit(NVMEQ_CHECK_CQ, &nvmeq->flags))
> > +		nvme_check_cq(nvmeq);
> > +
> 
> You've given this quirk a special completion handler, how about giving
> it a special queue_rq as well. Then you can check the completion queue
> during submission while holding the same lock as before.

We can do that, either we have to pass one extra 'check_cq' parameter to
common helper or duplicate a nvme_queue_rq_quick(). Not sure what benefit
we can get by this way, given test_bit(nvmeq->flags) is always close to
zero cost. 

Also if same lock is held in both submission & completion for the quirk,
the lock contention will be increased for the quirk since the lock has
to cover command submission, which shouldn't be needed.


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-14  2:59 [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
  2019-11-14  2:59 ` [PATCH V3 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
  2019-11-14  2:59 ` [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
@ 2019-11-21  3:11 ` " Ming Lei
  2019-11-21  6:14   ` Christoph Hellwig
  2 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-21  3:11 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, Long Li, Jens Axboe, Nadolski Edmund, Keith Busch,
	Christoph Hellwig

On Thu, Nov 14, 2019 at 10:59:15AM +0800, Ming Lei wrote:
> Hi,
> 
> The two patches fix one performance regression on Microsoft Corporation device. The
> root cause is that Microsoft device applies aggressive interrupt coalescing, so
> single job fio performance drops much after we removes checking cq in
> f9dde187fa92("nvme-pci: remove cq check after submission").
> 
> Turns out this issue is very specific on Microsoft device, so add
> a quirk for checking CQ on this device.
> 
> V3:
> 	- replace spin_trylock_irqsave with spin_trylock_irq
> 	- fix comment on cq_lock & cq_poll_lock
> 
> V2:
> 	- only check CQ for Microsoft device.
> 
> 
> 
> Ming Lei (2):
>   nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue
>   nvme-pci: check CQ after batch submission for Microsoft device
> 
>  drivers/nvme/host/nvme.h |  6 +++
>  drivers/nvme/host/pci.c  | 98 ++++++++++++++++++++++++++++++++++------
>  2 files changed, 90 insertions(+), 14 deletions(-)
> 
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Long Li <longli@microsoft.com>
> -- 
> 2.20.1
> 

Hi Guys,

Ping...

Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-21  3:11 ` [PATCH V3 0/2] " Ming Lei
@ 2019-11-21  6:14   ` Christoph Hellwig
  2019-11-21  7:46     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2019-11-21  6:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Keith Busch, Christoph Hellwig

On Thu, Nov 21, 2019 at 11:11:54AM +0800, Ming Lei wrote:
> Hi Guys,
> 
> Ping...

I think everyone has told you that it is an invasive horrible hack
that papers of the underlying problem(s).  I'm not sure what more
we can do here.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-21  6:14   ` Christoph Hellwig
@ 2019-11-21  7:46     ` Ming Lei
  2019-11-21 15:45       ` Keith Busch
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-21  7:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Keith Busch

On Thu, Nov 21, 2019 at 07:14:31AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 21, 2019 at 11:11:54AM +0800, Ming Lei wrote:
> > Hi Guys,
> > 
> > Ping...
> 
> I think everyone has told you that it is an invasive horrible hack
> that papers of the underlying problem(s).  I'm not sure what more
> we can do here.

The problem is that the NVMe driver applies aggressive interrupt
coalescing clearly, that can be addressed exactly by this approach.

Also I don't understand why you think this patch is invasive and horrible.


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-21  7:46     ` Ming Lei
@ 2019-11-21 15:45       ` Keith Busch
  2019-11-22  9:44         ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2019-11-21 15:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Christoph Hellwig

On Thu, Nov 21, 2019 at 03:46:43PM +0800, Ming Lei wrote:
> On Thu, Nov 21, 2019 at 07:14:31AM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 21, 2019 at 11:11:54AM +0800, Ming Lei wrote:
> > > Hi Guys,
> > > 
> > > Ping...
> > 
> > I think everyone has told you that it is an invasive horrible hack
> > that papers of the underlying problem(s).  I'm not sure what more
> > we can do here.
> 
> The problem is that the NVMe driver applies aggressive interrupt
> coalescing clearly, that can be addressed exactly by this approach.

Can this default coalescing setting be turned off with a "set feature"
command?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-21 15:45       ` Keith Busch
@ 2019-11-22  9:44         ` Ming Lei
  2019-11-22  9:57           ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-22  9:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Christoph Hellwig

On Fri, Nov 22, 2019 at 12:45:31AM +0900, Keith Busch wrote:
> On Thu, Nov 21, 2019 at 03:46:43PM +0800, Ming Lei wrote:
> > On Thu, Nov 21, 2019 at 07:14:31AM +0100, Christoph Hellwig wrote:
> > > On Thu, Nov 21, 2019 at 11:11:54AM +0800, Ming Lei wrote:
> > > > Hi Guys,
> > > > 
> > > > Ping...
> > > 
> > > I think everyone has told you that it is an invasive horrible hack
> > > that papers of the underlying problem(s).  I'm not sure what more
> > > we can do here.
> > 
> > The problem is that the NVMe driver applies aggressive interrupt
> > coalescing clearly, that can be addressed exactly by this approach.
> 
> Can this default coalescing setting be turned off with a "set feature"
> command?
> 

At default, 'get feature -f 0x8' shows zero, and nothing changes after
running 'set feature -f 0x8 -v 0'.

BTW, soft lockup from another Samsung NVMe can be fixed by this patch
too. I am confirming if the Samsung NVMe applies aggressive interrupt
coalescing too.

Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22  9:44         ` Ming Lei
@ 2019-11-22  9:57           ` Christoph Hellwig
  2019-11-22 10:25             ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2019-11-22  9:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Keith Busch, Thomas Gleixner, Christoph Hellwig

On Fri, Nov 22, 2019 at 05:44:57PM +0800, Ming Lei wrote:
> > Can this default coalescing setting be turned off with a "set feature"
> > command?
> > 
> 
> At default, 'get feature -f 0x8' shows zero, and nothing changes after
> running 'set feature -f 0x8 -v 0'.
> 
> BTW, soft lockup from another Samsung NVMe can be fixed by this patch
> too. I am confirming if the Samsung NVMe applies aggressive interrupt
> coalescing too.

I think we are missing up a few things here, and just polling the
completion queue from submission context isn't the right answer for
either.

The aggressive interrupt coalescing and resulting long run times of
the irq handler really just means we need to stop processing them from
hard irq context at all.  NVMe already has support for threaded
interrupts and we need to make that the default (and possibly even
the only option).  At that point we can do a cond_resched() in this
handler to avoid soft lockups.

The next problem is drivers with less completion queues than cpu cores,
as that will still overload the one cpu that the interrupt handler was
assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
handler that can be used for round robin scheduling, but that probably
won't help with getting good performance.  The other idea is to use
"virtual" completion queues.  NVMe allows free form command ids, so
we could OR an index for the relative cpu number inside this queue
into the command id and and then create one interrupt thread for
each of them.  Although I'd like to hear from Thomas on what he thinks
of multiple threads per hardirq first.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22  9:57           ` Christoph Hellwig
@ 2019-11-22 10:25             ` Ming Lei
  2019-11-22 14:04               ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-22 10:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Long Li, linux-nvme, Jens Axboe, Nadolski Edmund,
	Keith Busch, Thomas Gleixner

On Fri, Nov 22, 2019 at 10:57:43AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 22, 2019 at 05:44:57PM +0800, Ming Lei wrote:
> > > Can this default coalescing setting be turned off with a "set feature"
> > > command?
> > > 
> > 
> > At default, 'get feature -f 0x8' shows zero, and nothing changes after
> > running 'set feature -f 0x8 -v 0'.
> > 
> > BTW, soft lockup from another Samsung NVMe can be fixed by this patch
> > too. I am confirming if the Samsung NVMe applies aggressive interrupt
> > coalescing too.
> 
> I think we are missing up a few things here, and just polling the
> completion queue from submission context isn't the right answer for
> either.
> 
> The aggressive interrupt coalescing and resulting long run times of
> the irq handler really just means we need to stop processing them from
> hard irq context at all.  NVMe already has support for threaded
> interrupts and we need to make that the default (and possibly even
> the only option).  At that point we can do a cond_resched() in this
> handler to avoid soft lockups.

I am pretty sure that threaded interrupt will cause performance drop a lot,
and Long has verified that.

> 
> The next problem is drivers with less completion queues than cpu cores,

IRQ matrix has balanced interrupt load among CPUs already. For example,
on Azure, 8 cpu cores can be mapped to one hctx, however, there are only
several CPUs which handles interrupts from at most two queues.

> as that will still overload the one cpu that the interrupt handler was
> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt

Actually one CPU is fast enough to handle several drive's interrupt handling.
Also there is per-queue depth limit, and the interrupt flood issue in network
can't be serious on storage.

So far I only got three NVMe softlock tickets, two of them can be fixed by this
patch, and another one turns out a IOMMU lock issue. Are there other
NVMe softlock report?

At least Azure's NVMe applies aggressive interrupt coalescing, in which
the softlock is usually triggered by interrupt delay handling, then
there can be lots of requests to be handled in single interrupt handler
some times. It doesn't mean CPU is saturated, and it is just sort of
interrupt peak. And it is very specific with Azure's implementation.

Otherwise, this simple patch can't fix the issue.

> handler that can be used for round robin scheduling, but that probably
> won't help with getting good performance.  The other idea is to use
> "virtual" completion queues.  NVMe allows free form command ids, so
> we could OR an index for the relative cpu number inside this queue
> into the command id and and then create one interrupt thread for
> each of them.  Although I'd like to hear from Thomas on what he thinks
> of multiple threads per hardirq first.

As mentioned above, there isn't proof it is caused by interrupt loading
saturating the CPU, so not sure we need to re-invent a wheel.


Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22 10:25             ` Ming Lei
@ 2019-11-22 14:04               ` Jens Axboe
  2019-11-22 21:49                 ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2019-11-22 14:04 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: Sagi Grimberg, Long Li, linux-nvme, Nadolski Edmund, Keith Busch,
	Thomas Gleixner

On 11/22/19 3:25 AM, Ming Lei wrote:
>> as that will still overload the one cpu that the interrupt handler was
>> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
> 
> Actually one CPU is fast enough to handle several drive's interrupt handling.
> Also there is per-queue depth limit, and the interrupt flood issue in network
> can't be serious on storage.

This is true today, but it won't be true in the future. Lets aim for a
solution that's a little more future proof than just "enough today", if
we're going to make changes in this area.

-- 
Jens Axboe

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22 14:04               ` Jens Axboe
@ 2019-11-22 21:49                 ` Ming Lei
  2019-11-22 21:58                   ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-11-22 21:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Long Li, linux-nvme, Nadolski Edmund, Keith Busch,
	Thomas Gleixner, Christoph Hellwig

On Fri, Nov 22, 2019 at 02:04:52PM +0000, Jens Axboe wrote:
> On 11/22/19 3:25 AM, Ming Lei wrote:
> >> as that will still overload the one cpu that the interrupt handler was
> >> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
> > 
> > Actually one CPU is fast enough to handle several drive's interrupt handling.
> > Also there is per-queue depth limit, and the interrupt flood issue in network
> > can't be serious on storage.
> 
> This is true today, but it won't be true in the future. Lets aim for a
> solution that's a little more future proof than just "enough today", if
> we're going to make changes in this area.

That should be a new feature for future hardware, and we don't know any
performance details, and it can be hard to prepare for it now. Maybe
such hardware or case never comes:

- storage device has queue depth, which limits the max in-flight requests
to be handled in each queue's interrupt handler.

- Suppose such fast hardware comes, it isn't reasonable for them
to support N:1 mapping(N is big).

- Also IRQ matrix has balanced interrupt handling loading already, that
said most of times, one CPU is just responsible for handing one hw queue's
interrupt. Even in Azure's case, 8 CPUs are mapped to one hw queue, but
there is just several CPUs which is for responsible for at most 2 hw queues.

So could we focus on now and fix the regression first?

Thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22 21:49                 ` Ming Lei
@ 2019-11-22 21:58                   ` Jens Axboe
  2019-11-22 22:30                     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2019-11-22 21:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Long Li, linux-nvme, Nadolski Edmund, Keith Busch,
	Thomas Gleixner, Christoph Hellwig

On 11/22/19 2:49 PM, Ming Lei wrote:
> On Fri, Nov 22, 2019 at 02:04:52PM +0000, Jens Axboe wrote:
>> On 11/22/19 3:25 AM, Ming Lei wrote:
>>>> as that will still overload the one cpu that the interrupt handler was
>>>> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
>>>
>>> Actually one CPU is fast enough to handle several drive's interrupt handling.
>>> Also there is per-queue depth limit, and the interrupt flood issue in network
>>> can't be serious on storage.
>>
>> This is true today, but it won't be true in the future. Lets aim for a
>> solution that's a little more future proof than just "enough today", if
>> we're going to make changes in this area.
> 
> That should be a new feature for future hardware, and we don't know any
> performance details, and it can be hard to prepare for it now. Maybe
> such hardware or case never comes:

Oh it'll surely come, and maybe sooner than you think. My point is that
using "one CPU is fast enough to handle several drive interrupts" is
very shortsighted, and probably not even true today.

> - storage device has queue depth, which limits the max in-flight requests
> to be handled in each queue's interrupt handler.

Only if new requests aren't also coming in and completing while you are
doing that work.
> 
> - Suppose such fast hardware comes, it isn't reasonable for them
> to support N:1 mapping(N is big).

Very true, in fact that's already pretty damn dumb today...

> - Also IRQ matrix has balanced interrupt handling loading already, that
> said most of times, one CPU is just responsible for handing one hw queue's
> interrupt. Even in Azure's case, 8 CPUs are mapped to one hw queue, but
> there is just several CPUs which is for responsible for at most 2 hw queues.
> 
> So could we focus on now and fix the regression first?

As far as I could tell from the other message, sounds like they both
have broken interrupt coalescing? Makes it harder to care, honestly...

But yes, I think we should do something about this. This really isn't a
new issue, if a core gets overloaded just doing completions from
interrupts, we should punt the work. NAPI has been doing that for ages,
and the block layer also used to have support it, but nobody used it.
Would be a great idea to make a blk-mq friendly version of that, with
the kinds of IOPS and latencies in mind that we see today and in the
coming years. I don't think hacking around this in the nvme driver is a
very good way to go about it.

-- 
Jens Axboe

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device
  2019-11-22 21:58                   ` Jens Axboe
@ 2019-11-22 22:30                     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-11-22 22:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Long Li, linux-nvme, Nadolski Edmund, Keith Busch,
	Thomas Gleixner, Christoph Hellwig

On Fri, Nov 22, 2019 at 09:58:36PM +0000, Jens Axboe wrote:
> On 11/22/19 2:49 PM, Ming Lei wrote:
> > On Fri, Nov 22, 2019 at 02:04:52PM +0000, Jens Axboe wrote:
> >> On 11/22/19 3:25 AM, Ming Lei wrote:
> >>>> as that will still overload the one cpu that the interrupt handler was
> >>>> assigned to.  A dumb fix would be a cpu mask for the threaded interrupt
> >>>
> >>> Actually one CPU is fast enough to handle several drive's interrupt handling.
> >>> Also there is per-queue depth limit, and the interrupt flood issue in network
> >>> can't be serious on storage.
> >>
> >> This is true today, but it won't be true in the future. Lets aim for a
> >> solution that's a little more future proof than just "enough today", if
> >> we're going to make changes in this area.
> > 
> > That should be a new feature for future hardware, and we don't know any
> > performance details, and it can be hard to prepare for it now. Maybe
> > such hardware or case never comes:
> 
> Oh it'll surely come, and maybe sooner than you think. My point is that
> using "one CPU is fast enough to handle several drive interrupts" is
> very shortsighted, and probably not even true today.

That single CPU is responsible for handling more than one drives should
only happen in case that the following condition is true:

	nr_drives * nr_io_hw_queue > nr_cpus

> 
> > - storage device has queue depth, which limits the max in-flight requests
> > to be handled in each queue's interrupt handler.
> 
> Only if new requests aren't also coming in and completing while you are
> doing that work.
> > 
> > - Suppose such fast hardware comes, it isn't reasonable for them
> > to support N:1 mapping(N is big).
> 
> Very true, in fact that's already pretty damn dumb today...

OK, I guess it is because lots of NVMe only supports limited hw
queues(32).

> 
> > - Also IRQ matrix has balanced interrupt handling loading already, that
> > said most of times, one CPU is just responsible for handing one hw queue's
> > interrupt. Even in Azure's case, 8 CPUs are mapped to one hw queue, but
> > there is just several CPUs which is for responsible for at most 2 hw queues.

It also depends on how many drives are used on single machine. The issue
is possible only when the number of drives is big enough. I guess it
isn't unusual.

> > 
> > So could we focus on now and fix the regression first?
> 
> As far as I could tell from the other message, sounds like they both
> have broken interrupt coalescing? Makes it harder to care, honestly...

Yeah, I found two reports on two different drives, both can
be fixed by this patch. Not see other reports which is caused by
too much interrupt loading on single CPU. That is why I tried to
avoid generic approach...

> 
> But yes, I think we should do something about this. This really isn't a
> new issue, if a core gets overloaded just doing completions from
> interrupts, we should punt the work. NAPI has been doing that for ages,
> and the block layer also used to have support it, but nobody used it.
> Would be a great idea to make a blk-mq friendly version of that, with
> the kinds of IOPS and latencies in mind that we see today and in the
> coming years. I don't think hacking around this in the nvme driver is a
> very good way to go about it.

OK, I will look at this approach, and Sagi has posted one such patch.

thanks,
Ming


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, back to index

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-14  2:59 [PATCH V3 0/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
2019-11-14  2:59 ` [PATCH V3 1/2] nvme-pci: move sq/cq_poll lock initialization into nvme_init_queue Ming Lei
2019-11-14  2:59 ` [PATCH V3 2/2] nvme-pci: check CQ after batch submission for Microsoft device Ming Lei
2019-11-14  4:56   ` Keith Busch
2019-11-14  8:56     ` Ming Lei
2019-11-21  3:11 ` [PATCH V3 0/2] " Ming Lei
2019-11-21  6:14   ` Christoph Hellwig
2019-11-21  7:46     ` Ming Lei
2019-11-21 15:45       ` Keith Busch
2019-11-22  9:44         ` Ming Lei
2019-11-22  9:57           ` Christoph Hellwig
2019-11-22 10:25             ` Ming Lei
2019-11-22 14:04               ` Jens Axboe
2019-11-22 21:49                 ` Ming Lei
2019-11-22 21:58                   ` Jens Axboe
2019-11-22 22:30                     ` Ming Lei

Linux-NVME Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvme/0 linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ https://lore.kernel.org/linux-nvme \
		linux-nvme@lists.infradead.org
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-nvme


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git