* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
@ 2016-10-05  9:42 Sagi Grimberg
  2016-10-05  9:42 ` [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle Sagi Grimberg
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Currently we have a couple of problems with our completion processing scheme:

1. We abuse the hard-irq context by doing completion polling from it
(which is why the threaded interrupts mode was introduced). We can
possibly stay there for too long, causing a hard-lockup (easily
triggered by running heavy randread workloads on systems with lots of
cpu cores).

2. We lack fairness between completion queues that share the same
MSI/MSIX assignment (completion queues that belong to different
devices). We need to drain a completion queue completely before
we can process another completion queue.

The irq-poll service solves both by correctly budgeting the completion
processing contexts and keeping per-cpu queues of completion queues.
As a bonus, using it reduces the overall number of nvme interrupts in
the system.
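
For readers new to irq-poll, the pattern the series adopts looks
roughly like the sketch below. This only illustrates the irq_poll API
(irq_poll_init/irq_poll_sched/irq_poll_complete); it is not code from
these patches, and my_queue, my_irq and my_process_cq are made-up
names:
--
#include <linux/interrupt.h>
#include <linux/irq_poll.h>

struct my_queue {
	struct irq_poll iop;	/* per-queue polling context */
	int irq;
};

/* hypothetical CQ reaper: consumes at most 'budget' completions */
static int my_process_cq(struct my_queue *q, int budget);

static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_process_cq(q, budget);

	if (done < budget) {
		/* queue drained within budget: back to interrupt mode */
		irq_poll_complete(iop);
		enable_irq(q->irq);
	}
	return done;
}

static irqreturn_t my_irq(int irq, void *data)
{
	struct my_queue *q = data;

	/* defer all completion processing to soft-irq context */
	disable_irq_nosync(irq);
	irq_poll_sched(&q->iop);
	return IRQ_HANDLED;
}

/* at queue setup: irq_poll_init(&q->iop, budget, my_poll); */
--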

I ran some tests with this and it seemed to work pretty well with
my low-end nvme devices. One phenomenon I've encountered was that
for a single-core, long queue-depth randread workload I saw around
an 8-10% iops decrease. However, when running multi-core IO I didn't
see any noticeable performance degradation. Non-polling canonical
randread latency doesn't seem to be affected either, and polling
mode IO is not affected, as expected.

So in addition to review and feedback, this is a call for testing
and benchmarking, as this touches the critical data path.

Sagi Grimberg (6):
  nvme-pci: Split __nvme_process_cq to poll and handle
  nvme-pci: Add budget to __nvme_process_cq
  nvme-pci: Use irq-poll for completion processing
  nvme: don't consume cq in queue_rq
  nvme-pci: open-code polling logic in nvme_poll
  nvme-pci: Get rid of threaded interrupts

 drivers/nvme/host/Kconfig |   1 +
 drivers/nvme/host/pci.c   | 179 ++++++++++++++++++++++++++--------------------
 2 files changed, 101 insertions(+), 79 deletions(-)

-- 
2.7.4


* [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:21   ` Johannes Thumshirn
  2016-10-05  9:42 ` [PATCH rfc 2/6] nvme-pci: Add budget to __nvme_process_cq Sagi Grimberg
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Just some rework to split the logic and make it slightly
more readable. This will help us to easily add the irq-poll
logic.

We remove the cqe_seen indication in preparation for
irq-poll, where we will schedule a soft-irq context for
polling, so the interrupt should always be considered handled.

Also, introduce an nvme_ring_cq_doorbell helper to encapsulate
the cq_vector validity check.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/pci.c | 117 +++++++++++++++++++++++++-----------------------
 1 file changed, 62 insertions(+), 55 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8dcf5a960951..1d8b35a319d0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -69,7 +69,7 @@ struct nvme_dev;
 struct nvme_queue;
 
 static int nvme_reset(struct nvme_dev *dev);
-static void nvme_process_cq(struct nvme_queue *nvmeq);
+static int nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
 /*
@@ -130,7 +130,6 @@ struct nvme_queue {
 	u16 cq_head;
 	u16 qid;
 	u8 cq_phase;
-	u8 cqe_seen;
 };
 
 /*
@@ -655,83 +654,91 @@ static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
 	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
 }
 
-static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
+static inline void nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
 {
-	u16 head, phase;
+	if (likely(nvmeq->cq_vector >= 0))
+		writel(nvmeq->cq_head, nvmeq->q_db + nvmeq->dev->db_stride);
+}
 
-	head = nvmeq->cq_head;
-	phase = nvmeq->cq_phase;
+static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
+		struct nvme_completion *cqe)
+{
+	struct request *req;
 
-	while (nvme_cqe_valid(nvmeq, head, phase)) {
-		struct nvme_completion cqe = nvmeq->cqes[head];
-		struct request *req;
+	if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
+		dev_warn(nvmeq->dev->ctrl.device,
+			"invalid id %d completed on queue %d\n",
+			cqe->command_id, le16_to_cpu(cqe->sq_id));
+		return;
+	}
 
-		if (++head == nvmeq->q_depth) {
-			head = 0;
-			phase = !phase;
-		}
+	/*
+	 * AEN requests are special as they don't time out and can
+	 * survive any kind of queue freeze and often don't respond to
+	 * aborts.  We don't even bother to allocate a struct request
+	 * for them but rather special case them here.
+	 */
+	if (unlikely(nvmeq->qid == 0 &&
+			cqe->command_id >= NVME_AQ_BLKMQ_DEPTH)) {
+		nvme_complete_async_event(&nvmeq->dev->ctrl, cqe);
+		return;
+	}
 
-		if (tag && *tag == cqe.command_id)
-			*tag = -1;
+	req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
+	if (req->cmd_type == REQ_TYPE_DRV_PRIV && req->special)
+		memcpy(req->special, cqe, sizeof(*cqe));
+	blk_mq_complete_request(req, le16_to_cpu(cqe->status) >> 1);
+}
 
-		if (unlikely(cqe.command_id >= nvmeq->q_depth)) {
-			dev_warn(nvmeq->dev->ctrl.device,
-				"invalid id %d completed on queue %d\n",
-				cqe.command_id, le16_to_cpu(cqe.sq_id));
-			continue;
-		}
+static inline int nvme_read_cqe(struct nvme_queue *nvmeq,
+		struct nvme_completion *cqe)
+{
+	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
+		*cqe = nvmeq->cqes[nvmeq->cq_head];
 
-		/*
-		 * AEN requests are special as they don't time out and can
-		 * survive any kind of queue freeze and often don't respond to
-		 * aborts.  We don't even bother to allocate a struct request
-		 * for them but rather special case them here.
-		 */
-		if (unlikely(nvmeq->qid == 0 &&
-				cqe.command_id >= NVME_AQ_BLKMQ_DEPTH)) {
-			nvme_complete_async_event(&nvmeq->dev->ctrl, &cqe);
-			continue;
+		if (++nvmeq->cq_head == nvmeq->q_depth) {
+			nvmeq->cq_head = 0;
+			nvmeq->cq_phase = !nvmeq->cq_phase;
 		}
+		return 1;
+	}
+	return 0;
+}
 
-		req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
-		if (req->cmd_type == REQ_TYPE_DRV_PRIV && req->special)
-			memcpy(req->special, &cqe, sizeof(cqe));
-		blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);
+static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
+{
+	struct nvme_completion cqe;
+	int consumed = 0;
 
-	}
+	while (nvme_read_cqe(nvmeq, &cqe)) {
+		nvme_handle_cqe(nvmeq, &cqe);
 
-	/* If the controller ignores the cq head doorbell and continuously
-	 * writes to the queue, it is theoretically possible to wrap around
-	 * the queue twice and mistakenly return IRQ_NONE.  Linux only
-	 * requires that 0.1% of your interrupts are handled, so this isn't
-	 * a big problem.
-	 */
-	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
-		return;
+		if (tag && *tag == cqe.command_id) {
+			*tag = -1;
+			break;
+		}
+	}
 
-	if (likely(nvmeq->cq_vector >= 0))
-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
-	nvmeq->cq_head = head;
-	nvmeq->cq_phase = phase;
+	if (consumed)
+		nvme_ring_cq_doorbell(nvmeq);
 
-	nvmeq->cqe_seen = 1;
+	return consumed;
 }
 
-static void nvme_process_cq(struct nvme_queue *nvmeq)
+static int nvme_process_cq(struct nvme_queue *nvmeq)
 {
-	__nvme_process_cq(nvmeq, NULL);
+	return __nvme_process_cq(nvmeq, NULL);
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
 {
-	irqreturn_t result;
 	struct nvme_queue *nvmeq = data;
+
 	spin_lock(&nvmeq->q_lock);
 	nvme_process_cq(nvmeq);
-	result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
-	nvmeq->cqe_seen = 0;
 	spin_unlock(&nvmeq->q_lock);
-	return result;
+
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t nvme_irq_check(int irq, void *data)
-- 
2.7.4


* [PATCH rfc 2/6] nvme-pci: Add budget to __nvme_process_cq
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
  2016-10-05  9:42 ` [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:26   ` Johannes Thumshirn
  2016-10-05  9:42 ` [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing Sagi Grimberg
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Prepare for irq-poll logic where we budget the completion queue
processing context.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/pci.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1d8b35a319d0..85bfd76fbee9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -705,7 +705,7 @@ static inline int nvme_read_cqe(struct nvme_queue *nvmeq,
 	return 0;
 }
 
-static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
+static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget, int *tag)
 {
 	struct nvme_completion cqe;
 	int consumed = 0;
@@ -713,6 +713,9 @@ static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
 	while (nvme_read_cqe(nvmeq, &cqe)) {
 		nvme_handle_cqe(nvmeq, &cqe);
 
+		if (++consumed == budget)
+			break;
+
 		if (tag && *tag == cqe.command_id) {
 			*tag = -1;
 			break;
@@ -727,7 +730,7 @@ static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
 
 static int nvme_process_cq(struct nvme_queue *nvmeq)
 {
-	return __nvme_process_cq(nvmeq, NULL);
+	return __nvme_process_cq(nvmeq, INT_MAX, NULL);
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
@@ -755,7 +758,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 
 	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
 		spin_lock_irq(&nvmeq->q_lock);
-		__nvme_process_cq(nvmeq, &tag);
+		__nvme_process_cq(nvmeq, INT_MAX, &tag);
 		spin_unlock_irq(&nvmeq->q_lock);
 
 		if (tag == -1)
-- 
2.7.4


* [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
  2016-10-05  9:42 ` [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle Sagi Grimberg
  2016-10-05  9:42 ` [PATCH rfc 2/6] nvme-pci: Add budget to __nvme_process_cq Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:40   ` Johannes Thumshirn
  2016-10-05  9:42 ` [PATCH rfc 4/6] nvme: don't consume cq in queue_rq Sagi Grimberg
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


The irq-poll interface allows us to implement correct
cq polling in terms of completion context processing
(it helps drivers to _not_ abuse irq context) and to maintain
fairness between multiple completion queues that share
the same MSI/MSIX assignments. It does so by maintaining
per-cpu queues of completion processing contexts and dispatching
the completion processing routines with a budget that they
need to obey.

Also, select IRQ_POLL in Kconfig as we rely on it now.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/Kconfig |  1 +
 drivers/nvme/host/pci.c   | 28 +++++++++++++++++++++++++---
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index db39d53cdfb9..1523fdaff537 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
 	tristate "NVM Express block device"
 	depends on PCI && BLOCK
 	select NVME_CORE
+	select IRQ_POLL
 	---help---
 	  The NVM Express driver is for solid state drives directly
 	  connected to the PCI or PCI Express bus.  If you know you
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 85bfd76fbee9..f37fad10007f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -42,6 +42,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <asm/unaligned.h>
+#include <linux/irq_poll.h>
 
 #include "nvme.h"
 
@@ -49,6 +50,7 @@
 #define NVME_AQ_DEPTH		256
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
+#define NVME_POLL_BUDGET_IRQ	256
 		
 /*
  * We handle AEN commands ourselves and don't even let the
@@ -130,6 +132,7 @@ struct nvme_queue {
 	u16 cq_head;
 	u16 qid;
 	u8 cq_phase;
+	struct irq_poll	iop;
 };
 
 /*
@@ -733,13 +736,30 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 	return __nvme_process_cq(nvmeq, INT_MAX, NULL);
 }
 
+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+	int completed;
+
+	spin_lock_irq(&nvmeq->q_lock);
+	completed = __nvme_process_cq(nvmeq, budget, NULL);
+	if (completed < budget) {
+		irq_poll_complete(&nvmeq->iop);
+		enable_irq(nvmeq->dev->entry[nvmeq->cq_vector].vector);
+	}
+	spin_unlock_irq(&nvmeq->q_lock);
+
+	return completed;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
 
-	spin_lock(&nvmeq->q_lock);
-	nvme_process_cq(nvmeq);
-	spin_unlock(&nvmeq->q_lock);
+	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
+		disable_irq_nosync(irq);
+		irq_poll_sched(&nvmeq->iop);
+	}
 
 	return IRQ_HANDLED;
 }
@@ -937,6 +957,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
+	irq_poll_disable(&nvmeq->iop);
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	if (nvmeq->sq_cmds)
@@ -1066,6 +1087,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
 			dev->ctrl.instance, qid);
 	spin_lock_init(&nvmeq->q_lock);
+	irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
-- 
2.7.4


* [PATCH rfc 4/6] nvme: don't consume cq in queue_rq
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
                   ` (2 preceding siblings ...)
  2016-10-05  9:42 ` [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:42   ` Johannes Thumshirn
  2016-10-05  9:42 ` [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll Sagi Grimberg
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Now that we have irq-poll we don't need to flow-control
our queues from the submission path. It doesn't make
sense to completely drain the completion queue on each
submit (or even to budget it).

Also, now we can remove the forward declaration of
nvme_process_cq.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/pci.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f37fad10007f..ba448fb755be 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,7 +71,6 @@ struct nvme_dev;
 struct nvme_queue;
 
 static int nvme_reset(struct nvme_dev *dev);
-static int nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
 /*
@@ -612,7 +611,6 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out;
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
-	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
 out:
-- 
2.7.4


* [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
                   ` (3 preceding siblings ...)
  2016-10-05  9:42 ` [PATCH rfc 4/6] nvme: don't consume cq in queue_rq Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:52   ` Johannes Thumshirn
  2016-10-05  9:42 ` [PATCH rfc 6/6] nvme-pci: Get rid of threaded interrupts Sagi Grimberg
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Given that the code is simple enough, this seems better
than passing a tag by reference at each call site.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/pci.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba448fb755be..a1de66f13e80 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -706,7 +706,7 @@ static inline int nvme_read_cqe(struct nvme_queue *nvmeq,
 	return 0;
 }
 
-static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget, int *tag)
+static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget)
 {
 	struct nvme_completion cqe;
 	int consumed = 0;
@@ -716,11 +716,6 @@ static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget, int *tag)
 
 		if (++consumed == budget)
 			break;
-
-		if (tag && *tag == cqe.command_id) {
-			*tag = -1;
-			break;
-		}
 	}
 
 	if (consumed)
@@ -731,7 +726,7 @@ static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget, int *tag)
 
 static int nvme_process_cq(struct nvme_queue *nvmeq)
 {
-	return __nvme_process_cq(nvmeq, INT_MAX, NULL);
+	return __nvme_process_cq(nvmeq, INT_MAX);
 }
 
 static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
@@ -740,7 +735,7 @@ static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
 	int completed;
 
 	spin_lock_irq(&nvmeq->q_lock);
-	completed = __nvme_process_cq(nvmeq, budget, NULL);
+	completed = __nvme_process_cq(nvmeq, budget);
 	if (completed < budget) {
 		irq_poll_complete(&nvmeq->iop);
 		enable_irq(nvmeq->dev->entry[nvmeq->cq_vector].vector);
@@ -773,17 +768,28 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_completion cqe;
+	int found = 0, consumed = 0;
 
-	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
-		spin_lock_irq(&nvmeq->q_lock);
-		__nvme_process_cq(nvmeq, INT_MAX, &tag);
-		spin_unlock_irq(&nvmeq->q_lock);
+	if (!nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase))
+		return 0;
+
+	spin_lock_irq(&nvmeq->q_lock);
+	while (nvme_read_cqe(nvmeq, &cqe)) {
+		nvme_handle_cqe(nvmeq, &cqe);
+		consumed++;
 
-		if (tag == -1)
-			return 1;
+		if (tag == cqe.command_id) {
+			found = 1;
+			break;
+		}
 	}
 
-	return 0;
+	if (consumed)
+		nvme_ring_cq_doorbell(nvmeq);
+	spin_unlock_irq(&nvmeq->q_lock);
+
+	return found;
 }
 
 static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl, int aer_idx)
-- 
2.7.4


* [PATCH rfc 6/6] nvme-pci: Get rid of threaded interrupts
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
                   ` (4 preceding siblings ...)
  2016-10-05  9:42 ` [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll Sagi Grimberg
@ 2016-10-05  9:42 ` Sagi Grimberg
  2016-10-05 13:47   ` Johannes Thumshirn
  2016-10-05 21:47 ` [PATCH rfc 0/6] convert nvme pci to use irq-poll service Keith Busch
  2016-10-27 10:09 ` Johannes Thumshirn
  7 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05  9:42 UTC (permalink / raw)


Now that we use irq-poll and will never call bio_endio
from hard-irq context, we no longer need threaded interrupts.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/pci.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a1de66f13e80..07248b5c14ce 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -58,9 +58,6 @@
  */
 #define NVME_AQ_BLKMQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AERS)
 
-static int use_threaded_interrupts;
-module_param(use_threaded_interrupts, int, 0);
-
 static bool use_cmb_sqes = true;
 module_param(use_cmb_sqes, bool, 0644);
 MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");
@@ -757,14 +754,6 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t nvme_irq_check(int irq, void *data)
-{
-	struct nvme_queue *nvmeq = data;
-	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase))
-		return IRQ_WAKE_THREAD;
-	return IRQ_NONE;
-}
-
 static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
@@ -1114,10 +1103,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 							const char *name)
 {
-	if (use_threaded_interrupts)
-		return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-					nvme_irq_check, nvme_irq, IRQF_SHARED,
-					name, nvmeq);
 	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
 				IRQF_SHARED, name, nvmeq);
 }
-- 
2.7.4


* [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle
  2016-10-05  9:42 ` [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle Sagi Grimberg
@ 2016-10-05 13:21   ` Johannes Thumshirn
  2016-10-05 16:52     ` Sagi Grimberg
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:21 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:09PM +0300, Sagi Grimberg wrote:
> Just some rework to split the logic and make it slightly
> more readable. This will help us to easily add the irq-poll
> logic.
> 
> We remove the cqe_seen indication in preparation for
> irq-poll, where we will schedule a soft-irq context for
> polling, so the interrupt should always be considered handled.
> 
> Also, introduce an nvme_ring_cq_doorbell helper to encapsulate
> the cq_vector validity check.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---

[...]

> +static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
> +{
> +	struct nvme_completion cqe;
> +	int consumed = 0;
>  
> -	}
> +	while (nvme_read_cqe(nvmeq, &cqe)) {
> +		nvme_handle_cqe(nvmeq, &cqe);
>  
> -	/* If the controller ignores the cq head doorbell and continuously
> -	 * writes to the queue, it is theoretically possible to wrap around
> -	 * the queue twice and mistakenly return IRQ_NONE.  Linux only
> -	 * requires that 0.1% of your interrupts are handled, so this isn't
> -	 * a big problem.
> -	 */
> -	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
> -		return;
> +		if (tag && *tag == cqe.command_id) {
> +			*tag = -1;
> +			break;
> +		}
> +	}
>  
> -	if (likely(nvmeq->cq_vector >= 0))
> -		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
> -	nvmeq->cq_head = head;
> -	nvmeq->cq_phase = phase;
> +	if (consumed)
> +		nvme_ring_cq_doorbell(nvmeq);
>  
> -	nvmeq->cqe_seen = 1;
> +	return consumed;
>  }

Won't 'consumed' always be 0 here and we thus never call
nvme_ring_cq_doorbell()? Am I overlooking something here, or is this
just for preparation of later patches?

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 2/6] nvme-pci: Add budget to __nvme_process_cq
  2016-10-05  9:42 ` [PATCH rfc 2/6] nvme-pci: Add budget to __nvme_process_cq Sagi Grimberg
@ 2016-10-05 13:26   ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:26 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:10PM +0300, Sagi Grimberg wrote:
> Prepare for irq-poll logic where we budget the completion queue
> processing context.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/pci.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1d8b35a319d0..85bfd76fbee9 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -705,7 +705,7 @@ static inline int nvme_read_cqe(struct nvme_queue *nvmeq,
>  	return 0;
>  }
>  
> -static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
> +static int __nvme_process_cq(struct nvme_queue *nvmeq, int budget, int *tag)
>  {
>  	struct nvme_completion cqe;
>  	int consumed = 0;
> @@ -713,6 +713,9 @@ static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
>  	while (nvme_read_cqe(nvmeq, &cqe)) {
>  		nvme_handle_cqe(nvmeq, &cqe);
>  
> +		if (++consumed == budget)
> +			break;
> +

Ah, here it is. Wouldn't it make sense to add a ++consumed dummy statement in
the previous patch (or maybe even an 'if (++consumed == INT_MAX) break;')?
Just a suggestion.

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing
  2016-10-05  9:42 ` [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing Sagi Grimberg
@ 2016-10-05 13:40   ` Johannes Thumshirn
  2016-10-05 16:57     ` Sagi Grimberg
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:40 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:11PM +0300, Sagi Grimberg wrote:
> The irq-poll interface allows us to implement correct
> cq polling in terms of completion context processing
> (it helps drivers to _not_ abuse irq context) and to maintain
> fairness between multiple completion queues that share
> the same MSI/MSIX assignments. It does so by maintaining
> per-cpu queues of completion processing contexts and dispatching
> the completion processing routines with a budget that they
> need to obey.
> 
> Also, select IRQ_POLL in Kconfig as we rely on it now.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/Kconfig |  1 +
>  drivers/nvme/host/pci.c   | 28 +++++++++++++++++++++++++---
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index db39d53cdfb9..1523fdaff537 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -5,6 +5,7 @@ config BLK_DEV_NVME
>  	tristate "NVM Express block device"
>  	depends on PCI && BLOCK
>  	select NVME_CORE
> +	select IRQ_POLL
>  	---help---
>  	  The NVM Express driver is for solid state drives directly
>  	  connected to the PCI or PCI Express bus.  If you know you
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 85bfd76fbee9..f37fad10007f 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -42,6 +42,7 @@
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <asm/unaligned.h>
> +#include <linux/irq_poll.h>
>  
>  #include "nvme.h"
>  
> @@ -49,6 +50,7 @@
>  #define NVME_AQ_DEPTH		256
>  #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
>  #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
> +#define NVME_POLL_BUDGET_IRQ	256

Is there a reason for the 256 or is it just a nicely suited value? Especially
as you've been using INT_MAX as a budget for CQ processing before (which I
think is kinda high).

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 4/6] nvme: don't consume cq in queue_rq
  2016-10-05  9:42 ` [PATCH rfc 4/6] nvme: don't consume cq in queue_rq Sagi Grimberg
@ 2016-10-05 13:42   ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:42 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:12PM +0300, Sagi Grimberg wrote:
> Now that we have irq-poll we don't need to flow-control
> our queues from the submission path. It doesn't make
> sense to completely drain the completion queue on each
> submit (or even to budget it).
> 
> Also, now we can remove the forward declaration of
> nvme_process_cq.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 6/6] nvme-pci: Get rid of threaded interrupts
  2016-10-05  9:42 ` [PATCH rfc 6/6] nvme-pci: Get rid of threaded interrupts Sagi Grimberg
@ 2016-10-05 13:47   ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:47 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:14PM +0300, Sagi Grimberg wrote:
> Now that we use irq-poll and will never call bio_endio
> from hard-irq context, we no longer need threaded interrupts.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll
  2016-10-05  9:42 ` [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll Sagi Grimberg
@ 2016-10-05 13:52   ` Johannes Thumshirn
  2016-10-05 17:02     ` Sagi Grimberg
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-05 13:52 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:13PM +0300, Sagi Grimberg wrote:
> Given that the code is simple enough, this seems better
> than passing a tag by reference at each call site.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>  drivers/nvme/host/pci.c | 36 +++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
> 

I don't really see how the call-by-value vs. call-by-reference change is a
better trade-off than open coding __nvme_process_cq(), as we'd then have to
duplicate any possible fix for this function (yes, I see it's quite simple
but I have some doubt it's 100% bug-free).

Thanks,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle
  2016-10-05 13:21   ` Johannes Thumshirn
@ 2016-10-05 16:52     ` Sagi Grimberg
  2016-10-05 19:49       ` Jon Derrick
  2016-10-10  7:25       ` Johannes Thumshirn
  0 siblings, 2 replies; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05 16:52 UTC (permalink / raw)



>> +static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
>> +{
>> +	struct nvme_completion cqe;
>> +	int consumed = 0;
>>
>> -	}
>> +	while (nvme_read_cqe(nvmeq, &cqe)) {
>> +		nvme_handle_cqe(nvmeq, &cqe);
>>
>> -	/* If the controller ignores the cq head doorbell and continuously
>> -	 * writes to the queue, it is theoretically possible to wrap around
>> -	 * the queue twice and mistakenly return IRQ_NONE.  Linux only
>> -	 * requires that 0.1% of your interrupts are handled, so this isn't
>> -	 * a big problem.
>> -	 */
>> -	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
>> -		return;
>> +		if (tag && *tag == cqe.command_id) {
>> +			*tag = -1;
>> +			break;
>> +		}
>> +	}
>>
>> -	if (likely(nvmeq->cq_vector >= 0))
>> -		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
>> -	nvmeq->cq_head = head;
>> -	nvmeq->cq_phase = phase;
>> +	if (consumed)
>> +		nvme_ring_cq_doorbell(nvmeq);
>>
>> -	nvmeq->cqe_seen = 1;
>> +	return consumed;
>>  }
>
> Won't 'consumed' always be 0 here and we thus never call
> nvme_ring_cq_doorbell()? Am I overlooking something here, or is this
> just for preparation of later patches?

That increment was lost in a squash. I'll restore it in v2.


* [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing
  2016-10-05 13:40   ` Johannes Thumshirn
@ 2016-10-05 16:57     ` Sagi Grimberg
  2016-10-10  7:47       ` Johannes Thumshirn
  0 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05 16:57 UTC (permalink / raw)



>> @@ -49,6 +50,7 @@
>>  #define NVME_AQ_DEPTH		256
>>  #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
>>  #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
>> +#define NVME_POLL_BUDGET_IRQ	256
>
> Is there a reason for the 256 or is it just a nicely suited value?

Umm, no good reason behind it; I used this value before for irq-poll
and it seemed to fit best. We can try other budgets or make it
configurable. The point is not to abuse the soft-irq context for too
long and to maintain fairness between completion queues, so we just
need a reasonable value.


> Especially
> as you've been using INT_MAX as a budget for CQ processing before (which I
> think is kinda high).

That was just for keeping the existing logic before we change to
irq-poll.
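
If we do end up wanting it configurable, a minimal option (untested
sketch; the parameter name below is made up) would be a module
parameter replacing the NVME_POLL_BUDGET_IRQ constant:
--
/* untested sketch: the parameter name is made up */
static unsigned int irq_poll_budget = 256;
module_param(irq_poll_budget, uint, 0644);
MODULE_PARM_DESC(irq_poll_budget,
		"completions to reap per irq-poll invocation");

	/* ... and in nvme_alloc_queue(): */
	irq_poll_init(&nvmeq->iop, irq_poll_budget, nvme_irqpoll_handler);
--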


* [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll
  2016-10-05 13:52   ` Johannes Thumshirn
@ 2016-10-05 17:02     ` Sagi Grimberg
  2016-10-10  7:49       ` Johannes Thumshirn
  0 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05 17:02 UTC (permalink / raw)



>> Given that the code is simple enough, this seems better
>> than passing a tag by reference at each call site.
>>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>>  drivers/nvme/host/pci.c | 36 +++++++++++++++++++++---------------
>>  1 file changed, 21 insertions(+), 15 deletions(-)
>>
>
> I don't really see how the call-by-value vs. call-by-reference change is a
> better trade-off than open coding __nvme_process_cq(), as we'd then have to
> duplicate any possible fix for this function (yes, I see it's quite simple
> but I have some doubt it's 100% bug-free).

I just figured it'd be nicer to remove the tag that we carry in all
the call sites because we want to reuse the code in nvme_poll(). It
made better sense before, but now that it's simpler (for me at least)
we can lose it.

Having said that, we can easily remove this patch if we're not all
on board...


* [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle
  2016-10-05 16:52     ` Sagi Grimberg
@ 2016-10-05 19:49       ` Jon Derrick
  2016-10-10  7:25       ` Johannes Thumshirn
  1 sibling, 0 replies; 27+ messages in thread
From: Jon Derrick @ 2016-10-05 19:49 UTC (permalink / raw)


Looks good, assuming you will restore the increment in v2!

Acked-by: Jon Derrick <jonathan.derrick at intel.com>

On Wed, Oct 05, 2016@07:52:02PM +0300, Sagi Grimberg wrote:
> 
> >>+static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
> >>+{
> >>+	struct nvme_completion cqe;
> >>+	int consumed = 0;
> >>
> >>-	}
> >>+	while (nvme_read_cqe(nvmeq, &cqe)) {
> >>+		nvme_handle_cqe(nvmeq, &cqe);
> >>
> >>-	/* If the controller ignores the cq head doorbell and continuously
> >>-	 * writes to the queue, it is theoretically possible to wrap around
> >>-	 * the queue twice and mistakenly return IRQ_NONE.  Linux only
> >>-	 * requires that 0.1% of your interrupts are handled, so this isn't
> >>-	 * a big problem.
> >>-	 */
> >>-	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
> >>-		return;
> >>+		if (tag && *tag == cqe.command_id) {
> >>+			*tag = -1;
> >>+			break;
> >>+		}
> >>+	}
> >>
> >>-	if (likely(nvmeq->cq_vector >= 0))
> >>-		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
> >>-	nvmeq->cq_head = head;
> >>-	nvmeq->cq_phase = phase;
> >>+	if (consumed)
> >>+		nvme_ring_cq_doorbell(nvmeq);
> >>
> >>-	nvmeq->cqe_seen = 1;
> >>+	return consumed;
> >> }
> >
> >Won't 'consumed' always be 0 here and we thus never call
> >nvme_ring_cq_doorbell()? Am I overlooking something here, or is this
> >just for preparation of later patches?
> 
> That increment was lost in a squash. I'll restore it in v2.
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
                   ` (5 preceding siblings ...)
  2016-10-05  9:42 ` [PATCH rfc 6/6] nvme-pci: Get rid of threaded interrupts Sagi Grimberg
@ 2016-10-05 21:47 ` Keith Busch
  2016-10-05 21:55   ` Sagi Grimberg
  2016-10-27 10:09 ` Johannes Thumshirn
  7 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2016-10-05 21:47 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:08PM +0300, Sagi Grimberg wrote:
> I ran some tests with this and it seemed to work pretty well with
> my low-end nvme devices. One phenomenon I've encountered was that
> for a single-core, long queue-depth randread workload I saw around
> an 8-10% iops decrease. However, when running multi-core IO I didn't
> see any noticeable performance degradation. Non-polling canonical
> randread latency doesn't seem to be affected either, and polling
> mode IO is not affected, as expected.
> 
> So in addition to review and feedback, this is a call for testing
> and benchmarking, as this touches the critical data path.

Hi Sagi,

Just reconfirming your findings with another data point, I ran this on
controllers with 3D Xpoint media, and single depth 4k random read latency
increased almost 7%. I'll try to see if there's anything else we can do to
bring that in.


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-05 21:47 ` [PATCH rfc 0/6] convert nvme pci to use irq-poll service Keith Busch
@ 2016-10-05 21:55   ` Sagi Grimberg
  2016-10-05 22:49     ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05 21:55 UTC (permalink / raw)



>> I ran some tests with this and it seemed to work pretty well with
>> my low-end nvme devices. One phenomenon I've encountered was that
>> for a single-core, long queue-depth randread workload I saw around
>> an 8-10% iops decrease. However, when running multi-core IO I didn't
>> see any noticeable performance degradation. Non-polling canonical
>> randread latency doesn't seem to be affected either, and polling
>> mode IO is not affected, as expected.
>>
>> So in addition to review and feedback, this is a call for testing
>> and benchmarking, as this touches the critical data path.
>
> Hi Sagi,
>
> Just reconfirming your findings with another data point, I ran this on
> controllers with 3D Xpoint media, and single depth 4k random read latency
> increased almost 7%. I'll try to see if there's anything else we can do to
> bring that in.

Actually I didn't notice latency increase with QD=1 but I'm using
low-end devices so I might have missed it.
Did you use libaio or psync (for polling mode)?

I'm a bit surprised that scheduling soft-irq (on the same core) is
so expensive (the networking folks are using it all over...)
Perhaps we need to look into napi and see if we're doing something
wrong there...

I wonder what would happen if we kept the cq processing in queue_rq
but budgeted it to something reasonably balanced? Maybe a poll budget
of 4 completions?

Does this have any effect with your Xpoint?
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 150941a1a730..28b33f518a3d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -608,6 +608,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
                 goto out;
         }
         __nvme_submit_cmd(nvmeq, &cmnd);
+       __nvme_process_cq(nvmeq, 4);
         spin_unlock_irq(&nvmeq->q_lock);
         return BLK_MQ_RQ_QUEUE_OK;
  out:
--


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-05 22:49     ` Keith Busch
@ 2016-10-05 22:48       ` Sagi Grimberg
  0 siblings, 0 replies; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-05 22:48 UTC (permalink / raw)



>> Actually I didn't notice latency increase with QD=1 but I'm using
>> low-end devices so I might have missed it.
>> Did you use libaio or psync (for polling mode)?
>
> I used 'sync' ioengine for random read. For sequential, I used 'dd',
> and it showed the same difference.
>
> If I use pvsync2 with --hipri in fio, then I see little to no difference.
>
>> I'm a bit surprised that scheduling soft-irq (on the same core) is
>> so expensive (the networking folks are using it all over...)
>> Perhaps we need to look into napi and see if we're doing something
>> wrong there...
>
> The latency increase I observed was ~0.5 microseconds. That may get lost
> in the noise for most controllers. Maybe that's within the expected
> increase using this soft-irq method?

Oh, I didn't realize the difference was that small. Perhaps that makes
better sense (although, starting to look at irq-poll more closely, I think
there is room for micro-optimizations).

Did you happen to test long queue-depth IO? I'm more concerned about that
end...

>> I wonder what would happen if we kept the cq processing in queue_rq
>> but budgeted it to something reasonably balanced? Maybe a poll budget
>> of 4 completions?
>>
>> Does this have any effect with your Xpoint?
>
> No difference with that.

Yea, it was a long shot...

> I was actually never sure about having this opportunistic polling in the
> IO submission path. For one, it can create higher latency outliers if
> a different controller's interrupt affinitized to the same CPU posted
> a completion first.

In the current scheme this opportunistic polling is mandatory as
self-flow-control, as without it nothing prevents nvme_irq from
consuming completions forever (if multiple cpu cores keep pounding the
queue with IO).

This patch set aims to tackle that too.

> Another thing we can do if we don't poll for completions in the submission
> path is split the single q_lock into submission and completion locks. Then
> we don't have to disable irqs in the submission path. I saw that demo'ed
> a while ago, and it was a small micro-improvement, but never saw the
> patch proposed on the public lists...

That'd definitely help!
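
To make that concrete, the split might look roughly like the sketch
below (untested, with assumed field names; today both paths share
q_lock):
--
/* untested sketch, assumed field names */
struct nvme_queue {
	/* ... existing fields ... */
	spinlock_t sq_lock;	/* protects sq_tail and the SQ doorbell */
	spinlock_t cq_lock;	/* protects cq_head/cq_phase and the CQ doorbell */
};

static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	/* ... command setup as today ... */

	/* completion reaping never takes sq_lock, so no irq disable here */
	spin_lock(&nvmeq->sq_lock);
	__nvme_submit_cmd(nvmeq, &cmnd);
	spin_unlock(&nvmeq->sq_lock);
	return BLK_MQ_RQ_QUEUE_OK;
}

static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
	int completed;

	spin_lock_irq(&nvmeq->cq_lock);
	completed = __nvme_process_cq(nvmeq, budget);
	if (completed < budget) {
		irq_poll_complete(&nvmeq->iop);
		enable_irq(nvmeq->dev->entry[nvmeq->cq_vector].vector);
	}
	spin_unlock_irq(&nvmeq->cq_lock);
	return completed;
}
--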


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-05 21:55   ` Sagi Grimberg
@ 2016-10-05 22:49     ` Keith Busch
  2016-10-05 22:48       ` Sagi Grimberg
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2016-10-05 22:49 UTC (permalink / raw)


On Thu, Oct 06, 2016@12:55:07AM +0300, Sagi Grimberg wrote:
> 
> > > I ran some tests with this and it seemed to work pretty well with
> > > my low-end nvme devices. One phenomenon I've encountered was that
> > > for a single-core, long queue-depth randread workload I saw around
> > > an 8-10% iops decrease. However, when running multi-core IO I didn't
> > > see any noticeable performance degradation. Non-polling canonical
> > > randread latency doesn't seem to be affected either, and polling
> > > mode IO is not affected, as expected.
> > > 
> > > So in addition to review and feedback, this is a call for testing
> > > and benchmarking, as this touches the critical data path.
> > 
> > Hi Sagi,
> > 
> > Just reconfirming your findings with another data point, I ran this on
> > controllers with 3D Xpoint media, and single depth 4k random read latency
> > increased almost 7%. I'll try to see if there's anything else we can do to
> > bring that in.
> 
> Actually I didn't notice latency increase with QD=1 but I'm using
> low-end devices so I might have missed it.
> Did you use libaio or psync (for polling mode)?

I used 'sync' ioengine for random read. For sequential, I used 'dd',
and it showed the same difference.

If I use pvsync2 with --hipri in fio, then I see little to no difference.

> I'm a bit surprised that scheduling soft-irq (on the same core) is
> so expensive (the networking folks are using it all over...)
> Perhaps we need to look into napi and see if we're doing something
> wrong there...

The latency increase I observed was ~0.5 microseconds. That may get lost
in the noise for most controllers. Maybe that's within the expected
increase using this soft-irq method?
 
> I wonder what would happen if we kept the cq processing in queue_rq
> but budgeted it to something reasonably balanced? Maybe a poll budget
> of 4 completions?
>
> Does this have any effect with your Xpoint?

No difference with that.

I was actually never sure about having this opportunistic polling in the
IO submission path. For one, it can create higher latency outliers if
a different controller's interrupt affinitized to the same CPU posted
a completion first.

Another thing we can do if we don't poll for completions in the submission
path is split the single q_lock into submission and completion locks. Then
we don't have to disable irqs in the submission path. I saw that demo'ed
a while ago, and it was a small micro-improvement, but never saw the
patch proposed on the public lists...

> --
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 150941a1a730..28b33f518a3d 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -608,6 +608,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>                 goto out;
>         }
>         __nvme_submit_cmd(nvmeq, &cmnd);
> +       __nvme_process_cq(nvmeq, 4);
>         spin_unlock_irq(&nvmeq->q_lock);
>         return BLK_MQ_RQ_QUEUE_OK;
>  out:
> --


* [PATCH rfc 1/6] nvme-pci: Split __nvme_process_cq to poll and handle
  2016-10-05 16:52     ` Sagi Grimberg
  2016-10-05 19:49       ` Jon Derrick
@ 2016-10-10  7:25       ` Johannes Thumshirn
  1 sibling, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-10  7:25 UTC (permalink / raw)


On Wed, Oct 05, 2016@07:52:02PM +0300, Sagi Grimberg wrote:
> 
> > > +static int __nvme_process_cq(struct nvme_queue *nvmeq, int *tag)
> > > +{
> > > +	struct nvme_completion cqe;
> > > +	int consumed = 0;
> > > 
> > > -	}
> > > +	while (nvme_read_cqe(nvmeq, &cqe)) {
> > > +		nvme_handle_cqe(nvmeq, &cqe);
> > > 
> > > -	/* If the controller ignores the cq head doorbell and continuously
> > > -	 * writes to the queue, it is theoretically possible to wrap around
> > > -	 * the queue twice and mistakenly return IRQ_NONE.  Linux only
> > > -	 * requires that 0.1% of your interrupts are handled, so this isn't
> > > -	 * a big problem.
> > > -	 */
> > > -	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
> > > -		return;
> > > +		if (tag && *tag == cqe.command_id) {
> > > +			*tag = -1;
> > > +			break;
> > > +		}
> > > +	}
> > > 
> > > -	if (likely(nvmeq->cq_vector >= 0))
> > > -		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
> > > -	nvmeq->cq_head = head;
> > > -	nvmeq->cq_phase = phase;
> > > +	if (consumed)
> > > +		nvme_ring_cq_doorbell(nvmeq);
> > > 
> > > -	nvmeq->cqe_seen = 1;
> > > +	return consumed;
> > >  }
> > 
> > Won't 'consumed' always be 0 here and we thus never call
> > nvme_ring_cq_doorbell()? Am I overlooking something here, or is this
> > just for preparation of later patches?
> 
> That increment was lost in a squash. I'll restore it in v2.

OK, sounds good.

Please add my 
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>
to said v2.

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 3/6] nvme-pci: Use irq-poll for completion processing
  2016-10-05 16:57     ` Sagi Grimberg
@ 2016-10-10  7:47       ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-10  7:47 UTC (permalink / raw)


On Wed, Oct 05, 2016@07:57:11PM +0300, Sagi Grimberg wrote:
> 
> > > @@ -49,6 +50,7 @@
> > >  #define NVME_AQ_DEPTH		256
> > >  #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
> > >  #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
> > > +#define NVME_POLL_BUDGET_IRQ	256
> > 
> > Is there a reason for the 256 or is it just a nicely suited value?
> 
> Umm, no good reason behind it; I used this value before for irq-poll
> and it seemed to fit best. We can try other budgets or make it
> configurable. The point is not to abuse the soft-irq context for too
> long and to maintain fairness between completion queues, so we just
> need a reasonable value.

Ah OK. I'm not sure whether we should impose a limit (be it a wild guess or a
measurement) or make it adjustable via sysfs. IIRC there is a sysctl to tune
the NAPI budget in the networking stack, which then again raises the question
of doing it globally or on a per-controller basis.

> 
> 
> > Especially
> > as you've been using INT_MAX as a budget for CQ processing before (which I
> > think is kinda high).
> 
> That was just for keeping the existing logic before we change to
> irq-poll.

Sounds reasonable.

Anyway, I'm looking forward to getting this merged. We can sort out the
tunability later on, I think.

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 5/6] nvme-pci: open-code polling logic in nvme_poll
  2016-10-05 17:02     ` Sagi Grimberg
@ 2016-10-10  7:49       ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-10  7:49 UTC (permalink / raw)


On Wed, Oct 05, 2016@08:02:15PM +0300, Sagi Grimberg wrote:
> 
> > > Given that the code is simple enough, this seems better
> > > than passing a tag by reference at each call site.
> > > 
> > > Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> > > ---
> > >  drivers/nvme/host/pci.c | 36 +++++++++++++++++++++---------------
> > >  1 file changed, 21 insertions(+), 15 deletions(-)
> > > 
> > 
> > I don't really see how the call-by-value vs. call-by-reference change is a
> > better trade-off than open coding __nvme_process_cq(), as we'd then have to
> > duplicate any possible fix for this function (yes, I see it's quite simple
> > but I have some doubt it's 100% bug-free).
> 
> I just figured it'd be nicer to remove the tag that we carry in all
> the call sites because we want to reuse the code in nvme_poll(). It
> made better sense before, but now that it's simpler (for me at least)
> we can lose it.
> 
> Having said that, we can easily remove this patch if we're not all
> on board...

Would be nice to have a 3rd opinion here. Jens, Keith, Christoph?

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-05  9:42 [PATCH rfc 0/6] convert nvme pci to use irq-poll service Sagi Grimberg
                   ` (6 preceding siblings ...)
  2016-10-05 21:47 ` [PATCH rfc 0/6] convert nvme pci to use irq-poll service Keith Busch
@ 2016-10-27 10:09 ` Johannes Thumshirn
  2016-10-27 10:31   ` Sagi Grimberg
  7 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-27 10:09 UTC (permalink / raw)


On Wed, Oct 05, 2016@12:42:08PM +0300, Sagi Grimberg wrote:
> Currently we have a couple of problems with our completion processing scheme:
> 
> 1. We abuse the hard-irq context by doing completion polling from it
> (which is why the threaded interrupts mode was introduced). We can
> possibly stay there for too long, causing a hard-lockup (easily
> triggered by running heavy randread workloads on systems with lots of
> cpu cores).
> 
> 2. We lack fairness between completion queues that share the same
> MSI/MSIX assignment (completion queues that belong to different
> devices). We need to drain a completion queue completely before
> we can process another completion queue.
> 
> The irq-poll service solves both by correctly budgeting the completion
> processing contexts and keeping per-cpu queues of completion queues.
> As a bonus, using it reduces the overall number of nvme interrupts in
> the system.
> 
> I ran some tests with this and it seemed to work pretty well with
> my low-end nvme devices. One phenomenon I've encountered was that
> for a single-core, long queue-depth randread workload I saw around
> an 8-10% iops decrease. However, when running multi-core IO I didn't
> see any noticeable performance degradation. Non-polling canonical
> randread latency doesn't seem to be affected either, and polling
> mode IO is not affected, as expected.
> 
> So in addition to review and feedback, this is a call for testing
> and benchmarking, as this touches the critical data path.
> 
> Sagi Grimberg (6):
>   nvme-pci: Split __nvme_process_cq to poll and handle
>   nvme-pci: Add budget to __nvme_process_cq
>   nvme-pci: Use irq-poll for completion processing
>   nvme: don't consume cq in queue_rq
>   nvme-pci: open-code polling logic in nvme_poll
>   nvme-pci: Get rid of threaded interrupts
> 
>  drivers/nvme/host/Kconfig |   1 +
>  drivers/nvme/host/pci.c   | 179 ++++++++++++++++++++++++++--------------------
>  2 files changed, 101 insertions(+), 79 deletions(-)
> 
> -- 
> 2.7.4

Hi Sagi,

I'm wondering if I've somehow missed the v2 of this series? 

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-27 10:09 ` Johannes Thumshirn
@ 2016-10-27 10:31   ` Sagi Grimberg
  2016-10-27 10:58     ` Johannes Thumshirn
  0 siblings, 1 reply; 27+ messages in thread
From: Sagi Grimberg @ 2016-10-27 10:31 UTC (permalink / raw)



> Hi Sagi,

Hey Johannes,

> I'm wondering if I've somehow missed the v2 of this series?

You didn't miss it.

I currently have it in:
git://git.infradead.org/nvme-fabrics.git   nvme-irq-poll

From my tests with Qemu nvme I still see ~10% performance
degradation for 4K IOs and cannot explain where it's
coming from. But I think we need tests on real setups as
well.

Hence, I don't feel comfortable getting this in at this point.
I would appreciate it if people could also try it out and provide
some insight (Jens, Christoph, Keith).

The networking folks are using napi everywhere and can reach
much lower latencies than NVMe so IMO we just need to identify
some gaps. I have some micro-optimization patches for irq-poll
but they don't seem to make a real difference.


* [PATCH rfc 0/6] convert nvme pci to use irq-poll service
  2016-10-27 10:31   ` Sagi Grimberg
@ 2016-10-27 10:58     ` Johannes Thumshirn
  0 siblings, 0 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2016-10-27 10:58 UTC (permalink / raw)


On Thu, Oct 27, 2016@01:31:47PM +0300, Sagi Grimberg wrote:
> 
> > Hi Sagi,
> 
> Hey Johannes,
> 
> > I'm wondering if I've somehow missed the v2 of this series?
> 
> You didn't miss it.
> 
> I currently have it in:
> git://git.infradead.org/nvme-fabrics.git   nvme-irq-poll

Pulled, thanks.

> 
> From my tests with Qemu nvme I still see ~10% performance
> degradation for 4K IOs and cannot explain where it's
> coming from. But I think we need tests on real setups as
> well.

Are you just using the Qemu NVMe over PCI emulation or some Fabrics thingy?
Maybe I can have a look at it as well and at least act as your test monkey
here.

> 
> Hence, I don't feel comfortable getting this in at this point.
> I would appreciate if people can also try it out and provide
> some insight (Jens, Christoph, Keith).
> 
> The networking folks are using napi everywhere and can reach
> much lower latencies than NVMe so IMO we just need to identify
> some gaps. I have some micro-optimization patches for irq-poll
> but they don't seem to make a real difference.

I'll have a look.

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

