Linux-NVME Archive on lore.kernel.org
* [PATCHv3 0/4] nvme pci interrupt handling improvements
@ 2019-12-09 17:56 Keith Busch
  2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Keith Busch @ 2019-12-09 17:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, Keith Busch, tglx, hch

Here's the next revision to nvme pci interrupt handling. This series
attempts to address two issues that were recently raised:

  1. The nvme driver makes it possible to soft lockup a CPU due to high
     utilization in irq context. This is most prevalent when multiple
     CPUs are driving a single hardware queue on one or more controllers.

  2. The threaded handler left interrupts unmasked, which breaks when
     used with level triggered interrupts, or unnecessarily runs
     in interrupt context when edge interrupts occur frequently.

Both issues are addressed by always configuring nvme interrupts to
run the threaded handler with interrupts disabled. A hybrid approach
to handling nvme completions in hard irq context and thread context is
introduced so that removing the nvme.use_threaded_interrupts option
should not have a performance impact.

From what I can test, the series is either a win or has no impact
on performance. I would be grateful to hear whether others can confirm
this on other hardware.

I've dropped the fast MSIx masking. While I believe it's safe to skip
flushing the memory write, I think this series mitigates the impact of
the read back by ensuring the ratio of memory reads to IO is low enough
to be negligible (AFAICT on hardware available to me).

I've changed the exit condition for the polling nvme irq thread to
break out of the loop once we've wrapped the completion queue. Other irq
threads may be affinitized to the same CPU, so we need to schedule out
at some point, but I've been told multiple times that need_resched()
or cond_resched() won't work as desired given the thread's FIFO priority.

Keith Busch (4):
  nvme/pci: Disable interrupts for threaded handler
  nvme/pci: Complete commands from primary handler
  nvme/pci: Remove use_threaded_interrupts
  nvme/pci: Poll threaded completions

 drivers/nvme/host/pci.c | 56 ++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 20 deletions(-)

-- 
2.21.0


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-09 17:56 [PATCHv3 0/4] nvme pci interrupt handling improvements Keith Busch
@ 2019-12-09 17:56 ` Keith Busch
  2019-12-10 15:12   ` Daniel Wagner
  2019-12-12  9:09   ` Christoph Hellwig
  2019-12-09 17:56 ` [PATCHv3 2/4] nvme/pci: Complete commands from primary handler Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Keith Busch @ 2019-12-09 17:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, Keith Busch, tglx, hch

The nvme driver had always left the device in a state capable of sending
more interrupts while completions were being handled in the thread.
Re-entering the primary handler disrupts the threaded handler since
they run on the same CPU. The primary handler may also be detected as
spurious when it returns IRQ_WAKE_THREAD if the threaded handler does
not finish frequently enough due to completing many requests.

Disable the irq to ensure the primary handler will not be re-entered while
the threaded handler is running. Use the nvme specific interrupt mask
set/clear registers for MSI and legacy, and rely on the irq_flow_handler
to mask the interrupt for MSIx.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 365a2ddbeaa7..a0138e3ca0b9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1036,12 +1036,30 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_thread(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	irqreturn_t ret = nvme_irq(irq, data);
+
+	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);
+
+	enable_irq(irq);
+	return ret;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	if (nvme_cqe_pending(nvmeq))
-		return IRQ_WAKE_THREAD;
-	return IRQ_NONE;
+
+	if (!nvme_cqe_pending(nvmeq))
+		return IRQ_NONE;
+
+	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
+
+	disable_irq_nosync(irq);
+	return IRQ_WAKE_THREAD;
 }
 
 /*
@@ -1499,7 +1517,7 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 
 	if (use_threaded_interrupts) {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+			nvme_irq_thread, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
-- 
2.21.0



* [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-09 17:56 [PATCHv3 0/4] nvme pci interrupt handling improvements Keith Busch
  2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
@ 2019-12-09 17:56 ` Keith Busch
  2019-12-10 20:00   ` Sagi Grimberg
  2019-12-12  9:14   ` Christoph Hellwig
  2019-12-09 17:56 ` [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts Keith Busch
  2019-12-09 17:56 ` [PATCHv3 4/4] nvme/pci: Poll threaded completions Keith Busch
  3 siblings, 2 replies; 29+ messages in thread
From: Keith Busch @ 2019-12-09 17:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, Keith Busch, tglx, hch

The context switch to wake the threaded interrupt handler has enough
overhead to harm latency for low queue depth workloads. Complete one
cycle through the completion queue in the primary handler so that
threaded interrupts have no disadvantage compared to not using them.
Wake the thread only if more completions remain after processing the
queue once, since that indicates a higher queue depth workload that may
trigger frequent interrupts, which are better completed in the handler
thread with device interrupts disabled.

Completing requests from two different contexts is safe since the nvme
driver disables entry to the primary handler when it returns
IRQ_WAKE_THREAD.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a0138e3ca0b9..e415fadf7331 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1055,6 +1055,10 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	if (!nvme_cqe_pending(nvmeq))
 		return IRQ_NONE;
 
+	nvme_irq(irq, data);
+	if (!nvme_cqe_pending(nvmeq))
+		return IRQ_HANDLED;
+
 	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
 		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
 
-- 
2.21.0



* [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-09 17:56 [PATCHv3 0/4] nvme pci interrupt handling improvements Keith Busch
  2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
  2019-12-09 17:56 ` [PATCHv3 2/4] nvme/pci: Complete commands from primary handler Keith Busch
@ 2019-12-09 17:56 ` Keith Busch
  2019-12-12  9:14   ` Christoph Hellwig
  2019-12-09 17:56 ` [PATCHv3 4/4] nvme/pci: Poll threaded completions Keith Busch
  3 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-09 17:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, Keith Busch, tglx, hch

The nvme threaded interrupt handling provides quick completions for
latency sensitive workloads, and threaded handlers for more IOPS intensive
ones. Remove the use_threaded_interrupts nvme module parameter since
leaving it disabled no longer provides a benefit.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e415fadf7331..28e08c5ab412 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -40,9 +40,6 @@
 #define NVME_MAX_KB_SZ	4096
 #define NVME_MAX_SEGS	127
 
-static int use_threaded_interrupts;
-module_param(use_threaded_interrupts, int, 0);
-
 static bool use_cmb_sqes = true;
 module_param(use_cmb_sqes, bool, 0444);
 MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");
@@ -1519,13 +1516,8 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
 	int nr = nvmeq->dev->ctrl.instance;
 
-	if (use_threaded_interrupts) {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-			nvme_irq_thread, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
-	} else {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
-				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
-	}
+	return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
+		nvme_irq_thread, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
-- 
2.21.0



* [PATCHv3 4/4] nvme/pci: Poll threaded completions
  2019-12-09 17:56 [PATCHv3 0/4] nvme pci interrupt handling improvements Keith Busch
                   ` (2 preceding siblings ...)
  2019-12-09 17:56 ` [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts Keith Busch
@ 2019-12-09 17:56 ` Keith Busch
  2019-12-10 17:43   ` Daniel Wagner
  3 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-09 17:56 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, Keith Busch, tglx, hch

Poll for new completions in the nvme irq thread for as long as it
continues to observe that new ones are available.

We need to ensure other irq threads affinitized to the same CPU will
get a chance to run, so exit polling if we wrap the completion queue.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 28e08c5ab412..487c6b3858fe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1011,38 +1011,40 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
-static irqreturn_t nvme_irq(int irq, void *data)
+static int nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	irqreturn_t ret = IRQ_NONE;
 	u16 start, end;
+	int found;
 
 	/*
 	 * The rmb/wmb pair ensures we see all updates from a previous run of
 	 * the irq handler, even if that was on another CPU.
 	 */
 	rmb();
-	nvme_process_cq(nvmeq, &start, &end, -1);
+	found = nvme_process_cq(nvmeq, &start, &end, -1);
 	wmb();
 
-	if (start != end) {
+	if (start != end)
 		nvme_complete_cqes(nvmeq, start, end);
-		return IRQ_HANDLED;
-	}
 
-	return ret;
+	return found;
 }
 
 static irqreturn_t nvme_irq_thread(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	irqreturn_t ret = nvme_irq(irq, data);
+	int found, total = 0;
+
+	do {
+		found = nvme_irq(irq, data);
+		total += found;
+	} while (found && total < nvmeq->q_depth * 2);
 
 	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
 		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);
-
 	enable_irq(irq);
-	return ret;
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t nvme_irq_check(int irq, void *data)
-- 
2.21.0



* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
@ 2019-12-10 15:12   ` Daniel Wagner
  2019-12-10 15:28     ` Sebastian Andrzej Siewior
  2019-12-12  9:09   ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2019-12-10 15:12 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch

Hi Keith,

On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
>  static irqreturn_t nvme_irq_check(int irq, void *data)
>  {
>  	struct nvme_queue *nvmeq = data;
> -	if (nvme_cqe_pending(nvmeq))
> -		return IRQ_WAKE_THREAD;
> -	return IRQ_NONE;
> +
> +	if (!nvme_cqe_pending(nvmeq))
> +		return IRQ_NONE;

IIRC, returning IRQ_NONE tells the irq core code that this is
an unhandled IRQ. If the core sees too many of them it will print
'spurious irq' message. See __handle_irq_event_percpu,
handle_irq_event_percpu and note_interrupt

Thanks,
Daniel


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-10 15:12   ` Daniel Wagner
@ 2019-12-10 15:28     ` Sebastian Andrzej Siewior
  2019-12-10 15:54       ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-10 15:28 UTC (permalink / raw)
  To: Daniel Wagner; +Cc: sagi, linux-nvme, ming.lei, Keith Busch, tglx, hch

On 2019-12-10 16:12:57 [+0100], Daniel Wagner wrote:
> Hi Keith,
> 
> On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> >  static irqreturn_t nvme_irq_check(int irq, void *data)
> >  {
> >  	struct nvme_queue *nvmeq = data;
> > -	if (nvme_cqe_pending(nvmeq))
> > -		return IRQ_WAKE_THREAD;
> > -	return IRQ_NONE;
> > +
> > +	if (!nvme_cqe_pending(nvmeq))
> > +		return IRQ_NONE;
> 
> IIRC, returning IRQ_NONE tells the irq core code that this is
> an unhandled IRQ. If the core sees too many of them it will print
> 'spurious irq' message. See __handle_irq_event_percpu,
> handle_irq_event_percpu and note_interrupt

This is okay / intended. If nvme_cqe_pending() does not recognise that
this interrupt belongs to that device then it should return IRQ_NONE.

> Thanks,
> Daniel

Sebastian


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-10 15:28     ` Sebastian Andrzej Siewior
@ 2019-12-10 15:54       ` Keith Busch
  2019-12-10 16:44         ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-10 15:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: sagi, Daniel Wagner, linux-nvme, ming.lei, tglx, hch

On Tue, Dec 10, 2019 at 04:28:40PM +0100, Sebastian Andrzej Siewior wrote:
> On 2019-12-10 16:12:57 [+0100], Daniel Wagner wrote:
> > On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> > >  static irqreturn_t nvme_irq_check(int irq, void *data)
> > >  {
> > >  	struct nvme_queue *nvmeq = data;
> > > -	if (nvme_cqe_pending(nvmeq))
> > > -		return IRQ_WAKE_THREAD;
> > > -	return IRQ_NONE;
> > > +
> > > +	if (!nvme_cqe_pending(nvmeq))
> > > +		return IRQ_NONE;
> > 
> > IIRC, returning IRQ_NONE tells the irq core code that this is
> > an unhandled IRQ. If the core sees too many of them it will print
> > 'spurious irq' message. See __handle_irq_event_percpu,
> > handle_irq_event_percpu and note_interrupt
> 
> This is okay / intended. If nvme_cqe_pending() does not recognise that
> this interrupt belongs to that device then it should return IRQ_NONE.

Also note this patch doesn't change the condition when IRQ_NONE is
returned. We've been doing that since nvme's beginning, as it really is
a spurious interrupt from a likely malfunctioning device.


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-10 15:54       ` Keith Busch
@ 2019-12-10 16:44         ` Daniel Wagner
  2019-12-10 16:57           ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2019-12-10 16:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: sagi, Sebastian Andrzej Siewior, linux-nvme, ming.lei, tglx, hch

On Wed, Dec 11, 2019 at 12:54:33AM +0900, Keith Busch wrote:
> On Tue, Dec 10, 2019 at 04:28:40PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2019-12-10 16:12:57 [+0100], Daniel Wagner wrote:
> > > On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> > > >  static irqreturn_t nvme_irq_check(int irq, void *data)
> > > >  {
> > > >  	struct nvme_queue *nvmeq = data;
> > > > -	if (nvme_cqe_pending(nvmeq))
> > > > -		return IRQ_WAKE_THREAD;
> > > > -	return IRQ_NONE;
> > > > +
> > > > +	if (!nvme_cqe_pending(nvmeq))
> > > > +		return IRQ_NONE;
> > > 
> > > IIRC, returning IRQ_NONE tells the irq core code that this is
> > > an unhandled IRQ. If the core sees too many of them it will print
> > > 'spurious irq' message. See __handle_irq_event_percpu,
> > > handle_irq_event_percpu and note_interrupt
> > 
> > This is okay / intended. If nvme_cqe_pending() does not recognise that
> > this interrupt belongs to that device then it should return IRQ_NONE.
> 
> Also note this patch doesn't change the condition when IRQ_NONE is
> returned. We've been doing that since nvme's beginning, as it really is
> a spurious interrupt from a likely malfunctioning device.

Sure, I am just a bit confused how IRQ_NONE is used. The comment on
nvme_cqe_pending says "We read the CQE phase first to check if the
rest of the entry is valid", which I read as "check if we have
anything to do" and not "this irq doesn't belong to this queue".


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-10 16:44         ` Daniel Wagner
@ 2019-12-10 16:57           ` Keith Busch
  2019-12-10 17:11             ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-10 16:57 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: sagi, Sebastian Andrzej Siewior, linux-nvme, ming.lei, tglx, hch

On Tue, Dec 10, 2019 at 05:44:56PM +0100, Daniel Wagner wrote:
> On Wed, Dec 11, 2019 at 12:54:33AM +0900, Keith Busch wrote:
> > On Tue, Dec 10, 2019 at 04:28:40PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2019-12-10 16:12:57 [+0100], Daniel Wagner wrote:
> > > > On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> > > > >  static irqreturn_t nvme_irq_check(int irq, void *data)
> > > > >  {
> > > > >  	struct nvme_queue *nvmeq = data;
> > > > > -	if (nvme_cqe_pending(nvmeq))
> > > > > -		return IRQ_WAKE_THREAD;
> > > > > -	return IRQ_NONE;
> > > > > +
> > > > > +	if (!nvme_cqe_pending(nvmeq))
> > > > > +		return IRQ_NONE;
> > > > 
> > > > IIRC, returning IRQ_NONE tells the irq core code that this is
> > > > an unhandled IRQ. If the core sees too many of them it will print
> > > > 'spurious irq' message. See __handle_irq_event_percpu,
> > > > handle_irq_event_percpu and note_interrupt
> > > 
> > > This is okay / intended. If nvme_cqe_pending() does not recognise that
> > > this interrupt belongs to that device then it should return IRQ_NONE.
> > 
> > Also note this patch doesn't change the condition when IRQ_NONE is
> > returned. We've been doing that since nvme's beginning, as it really is
> > a spurious interrupt from a likely malfunctioning device.
> 
> Sure, I am just a bit confused how IRQ_NONE is used. The comment on
> nvme_cqe_pending says "We read the CQE phase first to check if the
> rest of the entry is valid", which I read as "check if we have
> anything to do" and not "this irq doesn't belong to this queue".

Those two interpretations sound like the same thing to me. :)

The interrupt notifies the driver it has something to do, and reading
the CQE phase confirms whether or not that's true.


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-10 16:57           ` Keith Busch
@ 2019-12-10 17:11             ` Daniel Wagner
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2019-12-10 17:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: sagi, Sebastian Andrzej Siewior, linux-nvme, ming.lei, tglx, hch

On Wed, Dec 11, 2019 at 01:57:04AM +0900, Keith Busch wrote:
> On Tue, Dec 10, 2019 at 05:44:56PM +0100, Daniel Wagner wrote:
> > Sure, I am just a bit confused how IRQ_NONE is used. The comment on
> > nvme_cqe_pending says "We read the CQE phase first to check if the
> > rest of the entry is valid", which I read as "check if we have
> > anything to do" and not "this irq doesn't belong to this queue".
> 
> Those two interpretations sound like the same thing to me. :)

:)

I was confused why 'nothing to do' is then interpreted as IRQ was
routed wrongly.

> The interrupt notifies the driver it has something to do, and reading
> the CQE phase confirms whether or not that's true.

I think the thing I am missing is the design idea behind it. I suppose
the IRQ only fires when work was previously posted.

Sorry for the noise.


* Re: [PATCHv3 4/4] nvme/pci: Poll threaded completions
  2019-12-09 17:56 ` [PATCHv3 4/4] nvme/pci: Poll threaded completions Keith Busch
@ 2019-12-10 17:43   ` Daniel Wagner
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2019-12-10 17:43 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch

Hi Keith,

On Tue, Dec 10, 2019 at 02:56:22AM +0900, Keith Busch wrote:
> Poll for new completions in the nvme irq thread as long as it continues
> to observe new ones are available.
> 
> We need to ensure other irq threads affinitized to the same CPU will
> get a chance to run, so exit polling if we wrap the completion queue.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/nvme/host/pci.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 28e08c5ab412..487c6b3858fe 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1011,38 +1011,40 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
>  	return found;
>  }
>  
> -static irqreturn_t nvme_irq(int irq, void *data)
> +static int nvme_irq(int irq, void *data)
>  {
>  	struct nvme_queue *nvmeq = data;
> -	irqreturn_t ret = IRQ_NONE;
>  	u16 start, end;
> +	int found;
>  
>  	/*
>  	 * The rmb/wmb pair ensures we see all updates from a previous run of
>  	 * the irq handler, even if that was on another CPU.
>  	 */
>  	rmb();
> -	nvme_process_cq(nvmeq, &start, &end, -1);
> +	found = nvme_process_cq(nvmeq, &start, &end, -1);
>  	wmb();

Are the memory barries still at the right place after this change?

Thanks,
Daniel


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-09 17:56 ` [PATCHv3 2/4] nvme/pci: Complete commands from primary handler Keith Busch
@ 2019-12-10 20:00   ` Sagi Grimberg
  2019-12-10 20:25     ` Keith Busch
  2019-12-12  9:14   ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Sagi Grimberg @ 2019-12-10 20:00 UTC (permalink / raw)
  To: Keith Busch, linux-nvme; +Cc: bigeasy, hch, tglx, ming.lei


> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a0138e3ca0b9..e415fadf7331 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1055,6 +1055,10 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>   	if (!nvme_cqe_pending(nvmeq))
>   		return IRQ_NONE;
>   
> +	nvme_irq(irq, data);
> +	if (!nvme_cqe_pending(nvmeq))
> +		return IRQ_HANDLED;
> +

Question Keith,

If I have say 24 (or more) devices with a queue mapped to a cpu, and we
happen to just reap in the primary handler for all devices, all the 
time, are we safe from cpu lockup?


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-10 20:00   ` Sagi Grimberg
@ 2019-12-10 20:25     ` Keith Busch
  2019-12-10 21:14       ` Sagi Grimberg
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-10 20:25 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: bigeasy, hch, tglx, linux-nvme, ming.lei

On Tue, Dec 10, 2019 at 12:00:07PM -0800, Sagi Grimberg wrote:
> 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index a0138e3ca0b9..e415fadf7331 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1055,6 +1055,10 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
> >   	if (!nvme_cqe_pending(nvmeq))
> >   		return IRQ_NONE;
> > +	nvme_irq(irq, data);
> > +	if (!nvme_cqe_pending(nvmeq))
> > +		return IRQ_HANDLED;
> > +
> 
> Question Keith,
> 
> If I have say 24 (or more) devices with a queue mapped to a cpu, and we
> happen to just reap in the primary handler for all devices, all the time,
> are we safe from cpu lockup?

I can't readily test that scenario, but I'm skeptical a workload
can keep the primary handler running without ever seeing it return
IRQ_WAKE_THREAD. If that is really a problem, we can mitigate it by
limiting the number of CQEs the primary handler may process.


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-10 20:25     ` Keith Busch
@ 2019-12-10 21:14       ` Sagi Grimberg
  2019-12-11 17:35         ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Sagi Grimberg @ 2019-12-10 21:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, tglx, hch, linux-nvme, ming.lei


>> Question Keith,
>>
>> If I have say 24 (or more) devices with a queue mapped to a cpu, and we
>> happen to just reap in the primary handler for all devices, all the time,
>> are we safe from cpu lockup?
> 
> I can't readily test that scenario, but I'm skeptical a workload
> can keep the primary handler running without ever seeing it return
> IRQ_WAKE_THREAD. If that is really a problem, we can mitigate it by
> limiting the number of CQEs the primary handler may process.

Theoretically speaking, even if you limit to 1 cqe, the universe can
align such that you will always reap in the primary handler right?
The more devices we have in the system mapping a vector to the
same cpu core will increase the likelihood of this happening.

I guess it is a corner case, but I'm thinking we want to be 100%
covered. Usually what drivers do is defer to a kthread/tasklet/napi
context without reaping in the primary handler.

So if we have this optimization, perhaps something else in the irq
infrastructure should take care of cpu lockup prevention?


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-10 21:14       ` Sagi Grimberg
@ 2019-12-11 17:35         ` Keith Busch
  2019-12-12  0:40           ` Sagi Grimberg
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-11 17:35 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: bigeasy, tglx, hch, linux-nvme, ming.lei

On Tue, Dec 10, 2019 at 01:14:19PM -0800, Sagi Grimberg wrote:
> > > If I have say 24 (or more) devices with a queue mapped to a cpu, and we
> > > happen to just reap in the primary handler for all devices, all the time,
> > > are we safe from cpu lockup?
> > 
> > I can't readily test that scenario, but I'm skeptical a workload
> > can keep the primary handler running without ever seeing it return
> > IRQ_WAKE_THREAD. If that is really a problem, we can mitigate it by
> > limiting the number of CQEs the primary handler may process.
> 
> Theoretically speaking, even if you limit to 1 cqe, the universe can
> align such that you will always reap in the primary handler right?

Perhaps theoretically, though testing the limits of reason.

> So if we have this optimization, perhaps something else in the irq
> infrastructure should take care of cpu lockup prevention?

Perhaps we can cycle the effective_affinity through the smp_affinity?


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-11 17:35         ` Keith Busch
@ 2019-12-12  0:40           ` Sagi Grimberg
  2019-12-12  1:02             ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Sagi Grimberg @ 2019-12-12  0:40 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, ming.lei, tglx, linux-nvme, hch


>>>> If I have say 24 (or more) devices with a queue mapped to a cpu, and we
>>>> happen to just reap in the primary handler for all devices, all the time,
>>>> are we safe from cpu lockup?
>>>
>>> I can't readily test that scenario, but I'm skeptical a workload
>>> can keep the primary handler running without ever seeing it return
>>> IRQ_WAKE_THREAD. If that is really a problem, we can mitigate it by
>>> limiting the number of CQEs the primary handler may process.
>>
>> Theoretically speaking, even if you limit to 1 cqe, the universe can
>> align such that you will always reap in the primary handler right?
> 
> Perhaps theoretically, though testing the limits of reason.

I know, maybe we should handle it when/if this really becomes a
problem...

>> So if we have this optimization, perhaps something else in the irq
>> infrastructure should take care of cpu lockup prevention?
> 
> Perhaps we can cycle the effective_affinity through the smp_affinity?

Not sure I follow your thoughts.


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-12  0:40           ` Sagi Grimberg
@ 2019-12-12  1:02             ` Keith Busch
  2019-12-12 22:55               ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-12  1:02 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: bigeasy, ming.lei, tglx, linux-nvme, hch

On Wed, Dec 11, 2019 at 04:40:47PM -0800, Sagi Grimberg wrote:
> > Perhaps we can cycle the effective_affinity through the smp_affinity?
> 
> Not sure I follow your thoughts.

The only way the nvme driver's interrupt handler can saturate a
cpu is if smp_affinity has multiple online cpus set. The
effective_affinity, however, is always a single cpu, which can be any
of the online ones set in smp_affinity.

If we can detect that a cpu is spending too many cycles in nvme irq,
perhaps we can re-spread the effective_affinity across different cpus. I
don't currently know of a good way to do this, though.


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
  2019-12-10 15:12   ` Daniel Wagner
@ 2019-12-12  9:09   ` Christoph Hellwig
  2019-12-12 15:53     ` Keith Busch
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2019-12-12  9:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch

On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> +	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
> +
> +	disable_irq_nosync(irq);

Why do we need both the device-level disable and disable_irq_nosync
for the non-MSI-X case here?


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-09 17:56 ` [PATCHv3 2/4] nvme/pci: Complete commands from primary handler Keith Busch
  2019-12-10 20:00   ` Sagi Grimberg
@ 2019-12-12  9:14   ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-12-12  9:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch

On Tue, Dec 10, 2019 at 02:56:20AM +0900, Keith Busch wrote:
> The context switch to wake the threaded interrupt handler has enough
> overhead to harm latency for low queue depth workloads. Complete one cycle
> through the completion queue in the primary handler to ensure threaded
> interrupts have no disadvantage compared to not using threaded interrupts.
> Wake the thread only if more completions remain after processing the queue
> once, since that indicates a higher queue depth workload that may trigger
> frequent interrupts, which are better completed in the handler thread
> with device interrupts disabled.
> 
> Completing requests from two different contexts is safe since the nvme
> driver disables entry to the primary handler when it returns IRQ_WAKE_THREAD.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/nvme/host/pci.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a0138e3ca0b9..e415fadf7331 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1055,6 +1055,10 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>  	if (!nvme_cqe_pending(nvmeq))
>  		return IRQ_NONE;
>  
> +	nvme_irq(irq, data);

I think we can just open code nvme_irq here.  That also means we can
use the return value of nvme_process_cq instead of the extra
nvme_cqe_pending check above.  This also needs a good comment explaining
the scheme at least.

> +	if (!nvme_cqe_pending(nvmeq))
> +		return IRQ_HANDLED;

Including one explaining this check.


* Re: [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-09 17:56 ` [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts Keith Busch
@ 2019-12-12  9:14   ` Christoph Hellwig
  2019-12-12 15:45     ` Keith Busch
  2019-12-18  7:29     ` Ming Lei
  0 siblings, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2019-12-12  9:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch

On Tue, Dec 10, 2019 at 02:56:21AM +0900, Keith Busch wrote:
> The nvme threaded interrupt handling provides quick completions for
> latency sensitive workloads, and threaded handlers for more IOPS intensive
> ones. Remove the use_threaded_interrupts nvme module parameter since
> leaving it disabled should not be providing a benefit.

I think we need some careful benchmarking and numbers to justify the
switch.


* Re: [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-12  9:14   ` Christoph Hellwig
@ 2019-12-12 15:45     ` Keith Busch
  2019-12-18  7:29     ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Keith Busch @ 2019-12-12 15:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bigeasy, tglx, sagi, linux-nvme, ming.lei

On Thu, Dec 12, 2019 at 10:14:33AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 10, 2019 at 02:56:21AM +0900, Keith Busch wrote:
> > The nvme threaded interrupt handling provides quick completions for
> > latency sensitive workloads, and threaded handlers for more IOPS intensive
> > ones. Remove the use_threaded_interrupts nvme module parameter since
> > leaving it disabled should not be providing a benefit.
> 
> I think we need some careful benchmarking and numbers to justify the
> switch.

Yes, I don't consider this 5.5 material, so we've some time.


* Re: [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler
  2019-12-12  9:09   ` Christoph Hellwig
@ 2019-12-12 15:53     ` Keith Busch
  0 siblings, 0 replies; 29+ messages in thread
From: Keith Busch @ 2019-12-12 15:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bigeasy, tglx, sagi, linux-nvme, ming.lei

On Thu, Dec 12, 2019 at 10:09:52AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 10, 2019 at 02:56:19AM +0900, Keith Busch wrote:
> > +	if (!to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> > +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
> > +
> > +	disable_irq_nosync(irq);
> 
> Why do we need both the device-level disable and disable_irq_nosync
> for the non-MSI-X case here?

There's a race condition that needs to be handled here. We mask the
interrupt on the device, but the device may have sent an MSI at the
same time that the APIC hasn't seen yet, and that would retrigger
the interrupt handler. We need that retriggered interrupt to see
irqd_irq_disabled() as true; otherwise the nvme primary handler and
thread may run at the same time, and we don't want both polling the cq
concurrently.


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-12  1:02             ` Keith Busch
@ 2019-12-12 22:55               ` Ming Lei
  2019-12-12 23:30                 ` Keith Busch
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-12-12 22:55 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, tglx, Sagi Grimberg, linux-nvme, hch

On Thu, Dec 12, 2019 at 10:02:40AM +0900, Keith Busch wrote:
> On Wed, Dec 11, 2019 at 04:40:47PM -0800, Sagi Grimberg wrote:
> > > Perhaps we can cycle the effective_affinity through the smp_affinity?
> > 
> > Not sure I follow your thoughts.
> 
> The only way the nvme driver's interrupt handler can saturate a
> cpu requires the smp_affinity have multiple online cpus set. The

As Sagi mentioned, there can be 24 drives, each with a 1:1 queue:cpu
mapping, so each CPU may have to handle interrupts from 24 drives/queues,
and the CPU may be too slow to keep up with those 24 hardware queues.

> effective_affinity, however, is always a single cpu, which can be any
> of the online ones set in smp_affinity.

> 
> If we can detect a cpu is spending too many cycles in nvme irq,
> perhaps we can re-spread the effective_affinity to different cpus. I
> don't currently know of a good way to do this, though.

That shouldn't be done just for nvme irq, given that other storage has
a similar issue.

I have tried going in that direction, and it looks like some cost has to
be paid, such as an extra read of the local clock in the interrupt or IO
path:

https://lore.kernel.org/lkml/20190827085344.30799-1-ming.lei@redhat.com/

It can be quite hard to figure out an efficient way to do that.


thanks,
Ming



* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-12 22:55               ` Ming Lei
@ 2019-12-12 23:30                 ` Keith Busch
  2019-12-13  0:52                   ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-12 23:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: bigeasy, tglx, Sagi Grimberg, linux-nvme, hch

On Fri, Dec 13, 2019 at 06:55:52AM +0800, Ming Lei wrote:
> On Thu, Dec 12, 2019 at 10:02:40AM +0900, Keith Busch wrote:
> > On Wed, Dec 11, 2019 at 04:40:47PM -0800, Sagi Grimberg wrote:
> > > > Perhaps we can cycle the effective_affinity through the smp_affinity?
> > > 
> > > Not sure I follow your thoughts.
> > 
> > The only way the nvme driver's interrupt handler can saturate a
> > cpu requires the smp_affinity have multiple online cpus set. The
> 
> As Sagi mentioned, there can be 24 drives, each drive can be 1:1
> mapping, so each CPU may have to handle interrupts from 24 drives/queues,
> then the CPU may become slower compared with the 24 hardware/queues.

If it's 1:1 queue:cpu, then it doesn't matter how many devices you
have. Processes can't submit commands on that cpu while that cpu
is servicing device interrupts, but the device can't send interrupts
if processes can't submit commands.


* Re: [PATCHv3 2/4] nvme/pci: Complete commands from primary handler
  2019-12-12 23:30                 ` Keith Busch
@ 2019-12-13  0:52                   ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-12-13  0:52 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, tglx, Sagi Grimberg, linux-nvme, hch

On Thu, Dec 12, 2019 at 04:30:39PM -0700, Keith Busch wrote:
> On Fri, Dec 13, 2019 at 06:55:52AM +0800, Ming Lei wrote:
> > On Thu, Dec 12, 2019 at 10:02:40AM +0900, Keith Busch wrote:
> > > On Wed, Dec 11, 2019 at 04:40:47PM -0800, Sagi Grimberg wrote:
> > > > > Perhaps we can cycle the effective_affinity through the smp_affinity?
> > > > 
> > > > Not sure I follow your thoughts.
> > > 
> > > The only way the nvme driver's interrupt handler can saturate a
> > > cpu requires the smp_affinity have multiple online cpus set. The
> > 
> > As Sagi mentioned, there can be 24 drives, each drive can be 1:1
> > mapping, so each CPU may have to handle interrupts from 24 drives/queues,
> > then the CPU may become slower compared with the 24 hardware/queues.
> 
> If it's 1:1 queue:cpu, then it doesn't matter how many devices you
> have. Processes can't submit commands on that cpu while that cpu
> is servicing device interrupts, but the device can't send interrupts
> if processes can't submit commands.
> 

OK, but I'm still not sure it is good to cycle the effective_affinity
through the smp_affinity:

1) the top half (primary interrupt handler) might still saturate the CPU

2) task migration isn't cheap from one CPU to another

3) in theory we only need to do this when an interrupt's target CPU is
saturated, but changing the thread's affinity has to be done beforehand,
so this may hurt performance when the CPU is never saturated, which I
believe is the usual case.


Thanks,
Ming



* Re: [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-12  9:14   ` Christoph Hellwig
  2019-12-12 15:45     ` Keith Busch
@ 2019-12-18  7:29     ` Ming Lei
  2019-12-18 15:50       ` Keith Busch
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2019-12-18  7:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, bigeasy, tglx, sagi, linux-nvme

On Thu, Dec 12, 2019 at 10:14:33AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 10, 2019 at 02:56:21AM +0900, Keith Busch wrote:
> > The nvme threaded interrupt handling provides quick completions for
> > latency sensitive workloads, and threaded handlers for more IOPS intensive
> > ones. Remove the use_threaded_interrupts nvme module parameter since
> > leaving it disabled should not be providing a benefit.
> 
> I think we need some careful benchmarking and numbers to justify the
> switch.

The patch can fix the CPU lockup on Azure's NVMe; however, IOPS drops
from 3M+ to ~600K.

Thanks,
Ming



* Re: [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-18  7:29     ` Ming Lei
@ 2019-12-18 15:50       ` Keith Busch
  2019-12-19  1:10         ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Busch @ 2019-12-18 15:50 UTC (permalink / raw)
  To: Ming Lei; +Cc: bigeasy, tglx, Christoph Hellwig, linux-nvme, sagi

On Wed, Dec 18, 2019 at 03:29:46PM +0800, Ming Lei wrote:
> On Thu, Dec 12, 2019 at 10:14:33AM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 10, 2019 at 02:56:21AM +0900, Keith Busch wrote:
> > > The nvme threaded interrupt handling provides quick completions for
> > > latency sensitive workloads, and threaded handlers for more IOPS intensive
> > > ones. Remove the use_threaded_interrupts nvme module parameter since
> > > leaving it disabled should not be providing a benefit.
> > 
> > I think we need some careful benchmarking and numbers to justify the
> > switch.
> 
> The patch can fix CPU lockup on Azure's NVMe, however, IOPS drops to
> ~600K from 3M+.

Compared to current mainline, or is that 3M+ with the poll completions
from submission?


* Re: [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts
  2019-12-18 15:50       ` Keith Busch
@ 2019-12-19  1:10         ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2019-12-19  1:10 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, tglx, Christoph Hellwig, linux-nvme, sagi

On Thu, Dec 19, 2019 at 12:50:50AM +0900, Keith Busch wrote:
> On Wed, Dec 18, 2019 at 03:29:46PM +0800, Ming Lei wrote:
> > On Thu, Dec 12, 2019 at 10:14:33AM +0100, Christoph Hellwig wrote:
> > > On Tue, Dec 10, 2019 at 02:56:21AM +0900, Keith Busch wrote:
> > > > The nvme threaded interrupt handling provides quick completions for
> > > > latency sensitive workloads, and threaded handlers for more IOPS intensive
> > > > ones. Remove the use_threaded_interrupts nvme module parameter since
> > > > leaving it disabled should not be providing a benefit.
> > > 
> > > I think we need some careful benchmarking and numbers to justify the
> > > switch.
> > 
> > The patch can fix CPU lockup on Azure's NVMe, however, IOPS drops to
> > ~600K from 3M+.
> 
> Compared to current mainline, or is that 3M+ with the poll completions
> from submission?

3M+ can be reached in mainline without poll completions from submission;
that optimization only improves IOPS in a single-job fio test.

This one is a multi-job, multi-drive test:

fio --bs=4k --ioengine=libaio --iodepth=32 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=400 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1


Thanks,
Ming



Thread overview: 29+ messages
2019-12-09 17:56 [PATCHv3 0/4] nvme pci interrupt handling improvements Keith Busch
2019-12-09 17:56 ` [PATCHv3 1/4] nvme/pci: Disable interrupts for threaded handler Keith Busch
2019-12-10 15:12   ` Daniel Wagner
2019-12-10 15:28     ` Sebastian Andrzej Siewior
2019-12-10 15:54       ` Keith Busch
2019-12-10 16:44         ` Daniel Wagner
2019-12-10 16:57           ` Keith Busch
2019-12-10 17:11             ` Daniel Wagner
2019-12-12  9:09   ` Christoph Hellwig
2019-12-12 15:53     ` Keith Busch
2019-12-09 17:56 ` [PATCHv3 2/4] nvme/pci: Complete commands from primary handler Keith Busch
2019-12-10 20:00   ` Sagi Grimberg
2019-12-10 20:25     ` Keith Busch
2019-12-10 21:14       ` Sagi Grimberg
2019-12-11 17:35         ` Keith Busch
2019-12-12  0:40           ` Sagi Grimberg
2019-12-12  1:02             ` Keith Busch
2019-12-12 22:55               ` Ming Lei
2019-12-12 23:30                 ` Keith Busch
2019-12-13  0:52                   ` Ming Lei
2019-12-12  9:14   ` Christoph Hellwig
2019-12-09 17:56 ` [PATCHv3 3/4] nvme/pci: Remove use_threaded_interrupts Keith Busch
2019-12-12  9:14   ` Christoph Hellwig
2019-12-12 15:45     ` Keith Busch
2019-12-18  7:29     ` Ming Lei
2019-12-18 15:50       ` Keith Busch
2019-12-19  1:10         ` Ming Lei
2019-12-09 17:56 ` [PATCHv3 4/4] nvme/pci: Poll threaded completions Keith Busch
2019-12-10 17:43   ` Daniel Wagner
