* [PATCH 0/4] nvme: Threaded interrupt handling improvements
@ 2019-11-27 17:58 Keith Busch
  2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
                   ` (5 more replies)
  0 siblings, 6 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-27 17:58 UTC (permalink / raw)
  To: linux-nvme, hch, sagi; +Cc: Keith Busch, bigeasy, helgaas, ming.lei

Threaded interrupts allow the device to continue sending interrupt
messages while the driver is handling the previous notification. This
can cause a significant number of CPU cycles to be spent unnecessarily
in hard irq context, and can trigger spurious interrupt detection,
disabling the nvme interrupt.

Use the appropriate masking for the interrupt type based on the NVMe
specification recommendations (see NVMe 1.4 section 7.5.1.1 for more
information).
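
In register terms the recommended sequence looks roughly like this (a
minimal sketch using the NVME_REG_INTMS/INTMC accesses from the patches
below):

	/* hard irq handler: mask further messages from this vector;
	 * INTMS is a plain MMIO register, so this is a posted write */
	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);

	/* ... the irq thread drains the completion queue ... */

	/* thread done: clear the mask so the device may interrupt again */
	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);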

The first patch just exports the fast MSIx masking so that low-depth
workloads don't suffer so much when using threaded interrupts.

The next two use the interrupt masking on the device for the different
types of interrupts.

The last patch is a performance improvement for high-depth workloads.

Keith Busch (4):
  PCI/MSI: Export __pci_msix_desc_mask_irq
  nvme/pci: Mask legacy and MSI in threaded handler
  nvme/pci: Mask MSIx interrupts for threaded handling
  nvme/pci: Spin threaded interrupt completions

 drivers/nvme/host/pci.c | 50 +++++++++++++++++++++++++++++++++++++++--
 drivers/pci/msi.c       |  1 +
 2 files changed, 49 insertions(+), 2 deletions(-)

-- 
2.21.0



* [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
@ 2019-11-27 17:58 ` Keith Busch
  2019-11-28  2:42   ` Sagi Grimberg
  2019-11-28  7:17   ` Christoph Hellwig
  2019-11-27 17:58 ` [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler Keith Busch
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-27 17:58 UTC (permalink / raw)
  To: linux-nvme, hch, sagi; +Cc: Keith Busch, bigeasy, helgaas, ming.lei

Export the fast msix mask function so that drivers may use it in
timing-sensitive contexts.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/msi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 0884bedcfc7a..9e866929f4b0 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -225,6 +225,7 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 
 	return mask_bits;
 }
+EXPORT_SYMBOL_GPL(__pci_msix_desc_mask_irq);
 
 static void msix_mask_irq(struct msi_desc *desc, u32 flag)
 {
-- 
2.21.0



* [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
  2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
@ 2019-11-27 17:58 ` Keith Busch
  2019-11-28  3:39   ` Ming Lei
  2019-11-27 17:58 ` [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling Keith Busch
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-27 17:58 UTC (permalink / raw)
  To: linux-nvme, hch, sagi; +Cc: Keith Busch, bigeasy, helgaas, ming.lei

Local interrupts are re-enabled when the nvme irq thread is
woken. Subsequent MSI or level-triggered legacy interrupts may restart
the nvme irq check while the thread handler is running. This unnecessarily
spends CPU cycles and potentially triggers spurious interrupt detection,
disabling our NVMe irq.

Use the NVMe interrupt mask/clear registers to disable controller
interrupts while the nvme bottom half processes completions.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9d307593b94f..c5b837cba730 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1048,6 +1048,28 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
+static irqreturn_t nvme_irq_thread_msi(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	struct nvme_dev *dev = nvmeq->dev;
+
+	nvme_irq(irq, data);
+	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t nvme_irq_check_msi(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+	struct nvme_dev *dev = nvmeq->dev;
+
+	if (nvme_cqe_pending(nvmeq)) {
+		writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);
+		return IRQ_WAKE_THREAD;
+	}
+	return IRQ_NONE;
+}
+
 /*
  * Poll for completions any queue, including those not dedicated to polling.
  * Can be called from any context.
@@ -1502,6 +1524,11 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 	int nr = nvmeq->dev->ctrl.instance;
 
 	if (use_threaded_interrupts) {
+		/* MSI and Legacy use the same NVMe IRQ masking */
+		if (!pdev->msix_enabled)
+			return pci_request_irq(pdev, nvmeq->cq_vector,
+				nvme_irq_check_msi, nvme_irq_thread_msi,
+				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
 				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
-- 
2.21.0



* [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
  2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
  2019-11-27 17:58 ` [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler Keith Busch
@ 2019-11-27 17:58 ` Keith Busch
  2019-11-28  7:19   ` Christoph Hellwig
  2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-27 17:58 UTC (permalink / raw)
  To: linux-nvme, hch, sagi; +Cc: Keith Busch, bigeasy, helgaas, ming.lei

The nvme irq thread, when enabled, may run for a while as new completions
are submitted. These completions may also send MSI messages, which could
be detected as spurious and disable the nvme irq.

Use the fast MSIx mask to keep the controller from sending MSIx
interrupts while the nvme bottom half thread is running.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c5b837cba730..571b33b69c5f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -13,11 +13,13 @@
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
+#include <linux/irq.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
+#include <linux/msi.h>
 #include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
@@ -1040,11 +1042,21 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_thread(int irq, void *data)
+{
+	nvme_irq(irq, data);
+	__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
+	return IRQ_HANDLED;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	if (nvme_cqe_pending(nvmeq))
+
+	if (nvme_cqe_pending(nvmeq)) {
+		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
 		return IRQ_WAKE_THREAD;
+	}
 	return IRQ_NONE;
 }
 
@@ -1530,7 +1542,8 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 				nvme_irq_check_msi, nvme_irq_thread_msi,
 				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+				nvme_irq_thread, nvmeq, "nvme%dq%d", nr,
+				nvmeq->qid);
 	} else {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
-- 
2.21.0



* [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
                   ` (2 preceding siblings ...)
  2019-11-27 17:58 ` [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling Keith Busch
@ 2019-11-27 17:58 ` Keith Busch
  2019-11-28  2:46   ` Sagi Grimberg
                     ` (2 more replies)
  2019-11-28  7:50 ` [PATCH 0/4] nvme: Threaded interrupt handling improvements Christoph Hellwig
  2019-11-29  9:46 ` Sebastian Andrzej Siewior
  5 siblings, 3 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-27 17:58 UTC (permalink / raw)
  To: linux-nvme, hch, sagi; +Cc: Keith Busch, bigeasy, helgaas, ming.lei

For deeply queued workloads, the nvme controller may be posting
new completions while the threaded interrupt handles previous
completions. Since the interrupts are masked, we can spin for these
completions for as long as new completions are being posted.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 571b33b69c5f..9ec0933eb120 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static void nvme_irq_spin(int irq, void *data)
+{
+	while (nvme_irq(irq, data) != IRQ_NONE)
+		cond_resched();
+}
+
 static irqreturn_t nvme_irq_thread(int irq, void *data)
 {
-	nvme_irq(irq, data);
+	nvme_irq_spin(irq, data);
 	__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
 	return IRQ_HANDLED;
 }
@@ -1065,7 +1071,7 @@ static irqreturn_t nvme_irq_thread_msi(int irq, void *data)
 	struct nvme_queue *nvmeq = data;
 	struct nvme_dev *dev = nvmeq->dev;
 
-	nvme_irq(irq, data);
+	nvme_irq_spin(irq, data);
 	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);
 	return IRQ_HANDLED;
 }
-- 
2.21.0



* Re: [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq
  2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
@ 2019-11-28  2:42   ` Sagi Grimberg
  2019-11-28  3:41     ` Keith Busch
  2019-11-28  7:17   ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Sagi Grimberg @ 2019-11-28  2:42 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch; +Cc: bigeasy, helgaas, ming.lei

> Export the fast msix mask function so that drivers may use it in
> timing-sensitive contexts.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/pci/msi.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 0884bedcfc7a..9e866929f4b0 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -225,6 +225,7 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
>   
>   	return mask_bits;
>   }
> +EXPORT_SYMBOL_GPL(__pci_msix_desc_mask_irq);
>   
>   static void msix_mask_irq(struct msi_desc *desc, u32 flag)
>   {
> 

Nice!

but why not export msix_mask_irq?

Is it possible that this is what made the irqpoll patch I sent
cause a performance hit? I used disable_irq_nosync...


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
@ 2019-11-28  2:46   ` Sagi Grimberg
  2019-11-28  3:28     ` Keith Busch
  2019-11-28  7:22   ` Christoph Hellwig
  2019-11-29  9:13   ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 37+ messages in thread
From: Sagi Grimberg @ 2019-11-28  2:46 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch; +Cc: bigeasy, helgaas, ming.lei

> For deeply queued workloads, the nvme controller may be posting
> new completions while the threaded interrupt handles previous
> completions. Since the interrupts are masked, we can spin for these
> completions for as long as new completions are being posted.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/nvme/host/pci.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 571b33b69c5f..9ec0933eb120 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
>   	return ret;
>   }
>   
> +static void nvme_irq_spin(int irq, void *data)
> +{
> +	while (nvme_irq(irq, data) != IRQ_NONE)
> +		cond_resched();

So the cond_resched should be fair to multiple devices mapped to the
same cpu core I assume.. did you happen to test it?


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-28  2:46   ` Sagi Grimberg
@ 2019-11-28  3:28     ` Keith Busch
  2019-11-28  3:51       ` Ming Lei
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-28  3:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: bigeasy, helgaas, hch, linux-nvme, ming.lei

On Wed, Nov 27, 2019 at 06:46:55PM -0800, Sagi Grimberg wrote:
> > For deeply queued workloads, the nvme controller may be posting
> > new completions while the threaded interrupt handles previous
> > completions. Since the interrupts are masked, we can spin for these
> > completions for as long as new completions are being posted.
> > 
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > ---
> >   drivers/nvme/host/pci.c | 10 ++++++++--
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 571b33b69c5f..9ec0933eb120 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
> >   	return ret;
> >   }
> > +static void nvme_irq_spin(int irq, void *data)
> > +{
> > +	while (nvme_irq(irq, data) != IRQ_NONE)
> > +		cond_resched();
> 
> So the cond_resched should be fair to multiple devices mapped to the
> same cpu core I assume.. did you happen to test it?

It should, but I'm having difficulty expressly testing that. Frequent
spinning here needs a single queue mapped to multiple CPUs, such that
one or more CPUs can constantly dispatch new requests. I've one test
where this spin never exits for the entire duration of an fio execution,
and /proc/interrupts confirms only 1 interrupt occurred for many millions
of IO.

When you have two or more devices with queues mapped to multiple CPUs,
their threaded interrupt handler affinities will not share the same CPU.

When we have per-cpu queues, all the devices' thread affinity will be
the same, but the while loop usually spins around only a couple times
because the submission side is sharing that same CPU. This naturally
throttles the number of completions the irq thread can observe, so the
thread ends up scheduling itself out without the cond_resched().


* Re: [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-27 17:58 ` [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler Keith Busch
@ 2019-11-28  3:39   ` Ming Lei
  2019-11-28  3:48     ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2019-11-28  3:39 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, helgaas, Thomas Gleixner, hch

On Thu, Nov 28, 2019 at 02:58:22AM +0900, Keith Busch wrote:
> Local interrupts are re-enabled when the nvme irq thread is
> woken. Subsequent MSI or level-triggered legacy interrupts may restart
> the nvme irq check while the thread handler is running. This unnecessarily
> spends CPU cycles and potentially triggers spurious interrupt detection,
> disabling our NVMe irq.
> 
> Use the NVMe interrupt mask/clear registers to disable controller
> interrupts while the nvme bottom half processes completions.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/nvme/host/pci.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9d307593b94f..c5b837cba730 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1048,6 +1048,28 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>  	return IRQ_NONE;
>  }
>  
> +static irqreturn_t nvme_irq_thread_msi(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +	struct nvme_dev *dev = nvmeq->dev;
> +
> +	nvme_irq(irq, data);
> +	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);
> +	return IRQ_HANDLED;
> +}
> +
> +static irqreturn_t nvme_irq_check_msi(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +	struct nvme_dev *dev = nvmeq->dev;
> +
> +	if (nvme_cqe_pending(nvmeq)) {
> +		writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);
> +		return IRQ_WAKE_THREAD;
> +	}
> +	return IRQ_NONE;
> +}
> +
>  /*
>   * Poll for completions any queue, including those not dedicated to polling.
>   * Can be called from any context.
> @@ -1502,6 +1524,11 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  	int nr = nvmeq->dev->ctrl.instance;
>  
>  	if (use_threaded_interrupts) {
> +		/* MSI and Legacy use the same NVMe IRQ masking */
> +		if (!pdev->msix_enabled)
> +			return pci_request_irq(pdev, nvmeq->cq_vector,
> +				nvme_irq_check_msi, nvme_irq_thread_msi,
> +				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
>  				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);

Just wondering why not do that for msix_enabled too, and according to the
documentation of request_threaded_irq(), the handler is supposed to
disable the device's interrupt:

 *      If you want to set up a threaded irq handler for your device
 *      then you need to supply @handler and @thread_fn. @handler is
 *      still called in hard interrupt context and has to check
 *      whether the interrupt originates from the device. If yes it
 *      needs to disable the interrupt on the device and return
 *      IRQ_WAKE_THREAD which will wake up the handler thread and run
 *      @thread_fn. This split handler design is necessary to support
 *      shared interrupts.

However, the MSI irq chip is said to be oneshot safe, see commit
923aa4c378f9 ("PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips"),
so the question is whether the interrupt mask is needed at all.


Thanks,
Ming



* Re: [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq
  2019-11-28  2:42   ` Sagi Grimberg
@ 2019-11-28  3:41     ` Keith Busch
  0 siblings, 0 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-28  3:41 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: bigeasy, helgaas, hch, linux-nvme, ming.lei

On Wed, Nov 27, 2019 at 06:42:06PM -0800, Sagi Grimberg wrote:
> > Export the fast msix mask function so that drivers may use it in
> > timing-sensitive contexts.
> > 
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > ---
> >   drivers/pci/msi.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 0884bedcfc7a..9e866929f4b0 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -225,6 +225,7 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
> >   	return mask_bits;
> >   }
> > +EXPORT_SYMBOL_GPL(__pci_msix_desc_mask_irq);
> >   static void msix_mask_irq(struct msi_desc *desc, u32 flag)
> >   {
> > 
> 
> Nice!
> 
> but why not export msix_mask_irq?

Only because __pci_msix_desc_mask_irq() was already defined in a public
header. :)
 
> Is it possible that this is what made the irqpoll patch I sent
> cause a performance hit? I used disable_irq_nosync...

I think so, yes, that's a relatively costly operation and would have a
measurable performance hit if called frequently. The exception might be
if you can set up a test where irqpoll always observes completions,
such that it rarely or never has to unmask interrupts.


* Re: [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-28  3:39   ` Ming Lei
@ 2019-11-28  3:48     ` Keith Busch
  2019-11-28  3:58       ` Ming Lei
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-28  3:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: sagi, bigeasy, linux-nvme, helgaas, Thomas Gleixner, hch

On Thu, Nov 28, 2019 at 11:39:56AM +0800, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 02:58:22AM +0900, Keith Busch wrote:
> > @@ -1502,6 +1524,11 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
> >  	int nr = nvmeq->dev->ctrl.instance;
> >  
> >  	if (use_threaded_interrupts) {
> > +		/* MSI and Legacy use the same NVMe IRQ masking */
> > +		if (!pdev->msix_enabled)
> > +			return pci_request_irq(pdev, nvmeq->cq_vector,
> > +				nvme_irq_check_msi, nvme_irq_thread_msi,
> > +				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> >  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> >  				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> 
> Just wondering why not do that for msix_enabled too, and according to the
> documentation of request_threaded_irq(), the handler is supposed to
> disable the device's interrupt:

MSI-x is handled in patch 3/4. I just split the two since the mechanisms
they use to mask interrupts are very different from each other.
 
> 923aa4c378f9 ("PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips"),
> so the question is whether the interrupt mask is needed at all.

We don't want to use IRQF_ONESHOT for our MSI interrupts because that
will write to the MSI mask config register, which is a costly non-posted
transaction. The NVMe specific way uses much faster posted writes.
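
The contrast, roughly (illustrative only; mask_pos here stands in for
the offset of the capability's mask register):

	u32 mask_bits;

	/* plain MSI masking: read-modify-write of the mask register in
	 * config space; config transactions are non-posted, so the CPU
	 * stalls waiting for each completion */
	pci_read_config_dword(pdev, mask_pos, &mask_bits);
	pci_write_config_dword(pdev, mask_pos, mask_bits | 1);

	/* NVMe-specific masking: a single posted MMIO write, no stall */
	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);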


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-28  3:28     ` Keith Busch
@ 2019-11-28  3:51       ` Ming Lei
  2019-11-28  3:58         ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2019-11-28  3:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: bigeasy, helgaas, Sagi Grimberg, linux-nvme, hch

On Thu, Nov 28, 2019 at 12:28:43PM +0900, Keith Busch wrote:
> On Wed, Nov 27, 2019 at 06:46:55PM -0800, Sagi Grimberg wrote:
> > > For deeply queued workloads, the nvme controller may be posting
> > > new completions while the threaded interrupt handles previous
> > > completions. Since the interrupts are masked, we can spin for these
> > > completions for as long as new completions are being posted.
> > > 
> > > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > > ---
> > >   drivers/nvme/host/pci.c | 10 ++++++++--
> > >   1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 571b33b69c5f..9ec0933eb120 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
> > >   	return ret;
> > >   }
> > > +static void nvme_irq_spin(int irq, void *data)
> > > +{
> > > +	while (nvme_irq(irq, data) != IRQ_NONE)
> > > +		cond_resched();
> > 
> > So the cond_resched should be fair to multiple devices mapped to the
> > same cpu core I assume.. did you happen to test it?
> 
> It should, but I'm having difficulty expressly testing that. Frequent
> spinning here needs a single queue mapped to multiple CPUs, such that
> one or more CPUs can constantly dispatch new requests. I've one test
> where this spin never exits for the entire duration of an fio execution,
> and /proc/interrupts confirms only 1 interrupt occured for many millions
> of IO.
> 
> When you have two or more devices with queues mapped to multiple CPUs,
> their threaded interrupt handler affinities will not share the same CPU.

They may still share the same CPU if there are enough NVMe drives; the
threaded interrupt handler actually takes the effective hard interrupt's
affinity, which may still point to the same CPU, in case of:

	nr_nvme_drives * nr_hw_queues > nr_cpus


Thanks,
Ming



* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-28  3:51       ` Ming Lei
@ 2019-11-28  3:58         ` Keith Busch
  0 siblings, 0 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-28  3:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: bigeasy, helgaas, Sagi Grimberg, linux-nvme, hch

On Thu, Nov 28, 2019 at 11:51:39AM +0800, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 12:28:43PM +0900, Keith Busch wrote:
> > 
> > When you have two or more devices with queues mapped to multiple CPUs,
> > their threaded interrupt handler affinities will not share the same CPU.
> 
> They may still share the same CPU if there are enough NVMe drives; the
> threaded interrupt handler actually takes the effective hard interrupt's
> affinity, which may still point to the same CPU, in case of:
> 
> 	nr_nvme_drives * nr_hw_queues > nr_cpus

Yeah, that's true. I'm just a bit constrained on devices, so having some
difficulty testing Sagi's scenario. I assume the cond_resched() provides
appropriate fairness here, but will need more time to instrument such a test.


* Re: [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-28  3:48     ` Keith Busch
@ 2019-11-28  3:58       ` Ming Lei
  2019-11-28  4:14         ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2019-11-28  3:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, helgaas, Thomas Gleixner, hch

On Thu, Nov 28, 2019 at 12:48:17PM +0900, Keith Busch wrote:
> On Thu, Nov 28, 2019 at 11:39:56AM +0800, Ming Lei wrote:
> > On Thu, Nov 28, 2019 at 02:58:22AM +0900, Keith Busch wrote:
> > > @@ -1502,6 +1524,11 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
> > >  	int nr = nvmeq->dev->ctrl.instance;
> > >  
> > >  	if (use_threaded_interrupts) {
> > > +		/* MSI and Legacy use the same NVMe IRQ masking */
> > > +		if (!pdev->msix_enabled)
> > > +			return pci_request_irq(pdev, nvmeq->cq_vector,
> > > +				nvme_irq_check_msi, nvme_irq_thread_msi,
> > > +				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > >  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> > >  				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > 
> > Just wondering why not do that for msix_enabled too, and according to the
> > documentation of request_threaded_irq(), the handler is supposed to
> > disable the device's interrupt:
> 
> MSI-x is handled in patch 3/4. I just split the two since the mechanisms
> they use to mask interrupts are very different from each other.

Fine.

>  
> > 923aa4c378f9 ("PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips"),
> > so the question is whether the interrupt mask is needed at all.
> 
> We don't want to use IRQF_ONESHOT for our MSI interrupts because that
> will write to the MSI mask config register, which is a costly non-posted
> transaction. The NVMe specific way uses much faster posted writes.

What I meant is that IRQF_ONESHOT isn't needed in the IRQCHIP_ONESHOT_SAFE case.

So it is reasonable to conclude that the interrupt mask isn't needed in the
hard interrupt handler when IRQCHIP_ONESHOT_SAFE is set. That is
basically what commit dc9b229a58dc ("genirq: Allow irq chips to mark themself
oneshot safe") does.

Thanks,
Ming



* Re: [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-28  3:58       ` Ming Lei
@ 2019-11-28  4:14         ` Keith Busch
  2019-11-28  8:41           ` Ming Lei
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-28  4:14 UTC (permalink / raw)
  To: Ming Lei; +Cc: sagi, bigeasy, linux-nvme, helgaas, Thomas Gleixner, hch

On Thu, Nov 28, 2019 at 11:58:53AM +0800, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 12:48:17PM +0900, Keith Busch wrote:
> > On Thu, Nov 28, 2019 at 11:39:56AM +0800, Ming Lei wrote:
> > > 923aa4c378f9 ("PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips"),
> > > so the question is whether the interrupt mask is needed at all.
> > 
> > We don't want to use IRQF_ONESHOT for our MSI interrupts because that
> > will write to the MSI mask config register, which is a costly non-posted
> > transaction. The NVMe specific way uses much faster posted writes.
> 
> What I meant is that IRQF_ONESHOT isn't needed in the IRQCHIP_ONESHOT_SAFE case.
> 
> So it is reasonable to conclude that the interrupt mask isn't needed in the
> hard interrupt handler when IRQCHIP_ONESHOT_SAFE is set. That is
> basically what commit dc9b229a58dc ("genirq: Allow irq chips to mark themself
> oneshot safe") does.

Hmm, it doesn't look like it's always safe. We have to stop the device
from generating MSIs for new completions somehow while the threaded
handler is running, otherwise those MSIs will be considered spurious
when the thread never gets a chance to increment desc->threads_handled.


* Re: [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq
  2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
  2019-11-28  2:42   ` Sagi Grimberg
@ 2019-11-28  7:17   ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-11-28  7:17 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, tglx, hch

On Thu, Nov 28, 2019 at 02:58:21AM +0900, Keith Busch wrote:
> Export the fast msix mask function so that drivers may use it in
> timing-sensitive contexts.

Always good to add Thomas for interesting irq handling bits..

I'm not sure directly calling this API is a good idea, as it basically
ends up being called through the various irq_chip drivers.

I think we really need a way to communicate down the disable_irq
path that we don't need to flush MMIO writes instead.


* Re: [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling
  2019-11-27 17:58 ` [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling Keith Busch
@ 2019-11-28  7:19   ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-11-28  7:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, hch

On Thu, Nov 28, 2019 at 02:58:23AM +0900, Keith Busch wrote:
> The nvme irq thread, when enabled, may run for a while as new completions
> are submitted. These completions may also send MSI messages, which could
> be detected as spurious and disable the nvme irq.
> 
> Use the fast MSIx mask to keep the controller from sending MSIx
> interrupts while the nvme bottom half thread is running.

I think we should keep this together with patch 2, not just in the
patch series but also in the code - I'd rather have two little helpers
to enable/disable an irq with an if for the two cases and a good comment
right next to them than split the higher level functions.


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
  2019-11-28  2:46   ` Sagi Grimberg
@ 2019-11-28  7:22   ` Christoph Hellwig
  2019-11-29  9:13   ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-11-28  7:22 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, hch

I'd rather open code the loop as that simplifies a few things.


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
                   ` (3 preceding siblings ...)
  2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
@ 2019-11-28  7:50 ` Christoph Hellwig
  2019-11-28 17:59   ` Keith Busch
  2019-11-29  9:46 ` Sebastian Andrzej Siewior
  5 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2019-11-28  7:50 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, hch

FYI, this is how I'd imagine my comments to look like on top of your
tree, modulo the posted interrupt disabling part that will need
changes outside nvme.  If we want to be fancy we can split the irq
disable/enable into separate helpers.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e62fede7d4e4..1d6a222ddcc3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1042,50 +1042,45 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
-static void nvme_irq_spin(int irq, void *data)
-{
-	while (nvme_irq(irq, data) != IRQ_NONE)
-		cond_resched();
-}
-
 static irqreturn_t nvme_irq_thread(int irq, void *data)
-{
-	nvme_irq_spin(irq, data);
-	__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
-	return IRQ_HANDLED;
-}
-
-static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
+	u16 start, end;
 
-	if (nvme_cqe_pending(nvmeq)) {
-		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
-		return IRQ_WAKE_THREAD;
+	/*
+	 * The rmb/wmb pair ensures we see all updates from a previous run of
+	 * the irq thread, even if that was on another CPU.
+	 */
+	rmb();
+	for (;;) {
+		nvme_process_cq(nvmeq, &start, &end, -1);
+		nvmeq->last_cq_head = nvmeq->cq_head;
+		if (start == end)
+			break;
+		nvme_complete_cqes(nvmeq, start, end);
+		cond_resched();
 	}
-	return IRQ_NONE;
-}
-
-static irqreturn_t nvme_irq_thread_msi(int irq, void *data)
-{
-	struct nvme_queue *nvmeq = data;
-	struct nvme_dev *dev = nvmeq->dev;
+	wmb();
 
-	nvme_irq_spin(irq, data);
-	writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);
+	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
+	else
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t nvme_irq_check_msi(int irq, void *data)
+static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	struct nvme_dev *dev = nvmeq->dev;
 
-	if (nvme_cqe_pending(nvmeq)) {
-		writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);
-		return IRQ_WAKE_THREAD;
-	}
-	return IRQ_NONE;
+	if (!nvme_cqe_pending(nvmeq))
+		return IRQ_NONE;
+
+	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
+	else
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
+	return IRQ_WAKE_THREAD;
 }
 
 /*
@@ -1542,11 +1537,6 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 	int nr = nvmeq->dev->ctrl.instance;
 
 	if (use_threaded_interrupts) {
-		/* MSI and Legacy use the same NVMe IRQ masking */
-		if (!pdev->msix_enabled)
-			return pci_request_irq(pdev, nvmeq->cq_vector,
-				nvme_irq_check_msi, nvme_irq_thread_msi,
-				nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
 				nvme_irq_thread, nvmeq, "nvme%dq%d", nr,
 				nvmeq->qid);


* Re: [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler
  2019-11-28  4:14         ` Keith Busch
@ 2019-11-28  8:41           ` Ming Lei
  0 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2019-11-28  8:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, helgaas, Thomas Gleixner, hch

On Thu, Nov 28, 2019 at 01:14:04PM +0900, Keith Busch wrote:
> On Thu, Nov 28, 2019 at 11:58:53AM +0800, Ming Lei wrote:
> > On Thu, Nov 28, 2019 at 12:48:17PM +0900, Keith Busch wrote:
> > > On Thu, Nov 28, 2019 at 11:39:56AM +0800, Ming Lei wrote:
> > > > 923aa4c378f9 ("PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips"),
> > > > so the question is whether the interrupt mask is needed at all.
> > > 
> > > We don't want to use IRQF_ONESHOT for our MSI interrupts because that
> > > will write to the MSI mask config register, which is a costly non-posted
> > > transaction. The NVMe specific way uses much faster posted writes.
> > 
> > What I meant is that IRQF_ONESHOT isn't needed in case of IRQCHIP_ONESHOT_SAFE.
> > 
> > So it is reasonable to understand that interrupt mask isn't needed in the
> > hard interrupt handler in case of IRQCHIP_ONESHOT_SAFE. That is
> > basically what commit dc9b229a58dc("genirq: Allow irq chips to mark themself
> > oneshot safe") does.
> 
> Hmm, it doesn't look like it's always safe. We have to stop the device
> from generating MSIs for new completions somehow while the threaded
> handler is running, otherwise those MSIs will be considered spurious
> when the thread never gets a chance to increment desc->threads_handled.
> 

I just observed hard interrupts triggered between the start of nvme_irq_check()
and the end of nvme_irq(). Yeah, there can be at most 36 interrupts arriving
during that period on one machine, and most runs see <= 5.

So it looks like this patchset makes sense, and it also means that
IRQCHIP_ONESHOT_SAFE might be broken.


@interrupts_during_threaded:
[0, 1)           2074375 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[1, 2)           3668018 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 3)           1508944 |@@@@@@@@@@@@@@@@@@@@@                               |
[3, 4)            554496 |@@@@@@@                                             |
[4, 5)            225761 |@@@                                                 |
[5, 6)             99354 |@                                                   |
[6, 7)             45127 |                                                    |
[7, 8)             20912 |                                                    |
[8, 9)              9940 |                                                    |
[9, 10)             4765 |                                                    |
[10, 11)            2458 |                                                    |
[11, 12)            1365 |                                                    |
[12, 13)             719 |                                                    |
[13, 14)             451 |                                                    |
[14, 15)             265 |                                                    |
[15, 16)             168 |                                                    |
[16, 17)             103 |                                                    |
[17, 18)              67 |                                                    |
[18, 19)              60 |                                                    |
[19, 20)              41 |                                                    |
[20, 21)              27 |                                                    |
[21, 22)              18 |                                                    |
[22, 23)              17 |                                                    |
[23, 24)               8 |                                                    |
[24, 25)               2 |                                                    |
[25, 26)               9 |                                                    |
[26, 27)               6 |                                                    |
[27, 28)               3 |                                                    |
[28, 29)               1 |                                                    |
[29, 30)               0 |                                                    |
[30, 31)               0 |                                                    |
[31, 32)               0 |                                                    |
[32, 33)               0 |                                                    |
[33, 34)               0 |                                                    |
[34, 35)               0 |                                                    |
[35, 36)               0 |                                                    |
[36, 37)               1 |                                                    |


Thanks,
Ming



* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-28  7:50 ` [PATCH 0/4] nvme: Threaded interrupt handling improvements Christoph Hellwig
@ 2019-11-28 17:59   ` Keith Busch
  2019-11-29  8:30     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-28 17:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bigeasy, helgaas, sagi, linux-nvme, ming.lei

On Thu, Nov 28, 2019 at 08:50:47AM +0100, Christoph Hellwig wrote:
> +	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> +		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
> +	else
> +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);

Oh, we know which branch this would take before we register the callback,
so smaller specialized functions don't need if/else checks. I even want
to remove the shadow doorbell checks for normal devices by giving each
different submission/completion callbacks too. They are individually
unmeasurable, but maybe they'll add up! :D


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-28 17:59   ` Keith Busch
@ 2019-11-29  8:30     ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-11-29  8:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, Christoph Hellwig

On Fri, Nov 29, 2019 at 02:59:04AM +0900, Keith Busch wrote:
> On Thu, Nov 28, 2019 at 08:50:47AM +0100, Christoph Hellwig wrote:
> > +	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> > +		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
> > +	else
> > +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
> 
> Oh, we know which branch this would take before we register the callback,
> so smaller specialized functions don't need if/else checks. I even want
> to remove the shadow doorbell checks for normal devices by giving each
> different submission/completion callbacks too. They are individually
> unmeasurable, but maybe they'll add up! :D

This branch is perfect fodder for the branch predictor.  I see absolutely
no reason to obfuscate the code to get rid of it.  We also actually remove
more branches by open coding nvme_irq for the threaded irq case than are
added here.


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
  2019-11-28  2:46   ` Sagi Grimberg
  2019-11-28  7:22   ` Christoph Hellwig
@ 2019-11-29  9:13   ` Sebastian Andrzej Siewior
  2019-11-30 18:10     ` Keith Busch
  2 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-29  9:13 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On 2019-11-28 02:58:24 [+0900], Keith Busch wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 571b33b69c5f..9ec0933eb120 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
>  	return ret;
>  }
>  
> +static void nvme_irq_spin(int irq, void *data)
> +{
> +	while (nvme_irq(irq, data) != IRQ_NONE)
> +		cond_resched();
> +}

That interrupt thread runs at SCHED_FIFO prio 50 by default. You will
not get anything with a lower priority running (including SCHED_OTHER).
You won't get preempted by another FIFO thread at prio 50 so I *think*
that cond_resched() won't let you schedule another task/IRQ thread
running at prio 50 either.

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
                   ` (4 preceding siblings ...)
  2019-11-28  7:50 ` [PATCH 0/4] nvme: Threaded interrupt handling improvements Christoph Hellwig
@ 2019-11-29  9:46 ` Sebastian Andrzej Siewior
  2019-11-29 16:27   ` Keith Busch
  5 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-29  9:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On 2019-11-28 02:58:20 [+0900], Keith Busch wrote:
> Threaded interrupts allow the device to continue sending interrupt
> messages while the driver is handling the previous notification. This
> can cause a significant number of CPU cycles to be spent unnecessarily
> in hard irq context, and can trigger spurious interrupt detection,
> disabling the nvme interrupt.

Thank you for looking into this.
To be clear: the "spurious interrupt detector" won't do a thing if you
never return IRQ_NONE. As long as you return IRQ_HANDLED, everything is
fine.

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-29  9:46 ` Sebastian Andrzej Siewior
@ 2019-11-29 16:27   ` Keith Busch
  2019-11-29 17:05     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-29 16:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On Fri, Nov 29, 2019 at 10:46:40AM +0100, Sebastian Andrzej Siewior wrote:
> On 2019-11-28 02:58:20 [+0900], Keith Busch wrote:
> > Threaded interrupts allow the device to continue sending interrupt
> > messages while the driver is handling the previous notification. This
> > can cause a significant number of CPU cycles to be spent unnecessarily
> > in hard irq context, and can trigger spurious interrupt detection,
> > disabling the nvme interrupt.
> 
> Thank you for looking into this.
> To be clear: the "spurious interrupt detector" won't do a thing if you
> never return IRQ_NONE. As long as you return IRQ_HANDLED, everything is
> fine.

That's not entirely accurate. We have to return IRQ_WAKE_THREAD
from the hardirq handler, which gets converted to IRQ_NONE and counted
as spurious if desc->threads_handled hasn't changed. If
the threaded handler runs sufficiently long, desc->threads_handled
won't get updated frequently enough, so this series fixes that for
nvme by masking the interrupt in the device, preventing future hard
irq callbacks.
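
For reference, the accounting in question looks roughly like this
(paraphrased from note_interrupt() in kernel/irq/spurious.c):

	/* simplified: an IRQ_WAKE_THREAD return is judged by whether
	 * the irq thread has made progress since the last hard irq */
	if (action_ret == IRQ_WAKE_THREAD) {
		int handled = atomic_read(&desc->threads_handled);

		if (handled != desc->threads_handled_last)
			desc->threads_handled_last = handled;
		else
			action_ret = IRQ_NONE;	/* counts as unhandled */
	}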


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-29 16:27   ` Keith Busch
@ 2019-11-29 17:05     ` Sebastian Andrzej Siewior
  2019-11-30 17:02       ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-29 17:05 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On 2019-11-29 09:27:19 [-0700], Keith Busch wrote:
> On Fri, Nov 29, 2019 at 10:46:40AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2019-11-28 02:58:20 [+0900], Keith Busch wrote:
> > > Threaded interrupts allow the device to continue sending interrupt
> > > messages while the driver is handling the previous notification. This
> > > can cause a significant number of CPU cycles to be spent unnecessarily
> > > in hard irq context, and can trigger spurious interrupt detection,
> > > disabling the nvme interrupt.
> > 
> > Thank you for looking into this.
> > To be clear: the "spurious interrupt detector" won't do a thing if you
> > never return IRQ_NONE. As long as you return IRQ_HANDLED, everything is
> > fine.
> 
> That's not entirely accurate. We have to return IRQ_WAKE_THREAD
> from the hardirq handler, which gets converted to IRQ_NONE and counted
> as spurious if desc->threads_handled hasn't changed. If
> the threaded handler runs sufficiently long, desc->threads_handled
> won't get updated frequently enough, so this series fixes that for
> nvme by masking the interrupt in the device, preventing future hard
> irq callbacks.

Ach okay. But I wouldn't consider it a "bug" if your threaded handler
returns IRQ_HANDLED and requires a longer period of time. This seems
fine to me.
The SDHCI driver has a case where the interrupt thread waits for another
interrupt (from its primary handler) in order to make progress.

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-29 17:05     ` Sebastian Andrzej Siewior
@ 2019-11-30 17:02       ` Keith Busch
  2019-12-02 17:05         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-11-30 17:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On Fri, Nov 29, 2019 at 06:05:45PM +0100, Sebastian Andrzej Siewior wrote:
> Ach okay. But I wouldn't consider it as a "bug" if your threaded handler
> returns IRQ_HANDLED and requires a longer period of time. Thsi seems
> fine to me.
> The SDHCI driver has a case where the interrupt thread waits for another
> interrupt (from its primary handler) in order to make progress.

Yes, that uses a driver-specific way to handle it. This patch series
provides nvme-specific handling for long-running irq threads.


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-29  9:13   ` Sebastian Andrzej Siewior
@ 2019-11-30 18:10     ` Keith Busch
  2019-12-02  1:10       ` Ming Lei
  2019-12-02 16:51       ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 37+ messages in thread
From: Keith Busch @ 2019-11-30 18:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On Fri, Nov 29, 2019 at 10:13:02AM +0100, Sebastian Andrzej Siewior wrote:
> On 2019-11-28 02:58:24 [+0900], Keith Busch wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 571b33b69c5f..9ec0933eb120 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
> >  	return ret;
> >  }
> >  
> > +static void nvme_irq_spin(int irq, void *data)
> > +{
> > +	while (nvme_irq(irq, data) != IRQ_NONE)
> > +		cond_resched();
> > +}
> 
> That interrupt thread runs at SCHED_FIFO prio 50 by default. You will
> not get anything with a lower priority running (including SCHED_OTHER).
> You won't get preempted by another FIFO thread at prio 50 so I *think*
> that cond_resched() won't let you schedule another task/IRQ thread
> running at prio 50 either.

Hm, if we're really spinning here, the current alternative is that
we'd run a cpu 100% in irq context, which has its own problems. If the
interrupt thread has other scheduler issues, I think that indicates
yet another task needs to handle completions. Perhaps escalate to the
irq_poll solution Sagi advocated for if the threaded handler observes
need_resched() is true.
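
Something like the following could work (sketch only; it assumes a
hypothetical irq_poll instance "iop" added to struct nvme_queue, which
is not part of this series):

	static irqreturn_t nvme_irq_thread(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;

		while (nvme_irq(irq, data) != IRQ_NONE) {
			if (need_resched()) {
				/* hand remaining work to softirq; the
				 * poll callback would unmask when done */
				irq_poll_sched(&nvmeq->iop);
				return IRQ_HANDLED;
			}
			cond_resched();
		}
		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
		return IRQ_HANDLED;
	}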


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-30 18:10     ` Keith Busch
@ 2019-12-02  1:10       ` Ming Lei
  2019-12-02  1:30         ` Keith Busch
  2019-12-02 16:51       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 37+ messages in thread
From: Ming Lei @ 2019-12-02  1:10 UTC (permalink / raw)
  To: Keith Busch; +Cc: Sebastian Andrzej Siewior, helgaas, hch, linux-nvme, sagi

On Sun, Dec 01, 2019 at 03:10:20AM +0900, Keith Busch wrote:
> On Fri, Nov 29, 2019 at 10:13:02AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2019-11-28 02:58:24 [+0900], Keith Busch wrote:
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 571b33b69c5f..9ec0933eb120 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
> > >  	return ret;
> > >  }
> > >  
> > > +static void nvme_irq_spin(int irq, void *data)
> > > +{
> > > +	while (nvme_irq(irq, data) != IRQ_NONE)
> > > +		cond_resched();
> > > +}
> > 
> > That interrupt thread runs at SCHED_FIFO prio 50 by default. You will
> > not get anything with a lower priority running (including SCHED_OTHER).
> > You won't get preempted by another FIFO thread at prio 50, so I *think*
> > that cond_resched() won't let you schedule another task/IRQ thread
> > running at prio 50 either.
> 
> Hm, if we're really spinning here, the current alternative is that we'd
> run a CPU at 100% in hard irq context, which has its own problems. If
> the interrupt thread has other scheduler issues, I think that indicates
> yet another task needs to handle completions. Perhaps escalate to the
> irq_poll solution Sagi advocated for if the threaded handler observes
> that need_resched() is true.

Even without this issue, threaded irq has other issues; for example, IO
latency increases noticeably.

Thanks,
Ming



* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-12-02  1:10       ` Ming Lei
@ 2019-12-02  1:30         ` Keith Busch
  0 siblings, 0 replies; 37+ messages in thread
From: Keith Busch @ 2019-12-02  1:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: Sebastian Andrzej Siewior, helgaas, hch, linux-nvme, sagi

On Mon, Dec 02, 2019 at 09:10:31AM +0800, Ming Lei wrote:
> 
> Even without this issue, threaded irq has other issues; for example,
> IO latency increases noticeably.

Sure, I was working on something for this. I think we can have the
primary handler complete the CQ once and return IRQ_HANDLED if the
queue is empty afterward. If new completions are still pending, it can
return IRQ_WAKE_THREAD instead. This should also allow us to remove the
nvme.use_threaded_interrupts parameter since this should get the best
of both worlds.
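
Something like this untested sketch, reworking the existing
nvme_irq_check() primary handler (nvme_cqe_pending() is the existing
helper that peeks at the CQ head):

static irqreturn_t nvme_irq_check(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	irqreturn_t ret;

	/* complete everything currently in the CQ, once, in hard irq */
	ret = nvme_irq(irq, data);

	/* new completions arrived while we were reaping: punt to thread */
	if (nvme_cqe_pending(nvmeq))
		return IRQ_WAKE_THREAD;

	return ret;
}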


* Re: [PATCH 4/4] nvme/pci: Spin threaded interrupt completions
  2019-11-30 18:10     ` Keith Busch
  2019-12-02  1:10       ` Ming Lei
@ 2019-12-02 16:51       ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-02 16:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On 2019-12-01 03:10:20 [+0900], Keith Busch wrote:
> On Fri, Nov 29, 2019 at 10:13:02AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2019-11-28 02:58:24 [+0900], Keith Busch wrote:
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 571b33b69c5f..9ec0933eb120 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1042,9 +1042,15 @@ static irqreturn_t nvme_irq(int irq, void *data)
> > >  	return ret;
> > >  }
> > >  
> > > +static void nvme_irq_spin(int irq, void *data)
> > > +{
> > > +	while (nvme_irq(irq, data) != IRQ_NONE)
> > > +		cond_resched();
> > > +}
> > 
> > That interrupt thread runs at SCHED_FIFO prio 50 by default. You will
> > not get anything with a lower priority running (including SCHED_OTHER).
> > You won't get preempted by another FIFO thread at prio 50, so I *think*
> > that cond_resched() won't let you schedule another task/IRQ thread
> > running at prio 50 either.
> 
> Hm, if we're really spinning here, the current alternative is that we'd
> run a CPU at 100% in hard irq context, which has its own problems. If
> the interrupt thread has other scheduler issues, I think that indicates
> yet another task needs to handle completions. Perhaps escalate to the
> irq_poll solution Sagi advocated for if the threaded handler observes
> that need_resched() is true.

I'm not against using a threaded IRQ but pointing out that the
cond_resched() usage here is wrong.  For SCHED_FIFO tasks, the scheduler
will set the TIF_NEED_RESCHED flag and send an IPI for a reschedule to
that CPU. This only happens if the priority of the other task is higher
than the default 50 (and all interrupt threads run by default at
FIFO/50).

A SCHED_OTHER task on a PREEMPT_NONE/PREEMPT_VOLUNTARY kernel won't
preempt another SCHED_OTHER task inside the kernel. The scheduler will
only set the TIF_NEED_RESCHED flag to signal that another task may run.
This cond_resched() would then act as a preemption point in that case.
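
For reference, the FIFO/50 default comes from the irq core;
setup_irq_thread() in kernel/irq/manage.c does roughly the following
(paraphrased from memory, not a verbatim quote):

static int setup_irq_thread(struct irqaction *new, unsigned int irq)
{
	struct sched_param param = {
		.sched_priority = MAX_USER_RT_PRIO / 2,	/* = 50 */
	};
	struct task_struct *t;

	t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
	if (IS_ERR(t))
		return PTR_ERR(t);

	/* every threaded handler starts as a SCHED_FIFO prio-50 task */
	sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
	new->thread = get_task_struct(t);
	return 0;
}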

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-11-30 17:02       ` Keith Busch
@ 2019-12-02 17:05         ` Sebastian Andrzej Siewior
  2019-12-02 17:12           ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-02 17:05 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, helgaas, hch, linux-nvme, sagi

On 2019-12-01 02:02:22 [+0900], Keith Busch wrote:
> On Fri, Nov 29, 2019 at 06:05:45PM +0100, Sebastian Andrzej Siewior wrote:
> > Ach okay. But I wouldn't consider it a "bug" if your threaded handler
> > returns IRQ_HANDLED and requires a longer period of time. This seems
> > fine to me.
> > The SDHCI driver has a case where the interrupt thread waits for another
> > interrupt (from its primary handler) in order to make progress.
> 
> Yes, that uses a driver-specific way to handle it. This patch series
> provides nvme-specific handling for long-running irq threads.

That might be a misunderstanding. I think if your threaded-IRQ handler
is running legitimately for a longer period of time (and making
progress) and the IRQ core's "nobody cared" detector shuts it down,
then the detector might need a tweak.
The worst thing that could happen is that the RT tasks run for too long
and the scheduler throttles them to protect against runaway tasks (the
default limit is 950ms of RT task time within 1 second; see
sched_rt_runtime_us).

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-12-02 17:05         ` Sebastian Andrzej Siewior
@ 2019-12-02 17:12           ` Christoph Hellwig
  2019-12-02 18:06             ` Keith Busch
  2019-12-02 19:57             ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-12-02 17:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: sagi, linux-nvme, ming.lei, helgaas, Keith Busch, tglx, hch

On Mon, Dec 02, 2019 at 06:05:38PM +0100, Sebastian Andrzej Siewior wrote:
> That might be a misunderstanding. I think if your threaded-IRQ handler
> is running legitimately for a longer period of time (and making
> progress) and the IRQ core's "nobody cared" detector shuts it down,
> then the detector might need a tweak.
> The worst thing that could happen is that the RT tasks run for too long
> and the scheduler throttles them to protect against runaway tasks (the
> default limit is 950ms of RT task time within 1 second; see
> sched_rt_runtime_us).

The problem is that by doing the aggressive polling we can keep one
CPU busy just running the irq handler and starve processes on that
CPU if an NVMe queue serves multiple CPUs.

That's why I had the previous idea of one irq thread per cpu that
is assigned to the irq.  We'd have to encode a relative index into
the hardirq handler return value which we get from bits encoded in
the NVMe command ID, but that should be doable.  At that point we
shouldn't need the cond_resched.  I can try to hack that up, but
I'm not an expert on the irq thread code.
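
For the command ID encoding I was thinking of something like this (bit
widths and helper names made up for illustration; 10 tag bits leave 6
bits for a relative CPU index in the 16-bit CID):

#define NVME_CID_TAG_BITS	10
#define NVME_CID_TAG_MASK	((1 << NVME_CID_TAG_BITS) - 1)

/* pack a relative CPU index into the high bits of the 16-bit CID */
static inline u16 nvme_encode_cid(u16 tag, unsigned int cpu_idx)
{
	return (cpu_idx << NVME_CID_TAG_BITS) | (tag & NVME_CID_TAG_MASK);
}

static inline unsigned int nvme_cid_to_cpu_idx(u16 cid)
{
	return cid >> NVME_CID_TAG_BITS;
}

static inline u16 nvme_cid_to_tag(u16 cid)
{
	return cid & NVME_CID_TAG_MASK;
}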


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-12-02 17:12           ` Christoph Hellwig
@ 2019-12-02 18:06             ` Keith Busch
  2019-12-03  7:40               ` Christoph Hellwig
  2019-12-02 19:57             ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 37+ messages in thread
From: Keith Busch @ 2019-12-02 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: sagi, Sebastian Andrzej Siewior, linux-nvme, ming.lei, helgaas, tglx

On Mon, Dec 02, 2019 at 06:12:39PM +0100, Christoph Hellwig wrote:
> That's why I had the previous idea of one irq thread per cpu that
> is assigned to the irq.  We'd have to encode a relative index into
> the hardirq handler return value which we get from bits encoded in
> the NVMe command ID, but that should be doable.  At that point we
> shouldn't need the cond_resched.  I can try to hack that up, but
> I'm not an expert on the irq thread code.

I'm curious how you intend to implement this. We can't have two threads
operating on the same CQ at the same time since they have to reap the
CQ sequentially, so the threads can't selectively choose which entries
they handle in a queue with mixed encoded CPUs.

Perhaps we can have just one completion thread call
smp_call_function_single_async() with the encoded CPU?

But sadly, I recall we've observed broken controllers misbehave when a
command id exceeds the queue depth, and encoding CPUs in the command id
would do that. Hardware ruins our purity...


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-12-02 17:12           ` Christoph Hellwig
  2019-12-02 18:06             ` Keith Busch
@ 2019-12-02 19:57             ` Sebastian Andrzej Siewior
  2019-12-03  7:42               ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-02 19:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: sagi, linux-nvme, ming.lei, helgaas, Keith Busch, tglx

On 2019-12-02 18:12:39 [+0100], Christoph Hellwig wrote:
> On Mon, Dec 02, 2019 at 06:05:38PM +0100, Sebastian Andrzej Siewior wrote:
> > That might be a misunderstanding. I think if your threaded-IRQ handler
> > is running legitimately for a longer period of time (and making
> > progress) and the IRQ core's "nobody cared" detector shuts it down,
> > then the detector might need a tweak.
> > The worst thing that could happen is that the RT tasks run for too long
> > and the scheduler throttles them to protect against runaway tasks (the
> > default limit is 950ms of RT task time within 1 second; see
> > sched_rt_runtime_us).
> 
> The problem is that by doing the aggressive polling we can keep one
> CPU busy just running the irq handler and starve processes on that
> CPU if an NVMe queue serves multiple CPUs.

and this is bad? The scheduler will move everything to other CPUs unless
it is pinned to this CPU. You can offload even RCU these days :)
Performance-wise it might be better to dedicate one CPU to this work
instead of spreading it over four CPUs, each doing a fraction of it and
touching the same cache lines, which bounce from one CPU to the next.
 
> That's why I had the previous idea of one irq thread per cpu that
> is assigned to the irq.  We'd have to encode a relative index into
> the hardirq handler return value which we get from bits encoded in
> the NVMe command ID, but that should be doable.  At that point we
> shouldn't need the cond_resched.  I can try to hack that up, but
> I'm not an expert on the irq thread code.

there is always just one IRQ thread per interrupt.
You could start a kthread_create_worker_on_cpu() worker on multiple CPUs
and feed them work from your interrupt. And if you make them SCHED_FIFO,
you should be able to run your completions on multiple CPUs from one
interrupt.
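
A rough sketch of that setup (untested, error unwinding omitted, all
names made up; a per-CPU variable would be nicer than the flat array):

#include <linux/kthread.h>
#include <uapi/linux/sched/types.h>

static struct kthread_worker *nvme_cq_worker[NR_CPUS];

static int nvme_start_cq_workers(void)
{
	struct sched_param param = { .sched_priority = 50 };
	unsigned int cpu;

	for_each_online_cpu(cpu) {
		struct kthread_worker *w;

		w = kthread_create_worker_on_cpu(cpu, 0, "nvme_cq/%u", cpu);
		if (IS_ERR(w))
			return PTR_ERR(w);
		/* run the completion work at the same prio as irq threads */
		sched_setscheduler(w->task, SCHED_FIFO, &param);
		nvme_cq_worker[cpu] = w;
	}
	return 0;
}

/* from the interrupt: kick a pre-initialized kthread_work on that CPU */
static void nvme_queue_completion_work(unsigned int cpu,
				       struct kthread_work *work)
{
	kthread_queue_work(nvme_cq_worker[cpu], work);
}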

Sebastian


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-12-02 18:06             ` Keith Busch
@ 2019-12-03  7:40               ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-12-03  7:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: sagi, Sebastian Andrzej Siewior, linux-nvme, ming.lei, helgaas,
	tglx, Christoph Hellwig

On Tue, Dec 03, 2019 at 03:06:59AM +0900, Keith Busch wrote:
> On Mon, Dec 02, 2019 at 06:12:39PM +0100, Christoph Hellwig wrote:
> > That's why I had the previous idea of one irq thread per cpu that
> > is assigned to the irq.  We'd have to encode a relative index into
> > the hardirq handler return value which we get from bits encoded in
> > the NVMe command ID, but that should be doable.  At that point we
> > shouldn't need the cond_resched.  I can try to hack that up, but
> > I'm not an expert on the irq thread code.
> 
> I'm curious how you intend to implement this. We can't have two threads
> operating on the same CQ at the same time since they have to reap the
> CQ sequentially, so the threads can't selectively choose which entries
> they handle in a queue with mixed encoded CPUs.

True.

> Perhaps we can have just one completion thread call
> smp_call_function_single_async() with the encoded CPU?

Well, blk-mq can do just that for us from blk_mq_complete_request.
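
From memory, the relevant part of __blk_mq_complete_request() in
block/blk-mq.c is roughly this (paraphrased, not verbatim):

static void __blk_mq_complete_request(struct request *rq)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	struct request_queue *q = rq->q;
	bool shared = false;
	int cpu;

	cpu = get_cpu();
	if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
		shared = cpus_share_cache(cpu, ctx->cpu);

	if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
		/* bounce the completion to the submitting CPU */
		rq->csd.func = __blk_mq_complete_request_remote;
		rq->csd.info = rq;
		rq->csd.flags = 0;
		smp_call_function_single_async(ctx->cpu, &rq->csd);
	} else {
		q->mq_ops->complete(rq);
	}
	put_cpu();
}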

> But sadly, I recall we've observed broken controllers misbehave when a
> command id exceeds the queue depth, and encoding CPUs in the command id
> would do that. Hardware ruins our purity...

Sigh..


* Re: [PATCH 0/4] nvme: Threaded interrupt handling improvements
  2019-12-02 19:57             ` Sebastian Andrzej Siewior
@ 2019-12-03  7:42               ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2019-12-03  7:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: sagi, linux-nvme, ming.lei, helgaas, Keith Busch, tglx,
	Christoph Hellwig

On Mon, Dec 02, 2019 at 08:57:30PM +0100, Sebastian Andrzej Siewior wrote:
> > The problem is that by doing the aggressive polling we can keep one
> > CPU busy just running the irq handler and starve processes on that
> > CPU if an NVMe queue serves multiple CPUs.
> 
> and this is bad? The scheduler will move everything to other CPUs unless
> it is pinned to this CPU. You can offload even RCU these days :)
> Performance-wise it might be better to dedicate one CPU to this work
> instead of spreading it over four CPUs, each doing a fraction of it and
> touching the same cache lines, which bounce from one CPU to the next.

Ok, maybe things are getting better these days.  I remember we did need
the QUEUE_FLAG_SAME_FORCE flag back in the day to ensure I/O submitters
are properly throttled by the completions they receive.



Thread overview: 37+ messages
2019-11-27 17:58 [PATCH 0/4] nvme: Threaded interrupt handling improvements Keith Busch
2019-11-27 17:58 ` [PATCH 1/4] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
2019-11-28  2:42   ` Sagi Grimberg
2019-11-28  3:41     ` Keith Busch
2019-11-28  7:17   ` Christoph Hellwig
2019-11-27 17:58 ` [PATCH 2/4] nvme/pci: Mask legacy and MSI in threaded handler Keith Busch
2019-11-28  3:39   ` Ming Lei
2019-11-28  3:48     ` Keith Busch
2019-11-28  3:58       ` Ming Lei
2019-11-28  4:14         ` Keith Busch
2019-11-28  8:41           ` Ming Lei
2019-11-27 17:58 ` [PATCH 3/4] nvme/pci: Mask MSIx interrupts for threaded handling Keith Busch
2019-11-28  7:19   ` Christoph Hellwig
2019-11-27 17:58 ` [PATCH 4/4] nvme/pci: Spin threaded interrupt completions Keith Busch
2019-11-28  2:46   ` Sagi Grimberg
2019-11-28  3:28     ` Keith Busch
2019-11-28  3:51       ` Ming Lei
2019-11-28  3:58         ` Keith Busch
2019-11-28  7:22   ` Christoph Hellwig
2019-11-29  9:13   ` Sebastian Andrzej Siewior
2019-11-30 18:10     ` Keith Busch
2019-12-02  1:10       ` Ming Lei
2019-12-02  1:30         ` Keith Busch
2019-12-02 16:51       ` Sebastian Andrzej Siewior
2019-11-28  7:50 ` [PATCH 0/4] nvme: Threaded interrupt handling improvements Christoph Hellwig
2019-11-28 17:59   ` Keith Busch
2019-11-29  8:30     ` Christoph Hellwig
2019-11-29  9:46 ` Sebastian Andrzej Siewior
2019-11-29 16:27   ` Keith Busch
2019-11-29 17:05     ` Sebastian Andrzej Siewior
2019-11-30 17:02       ` Keith Busch
2019-12-02 17:05         ` Sebastian Andrzej Siewior
2019-12-02 17:12           ` Christoph Hellwig
2019-12-02 18:06             ` Keith Busch
2019-12-03  7:40               ` Christoph Hellwig
2019-12-02 19:57             ` Sebastian Andrzej Siewior
2019-12-03  7:42               ` Christoph Hellwig
