* [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-02 22:22 [RFC PATCH 0/3] nvme threaded interrupt improvements Keith Busch
@ 2019-12-02 22:22 ` Keith Busch
2019-12-03 7:50 ` Christoph Hellwig
2019-12-03 10:09 ` Sebastian Andrzej Siewior
2019-12-02 22:22 ` [RFC PATCH 2/3] nvme/pci: Remove use_threaded_interrupts parameter Keith Busch
2019-12-02 22:22 ` [RFC PATCH 3/3] nvme/pci: Poll for new completions in irq thread Keith Busch
2 siblings, 2 replies; 11+ messages in thread
From: Keith Busch @ 2019-12-02 22:22 UTC (permalink / raw)
To: linux-nvme; +Cc: Keith Busch, bigeasy, ming.lei, hch, sagi
The nvme threaded interrupt handler reduces CPU time spent in hard irq
context, but waking it increases latency for low queue depth workloads.
Poll the completion queue once from the primary handler and wake the
thread only if more completions remain after. Since there is a window
of time where the threaded and primary handlers may run simultaneously,
add a new nvmeq flag so that the two can synchronize which owns processing
the queue.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4022a872d29c..c811c0984fe0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -186,6 +186,7 @@ struct nvme_queue {
#define NVMEQ_SQ_CMB 1
#define NVMEQ_DELETE_ERROR 2
#define NVMEQ_POLLED 3
+#define NVMEQ_THREAD 4
u32 *dbbuf_sq_db;
u32 *dbbuf_cq_db;
u32 *dbbuf_sq_ei;
@@ -1047,15 +1048,27 @@ static irqreturn_t nvme_irq_thread(int irq, void *data)
__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
else
writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);
+ clear_bit(NVMEQ_THREAD, &nvmeq->flags);
return IRQ_HANDLED;
}
static irqreturn_t nvme_irq_check(int irq, void *data)
{
struct nvme_queue *nvmeq = data;
+ irqreturn_t ret;
if (!nvme_cqe_pending(nvmeq))
return IRQ_NONE;
+
+ if (test_and_set_bit(NVMEQ_THREAD, &nvmeq->flags))
+ return IRQ_NONE;
+
+ ret = nvme_irq(irq, data);
+ if (!nvme_cqe_pending(nvmeq)) {
+ clear_bit(NVMEQ_THREAD, &nvmeq->flags);
+ return ret;
+ }
+
if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
else
--
2.21.0
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-02 22:22 ` [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler Keith Busch
@ 2019-12-03 7:50 ` Christoph Hellwig
2019-12-03 9:39 ` Sebastian Andrzej Siewior
2019-12-03 11:50 ` Keith Busch
2019-12-03 10:09 ` Sebastian Andrzej Siewior
1 sibling, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2019-12-03 7:50 UTC (permalink / raw)
To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, tglx, hch
On Tue, Dec 03, 2019 at 07:22:04AM +0900, Keith Busch wrote:
> The nvme threaded interrupt handler reduces CPU time spent in hard irq
> context, but waking it increases latency for low queue depth workloads.
>
> Poll the completion queue once from the primary handler and wake the
> thread only if more completions remain after.
How is this going to work with -rt, which wants to run all actual
interrupt work in the irq thread?
> Since there is a window
> of time where the threaded and primary handlers may run simultaneously,
> add a new nvmeq flag so that the two can synchronize which owns processing
> the queue.
If the above scheme is something that the irq subsystem maintainers are
fine with I think we need to lift that synchronization into the core
irq code instead of open coding it in drivers.
But I'd love to understand what kind of performance this split scheme
gives us to start with. Do you have any numbers?
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-03 7:50 ` Christoph Hellwig
@ 2019-12-03 9:39 ` Sebastian Andrzej Siewior
2019-12-03 11:50 ` Keith Busch
1 sibling, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-03 9:39 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, tglx, sagi, linux-nvme, ming.lei
On 2019-12-03 08:50:46 [+0100], Christoph Hellwig wrote:
> On Tue, Dec 03, 2019 at 07:22:04AM +0900, Keith Busch wrote:
> > The nvme threaded interrupt handler reduces CPU time spent in hard irq
> > context, but waking it increases latency for low queue depth workloads.
> >
> > Poll the completion queue once from the primary handler and wake the
> > thread only if more completions remain after.
>
> How is this going to work with -rt, which wants to run all actual
> interrupt work in the irq thread?
It will thread both handlers.
Sebastian
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-03 7:50 ` Christoph Hellwig
2019-12-03 9:39 ` Sebastian Andrzej Siewior
@ 2019-12-03 11:50 ` Keith Busch
1 sibling, 0 replies; 11+ messages in thread
From: Keith Busch @ 2019-12-03 11:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: bigeasy, tglx, sagi, linux-nvme, ming.lei
On Tue, Dec 03, 2019 at 08:50:46AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 03, 2019 at 07:22:04AM +0900, Keith Busch wrote:
> > The nvme threaded interrupt handler reduces CPU time spent in hard irq
> > context, but waking it increases latency for low queue depth workloads.
> >
> > Poll the completion queue once from the primary handler and wake the
> > thread only if more completions remain after.
>
> How is this going to work with -rt, which wants to run all actual
> interrupt work in the irq thread?
That just has both the primary and the bottom half run as threads. This
patch simply avoids a context switch for shallow workloads, whether you
use -rt or not.
> > Since there is a window
> > of time where the threaded and primary handlers may run simultaneously,
> > add a new nvmeq flag so that the two can synchronize which owns processing
> > the queue.
>
> If the above scheme is something that the irq subsystem maintainers are
> fine with I think we need to lift that synchronization into the core
> irq code instead of open coding it in drivers.
It's not generally applicable to all drivers, so it'd have to be
an opt-in. I think IRQF_ONESHOT pretty much captures the same idea,
though.
> But I'd love to understand what kind of performance this split scheme
> gives us to start with. Do you have any numbers?
I do have a little bit of data. On a queue-depth 1 test, returning
IRQ_WAKE_THREAD adds just under 1usec latency compared to completing
in the primary handler. That's about 10% for some of the faster
devices out there.
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-02 22:22 ` [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler Keith Busch
2019-12-03 7:50 ` Christoph Hellwig
@ 2019-12-03 10:09 ` Sebastian Andrzej Siewior
2019-12-03 11:16 ` Keith Busch
1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-03 10:09 UTC (permalink / raw)
To: Keith Busch; +Cc: ming.lei, hch, linux-nvme, sagi
On 2019-12-03 07:22:04 [+0900], Keith Busch wrote:
> The nvme threaded interrupt handler reduces CPU time spent in hard irq
> context, but waking it increases latency for low queue depth workloads.
>
> Poll the completion queue once from the primary handler and wake the
> thread only if more completions remain after. Since there is a window
> of time where the threaded and primary handlers may run simultaneously,
> add a new nvmeq flag so that the two can synchronize which owns processing
> the queue.
It depends on what you mean by "run simultaneously" but it sounds like
this does not happen.
The primary handler disables the interrupt source and returns
IRQ_WAKE_THREAD. From now on, the primary handler won't fire (unless it
is a shared handler and someone else gets an interrupt).
The thread receives a wakeup and will soon be running on the CPU.
Once the threaded handler completes its work it enables the interrupt
source as the last thing it does. This may lead to an interrupt before
the threaded handler returns to the caller.
The primary handler may return IRQ_WAKE_THREAD. In this case the
threaded handler will be invoked again.
If it wouldn't then this could happen:
primary handler                              threaded handler
nvme_irq_check()
  if (test_and_set_bit(NVMEQ_THREAD, &nvmeq->flags))
  ret = nvme_irq(irq, data);
  if (!nvme_cqe_pending(nvmeq))
  if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
    __pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
  return IRQ_WAKE_THREAD;
                                             nvme_irq_thread()
                                               nvme_irq()
                                               if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
                                                 __pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
nvme_irq_check()
  if (test_and_set_bit(NVMEQ_THREAD, &nvmeq->flags))
    return IRQ_NONE;
                                               clear_bit(NVMEQ_THREAD, &nvmeq->flags);
                                               return IRQ_HANDLED
> Signed-off-by: Keith Busch <kbusch@kernel.org>
Sebastian
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-03 10:09 ` Sebastian Andrzej Siewior
@ 2019-12-03 11:16 ` Keith Busch
2019-12-04 10:21 ` Ming Lei
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2019-12-03 11:16 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: ming.lei, hch, linux-nvme, sagi
On Tue, Dec 03, 2019 at 11:09:30AM +0100, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 07:22:04 [+0900], Keith Busch wrote:
> > The nvme threaded interrupt handler reduces CPU time spent in hard irq
> > context, but waking it increases latency for low queue depth workloads.
> >
> > Poll the completion queue once from the primary handler and wake the
> > thread only if more completions remain after. Since there is a window
> > of time where the threaded and primary handlers may run simultaneously,
> > add a new nvmeq flag so that the two can synchronize which owns processing
> > the queue.
>
> It depends on what you mean by "run simultaneously" but it sounds like
> this does not happen.
>
> The primary handler disables the interrupt source and returns
> IRQ_WAKE_THREAD. From now on, the primary handler won't fire (unless it
> is a shared handler and someone else gets an interrupt).
The driver won't share these interrupts, despite some weird PCI
host bridges that force sharing among other devices (e.g. see the
only user of handle_untracked_irq). That isn't what I was considering,
though.
It's true the controller won't send new MSIs once masked, but my
concern is for MSIs on the wire that pass that MMIO mask write.
Those retrigger the primary handler after it previously returned
IRQ_WAKE_THREAD.
* Re: [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler
2019-12-03 11:16 ` Keith Busch
@ 2019-12-04 10:21 ` Ming Lei
0 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2019-12-04 10:21 UTC (permalink / raw)
To: Keith Busch; +Cc: Sebastian Andrzej Siewior, hch, linux-nvme, sagi
On Tue, Dec 03, 2019 at 04:16:26AM -0700, Keith Busch wrote:
> On Tue, Dec 03, 2019 at 11:09:30AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2019-12-03 07:22:04 [+0900], Keith Busch wrote:
> > > The nvme threaded interrupt handler reduces CPU time spent in hard irq
> > > context, but waking it increases latency for low queue depth workloads.
> > >
> > > Poll the completion queue once from the primary handler and wake the
> > > thread only if more completions remain after. Since there is a window
> > > of time where the threaded and primary handlers may run simultaneously,
> > > add a new nvmeq flag so that the two can synchronize which owns processing
> > > the queue.
> >
> > It depends on what you mean by "run simultaneously" but it sounds like
> > this does not happen.
> >
> > The primary handler disables the interrupt source and returns
> > IRQ_WAKE_THREAD. From now on, the primary handler won't fire (unless it
> > is a shared handler and someone else gets an interrupt).
>
> The driver won't share these interrupts, despite some wierd pci
> host bridges that force sharing among other devices (ex: see the
> only user of handle_untracked_irq). That isn't what I was considering
> though.
>
> It's true the controller won't send new MSIs once masked, but my
> concern is for MSIs on the wire that pass that MMIO mask write.
Could you explain a bit what 'MSIs on the wire' means? And where do
they come from?
Thanks,
Ming
* [RFC PATCH 2/3] nvme/pci: Remove use_threaded_interrupts parameter
2019-12-02 22:22 [RFC PATCH 0/3] nvme threaded interrupt improvements Keith Busch
2019-12-02 22:22 ` [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler Keith Busch
@ 2019-12-02 22:22 ` Keith Busch
2019-12-02 22:22 ` [RFC PATCH 3/3] nvme/pci: Poll for new completions in irq thread Keith Busch
2 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2019-12-02 22:22 UTC (permalink / raw)
To: linux-nvme; +Cc: Keith Busch, bigeasy, ming.lei, hch, sagi
The non-threaded nvme completion handler provides no additional benefit
over the threaded one. Remove the option.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/pci.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c811c0984fe0..634c96bafb70 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -42,9 +42,6 @@
#define NVME_MAX_KB_SZ 4096
#define NVME_MAX_SEGS 127
-static int use_threaded_interrupts;
-module_param(use_threaded_interrupts, int, 0);
-
static bool use_cmb_sqes = true;
module_param(use_cmb_sqes, bool, 0444);
MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");
@@ -1529,14 +1526,8 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
int nr = nvmeq->dev->ctrl.instance;
- if (use_threaded_interrupts) {
- return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
- nvme_irq_thread, nvmeq, "nvme%dq%d", nr,
- nvmeq->qid);
- } else {
- return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
- NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
- }
+ return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
+ nvme_irq_thread, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
}
static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
--
2.21.0
* [RFC PATCH 3/3] nvme/pci: Poll for new completions in irq thread
2019-12-02 22:22 [RFC PATCH 0/3] nvme threaded interrupt improvements Keith Busch
2019-12-02 22:22 ` [RFC PATCH 1/3] nvme/pci: Poll the cq in the primary irq handler Keith Busch
2019-12-02 22:22 ` [RFC PATCH 2/3] nvme/pci: Remove use_threaded_interrupts parameter Keith Busch
@ 2019-12-02 22:22 ` Keith Busch
2019-12-03 10:17 ` Sebastian Andrzej Siewior
2 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2019-12-02 22:22 UTC (permalink / raw)
To: linux-nvme; +Cc: Keith Busch, bigeasy, ming.lei, hch, sagi
A controller may post new completions while the irq thread is handling
previously seen completions. Have the irq thread poll for these as long
as new completions are available. This improves bandwidth and reduces
CPU time spent in irq context.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/pci.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 634c96bafb70..44d8f701dce8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1040,7 +1040,9 @@ static irqreturn_t nvme_irq_thread(int irq, void *data)
{
struct nvme_queue *nvmeq = data;
- nvme_irq(irq, data);
+ while (nvme_irq(irq, data) != IRQ_NONE && !need_resched())
+ cpu_relax();
+
if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
else
--
2.21.0
* Re: [RFC PATCH 3/3] nvme/pci: Poll for new completions in irq thread
2019-12-02 22:22 ` [RFC PATCH 3/3] nvme/pci: Poll for new completions in irq thread Keith Busch
@ 2019-12-03 10:17 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-03 10:17 UTC (permalink / raw)
To: Keith Busch; +Cc: ming.lei, hch, linux-nvme, sagi
On 2019-12-03 07:22:06 [+0900], Keith Busch wrote:
> A controller may post new completions while the irq thread is handling
> previously seen completions. Have the irq thread poll for these as long
> as new completions are available. This improves bandwidth and reduces
> CPU time spent in irq context.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/nvme/host/pci.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 634c96bafb70..44d8f701dce8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1040,7 +1040,9 @@ static irqreturn_t nvme_irq_thread(int irq, void *data)
> {
> struct nvme_queue *nvmeq = data;
>
> - nvme_irq(irq, data);
> + while (nvme_irq(irq, data) != IRQ_NONE && !need_resched())
as explained earlier, the need_resched() usage here will not do what you
expect it to do.
> + cpu_relax();
Why cpu_relax()?
If you need to acquire a lock and spin, then cpu_relax() may give the
other hyperthread a higher priority (POWER) or remove all speculative
writes from the pipeline (x86). That is beneficial for the locking
case; none of it seems to be required here.
> +
> if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> __pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
> else
Sebastian