* [PATCHv2 0/2]
@ 2019-12-02 22:20 Keith Busch
  2019-12-02 22:20 ` [PATCHv2 1/2] PCI/MSI: Export __pci_msix_desc_mask_irq Keith Busch
  2019-12-02 22:20 ` [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers Keith Busch

From: Keith Busch @ 2019-12-02 22:20 UTC (permalink / raw)
To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, helgaas, Keith Busch, tglx, hch

When threaded interrupts are used, NVMe devices can continue sending
interrupt messages while the driver is still handling the previous
notification. This spends unnecessary CPU cycles in hard irq context and
can trigger spurious interrupt detection if the threaded handler runs
sufficiently long: returning IRQ_WAKE_THREAD from the primary handler
doesn't observe any changes to irq desc->threads_handled.

Use the appropriate masking for the interrupt type based on the NVMe
specification recommendations (see NVMe 1.4 section 7.5.1.1 for more
information) so that the primary handler is not unnecessarily triggered.

v1 -> v2:

  Don't split the nvme callbacks for MSI and MSI-X. Just check the type
  in the callback when selecting the masking method, and combine this in
  a single patch.

  Drop the more experimental patches. That series will be posted
  separately.

Keith Busch (2):
  PCI/MSI: Export __pci_msix_desc_mask_irq
  nvme/pci: Mask device interrupts for threaded handlers

 drivers/nvme/host/pci.c | 28 ++++++++++++++++++++++++----
 drivers/pci/msi.c       |  1 +
 2 files changed, 25 insertions(+), 4 deletions(-)

-- 
2.21.0

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
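The masking scheme the cover letter describes — mask at the device in the primary handler, reap completions in the irq thread, then unmask — can be sketched as a user-space simulation. All names here are hypothetical stand-ins for the driver and device primitives, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device-side interrupt mask (INTMS/INTMC
 * for legacy/MSI, or the per-vector MSI-X mask bit). */
static bool irq_masked;
static int hardirq_runs;	/* primary-handler invocations */

/* Device attempts to deliver an interrupt message. */
static void device_fires(void)
{
	if (!irq_masked)
		hardirq_runs++;
}

/* Primary handler: mask at the device, then wake the irq thread. */
static void primary_handler(void)
{
	irq_masked = true;	/* e.g. writel(INTMS) or MSI-X mask */
}

/* Threaded handler: process completions, then unmask. */
static void thread_handler(void)
{
	irq_masked = false;	/* e.g. writel(INTMC) or MSI-X unmask */
}
```

With the mask in place, completions that arrive while the thread is running no longer re-enter the primary handler, which is what keeps the spurious-interrupt accounting quiet.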
* [PATCHv2 1/2] PCI/MSI: Export __pci_msix_desc_mask_irq

From: Keith Busch @ 2019-12-02 22:20 UTC (permalink / raw)
To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, helgaas, Keith Busch, tglx, hch

Export the fast msix mask for drivers to use when the read-back is
undesired.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/msi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 0884bedcfc7a..9e866929f4b0 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -225,6 +225,7 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 
 	return mask_bits;
 }
+EXPORT_SYMBOL_GPL(__pci_msix_desc_mask_irq);
 
 static void msix_mask_irq(struct msi_desc *desc, u32 flag)
 {
-- 
2.21.0
* Re: [PATCHv2 1/2] PCI/MSI: Export __pci_msix_desc_mask_irq

From: Christoph Hellwig @ 2019-12-02 22:46 UTC (permalink / raw)
To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, tglx, hch

On Tue, Dec 03, 2019 at 07:20:57AM +0900, Keith Busch wrote:
> Export the fast msix mask for drivers to use when the read-back is
> undesired.

As said last time, calling this directly seems wrong as it breaks the
irq_chip abstraction. But looking at the disable_irq_nosync() semantics,
I think that function should do a non-posted disable for MSI(-X)
interrupts. Can you look into that?
* Re: [PATCHv2 1/2] PCI/MSI: Export __pci_msix_desc_mask_irq

From: Sebastian Andrzej Siewior @ 2019-12-03 9:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: sagi, linux-nvme, ming.lei, helgaas, Keith Busch, tglx

On 2019-12-02 23:46:03 [+0100], Christoph Hellwig wrote:
> On Tue, Dec 03, 2019 at 07:20:57AM +0900, Keith Busch wrote:
> > Export the fast msix mask for drivers to use when the read-back is
> > undesired.
> 
> As said last time, calling this directly seems wrong as it breaks the
> irq_chip abstraction. But looking at the disable_irq_nosync() semantics,
> I think that function should do a non-posted disable for MSI(-X)
> interrupts. Can you look into that?

Using disable_irq_nosync() would be the same as using IRQF_ONESHOT, which
is the preferred way. Keith complained that this is slow, and that
avoiding the read-back is noticeable.

The generic way would be pci_msi_mask_irq(), and the difference

| 	msix_mask_irq(desc, flag);
| 	readl(desc->mask_base);		/* Flush write to device */

is that flush.

Sebastian
* Re: [PATCHv2 1/2] PCI/MSI: Export __pci_msix_desc_mask_irq

From: Keith Busch @ 2019-12-06 21:18 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: sagi, linux-nvme, ming.lei, helgaas, tglx, Christoph Hellwig

On Tue, Dec 03, 2019 at 10:04:54AM +0100, Sebastian Andrzej Siewior wrote:
> Using disable_irq_nosync() would be the same as using IRQF_ONESHOT, which
> is the preferred way. Keith complained that this is slow, and that
> avoiding the read-back is noticeable.
> 
> The generic way would be pci_msi_mask_irq(), and the difference
> 
> | 	msix_mask_irq(desc, flag);
> | 	readl(desc->mask_base);		/* Flush write to device */
> 
> is that flush.

Right, so the solution should simply be to remove the readl(). I'm pretty
sure that's safe to do: if mask_irq() returns before the device happens
to see that the mask is set and generates an undesired interrupt, the irq
flow handler will observe irqd_irq_disabled() and return early. We have
to deal with that anyway because the device may have sent an interrupt
message at the same time the CPU was masking it; the two cases look the
same from the CPU's perspective.

I can't be completely sure it's safe for everyone though, so I'll try to
quantify the impact of the read-back on nvme with some real hardware,
because I'm starting to wonder if this is really as important as I
initially thought.

If we do two readl()'s per IO, that's pretty noticeable. It looks like
it's adding about 1usec to the completion latency (plus or minus; it
depends on the platform and whether switches are involved). But we
certainly don't need this for each IO.

For low-depth workloads, we can just handle all completions in the
primary handler and never mask interrupts. If there are lots of
completions that are better handled in nvme_irq_thread(), we can call
disable_irq_nosync(). That doesn't immediately mask MSI-X because we
didn't set IRQ_DISABLE_UNLAZY: the mask won't happen until we see another
interrupt, which means the thread is going to be handling a lot of
completions. The ratio of MSI-X mask writes to commands processed would
be quite low, so as far as I can tell, the overhead seems pretty
negligible.
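The lazy-disable behaviour described above can be modelled in a few lines. This is a sketch under the stated assumption that IRQ_DISABLE_UNLAZY is not set; the names are illustrative stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

static bool soft_disabled;	/* what irqd_irq_disabled() would report */
static bool hw_masked;		/* the actual MSI-X mask bit */
static int mask_writes;		/* MMIO writes spent on masking */

/* disable_irq_nosync() without IRQ_DISABLE_UNLAZY: marks the irq
 * disabled in software but issues no MMIO. */
static void lazy_disable(void)
{
	soft_disabled = true;
}

/* What the flow handler does when the next interrupt arrives:
 * the hardware mask is written on demand, not before. */
static void interrupt_arrives(void)
{
	if (soft_disabled && !hw_masked) {
		hw_masked = true;
		mask_writes++;
	}
}
```

The point of the model is that a mask write only happens when the device actually fires again, so a queue that goes quiet after the thread drains it never pays for the MMIO at all.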
* [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers

From: Keith Busch @ 2019-12-02 22:20 UTC (permalink / raw)
To: linux-nvme; +Cc: sagi, bigeasy, ming.lei, helgaas, Keith Busch, tglx, hch

Local interrupts are reenabled when the nvme irq thread is woken, so
subsequent MSI or level-triggered interrupts may restart the nvme irq
check while the thread handler is running. This unnecessarily spends CPU
cycles and potentially triggers spurious interrupt detection, disabling
our nvme irq.

Prevent the controller from sending future messages. For legacy and MSI,
use the nvme interrupt mask/clear registers. For MSI-X, use the quick
control mask function.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0590640ba62c..4022a872d29c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -13,8 +13,10 @@
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
+#include <linux/irq.h>
 #include <linux/mm.h>
 #include <linux/module.h>
+#include <linux/msi.h>
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
@@ -1036,12 +1038,29 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	return ret;
 }
 
+static irqreturn_t nvme_irq_thread(int irq, void *data)
+{
+	struct nvme_queue *nvmeq = data;
+
+	nvme_irq(irq, data);
+	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
+	else
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);
+	return IRQ_HANDLED;
+}
+
 static irqreturn_t nvme_irq_check(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
-	if (nvme_cqe_pending(nvmeq))
-		return IRQ_WAKE_THREAD;
-	return IRQ_NONE;
+
+	if (!nvme_cqe_pending(nvmeq))
+		return IRQ_NONE;
+	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
+		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
+	else
+		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
+	return IRQ_WAKE_THREAD;
 }
 
 /*
@@ -1499,7 +1518,8 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 
 	if (use_threaded_interrupts) {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+				nvme_irq_thread, nvmeq, "nvme%dq%d", nr,
+				nvmeq->qid);
 	} else {
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
 				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
-- 
2.21.0
* Re: [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers

From: Christoph Hellwig @ 2019-12-03 7:47 UTC (permalink / raw)
To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, tglx, hch

On Tue, Dec 03, 2019 at 07:20:58AM +0900, Keith Busch wrote:
> +static irqreturn_t nvme_irq_thread(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +
> +	nvme_irq(irq, data);
> +	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> +		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 0);
> +	else
> +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);

So independent of the indirection issue, can we have a theory of
operation on why not using a read to flush the posted writes that
disable/enable the interrupt (either variant) is fine here? Assume a
worst-case implementation where no write ever gets delivered to the
device until we do a read: neither of them will ever hit the device, as
we don't really do MMIO reads in nvme during normal operation.
* Re: [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers

From: Keith Busch @ 2019-12-03 12:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, tglx

On Tue, Dec 03, 2019 at 08:47:23AM +0100, Christoph Hellwig wrote:
> So independent of the indirection issue, can we have a theory of
> operation on why not using a read to flush the posted writes that
> disable/enable the interrupt (either variant) is fine here? Assume a
> worst-case implementation where no write ever gets delivered to the
> device until we do a read: neither of them will ever hit the device, as
> we don't really do MMIO reads in nvme during normal operation.

Sure. First, if the masking MMIO write is stalled, the read-back doesn't
make it complete any faster; it just means the flush takes that much
longer. Also note that the write can't be reordered with other writes to
that device (you'd have to use writel_relaxed() for that).

We notify the nvme controller of new commands with doorbell writes. If
all writes are stalled, the device doesn't have new commands to complete,
so there won't be any interrupts for the driver to handle. If we want to
craft a worst-case scenario where the device is aware of the maximum
possible commands on a queue (1023 for this driver), we could
theoretically observe that many MSIs for exactly one IRQ_HANDLED return.
This is below the 0.01% threshold for "nobody cared".

Subsequent new commands would be stuck behind the MSI mask write, so
there can't possibly be new IO while that mask write is stuck. So, I
think if the device never sees the MSI masking while the thread is
reaping the completion queue, we'll just spend CPU time in an unnecessary
irq handler that would otherwise have been consumed by a flushing
readl().
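The ordering argument above — a stalled mask write also stalls every later doorbell write to the same device, so no new IO can arrive while the mask is in flight — can be illustrated with an in-order posted-write queue. This is a toy model of the guarantee being relied on, not PCIe semantics in full:

```c
#include <assert.h>

enum wtype { W_MASK, W_DOORBELL };

static enum wtype posted[16];	/* posted writes, completed in order */
static int head, tail;
static int masked;		/* device has observed the mask write */
static int doorbells_seen;	/* doorbells the device has observed */

/* CPU posts a write; it may sit in the queue indefinitely. */
static void post_write(enum wtype w)
{
	posted[tail++] = w;
}

/* Device drains one posted write, strictly in posting order. */
static void drain_one(void)
{
	enum wtype w = posted[head++];

	if (w == W_MASK)
		masked = 1;
	else
		doorbells_seen++;
}
```

Because the queue is drained in order, a doorbell posted after the mask write can never reach the device before the mask does — which is why the interrupts generated in the window are bounded by the commands already outstanding.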
* Re: [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers

From: Sironi, Filippo @ 2019-12-04 10:10 UTC (permalink / raw)
To: Keith Busch; +Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, tglx, hch

> On 2. Dec 2019, at 23:20, Keith Busch <kbusch@kernel.org> wrote:
> 
> Local interrupts are reenabled when the nvme irq thread is woken, so
> subsequent MSI or level-triggered interrupts may restart the nvme irq
> check while the thread handler is running. This unnecessarily spends CPU
> cycles and potentially triggers spurious interrupt detection, disabling
> our nvme irq.
> 
> Prevent the controller from sending future messages. For legacy and MSI,
> use the nvme interrupt mask/clear registers. For MSI-X, use the quick
> control mask function.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
[...]
> static irqreturn_t nvme_irq_check(int irq, void *data)
> {
> 	struct nvme_queue *nvmeq = data;
> -	if (nvme_cqe_pending(nvmeq))
> -		return IRQ_WAKE_THREAD;
> -	return IRQ_NONE;
> +
> +	if (!nvme_cqe_pending(nvmeq))
> +		return IRQ_NONE;
> +	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> +		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);

Have you considered that __pci_msix_desc_mask_irq() will cause a trap
from guest to hypervisor mode when running virtualized?

> +	else
> +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);

What's stopping us from always using this method?

> +	return IRQ_WAKE_THREAD;
> }
[...]

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
* Re: [PATCHv2 2/2] nvme/pci: Mask device interrupts for threaded handlers

From: hch @ 2019-12-04 13:58 UTC (permalink / raw)
To: Sironi, Filippo
Cc: sagi, bigeasy, linux-nvme, ming.lei, helgaas, Keith Busch, tglx, hch

On Wed, Dec 04, 2019 at 10:10:05AM +0000, Sironi, Filippo wrote:
> > +	if (to_pci_dev(nvmeq->dev->dev)->msix_enabled)
> > +		__pci_msix_desc_mask_irq(irq_get_msi_desc(irq), 1);
> 
> Have you considered that __pci_msix_desc_mask_irq() will cause a trap
> from guest to hypervisor mode when running virtualized?

As mentioned in my reply, we need to route this through the irq_chip
abstraction. With that, a PV irq chip handler doesn't need to trap if it
has smarter means available.

> > +	else
> > +		writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
> 
> What's stopping us from always using this method?

The fact that the NVMe spec doesn't specify it for MSI-X. From section
3.1.3 of NVMe 1.4:

  "This register is used to mask interrupts when using pin-based
  interrupts, single message MSI, or multiple message MSI. When using
  MSI-X, the interrupt mask table defined as part of MSI-X should be used
  to mask interrupts. Host software shall not access this register when
  configured for MSI-X; any accesses when configured for MSI-X is
  undefined. For interrupt behavior requirements, refer to section 7.5."