linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Bjorn Helgaas <helgaas@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
	Keith Busch <keith.busch@intel.com>,
	Marc Zyngier <marc.zyngier@arm.com>,
	Sumit Saxena <sumit.saxena@broadcom.com>,
	Kashyap Desai <kashyap.desai@broadcom.com>,
	Shivasharan Srikanteshwara 
	<shivasharan.srikanteshwara@broadcom.com>
Subject: [patch v6 4/7] nvme-pci: Simplify interrupt allocation
Date: Sat, 16 Feb 2019 18:13:10 +0100	[thread overview]
Message-ID: <20190216172228.602546658@linutronix.de> (raw)
In-Reply-To: 20190216171306.403545970@linutronix.de

From: Ming Lei <ming.lei@redhat.com>

The NVME PCI driver contains a tedious mechanism for interrupt
allocation, which is necessary to adjust the number and size of interrupt
sets to the maximum available number of interrupts which depends on the
underlying PCI capabilities and the available CPU resources.

It works around the former short comings of the PCI and core interrupt
allocation mechanims in combination with interrupt sets.

The PCI interrupt allocation function allows to provide a maximum and a
minimum number of interrupts to be allocated and tries to allocate as
many as possible. This worked without driver interaction as long as there
was only a single set of interrupts to handle.

With the addition of support for multiple interrupt sets in the generic
affinity spreading logic, which is invoked from the PCI interrupt
allocation, the adaptive loop in the PCI interrupt allocation did not
work for multiple interrupt sets. The reason is that depending on the
total number of interrupts which the PCI allocation adaptive loop tries
to allocate in each step, the number and the size of the interrupt sets
need to be adapted as well. Due to the way the interrupt sets support was
implemented there was no way for the PCI interrupt allocation code or the
core affinity spreading mechanism to invoke a driver specific function
for adapting the interrupt sets configuration.

As a consequence the driver had to implement another adaptive loop around
the PCI interrupt allocation function and calling that with maximum and
minimum interrupts set to the same value. This ensured that the
allocation either succeeded or immediately failed without any attempt to
adjust the number of interrupts in the PCI code.

The core code now allows drivers to provide a callback to recalculate the
number and the size of interrupt sets during PCI interrupt allocation,
which in turn allows the PCI interrupt allocation function to be called
in the same way as with a single set of interrupts. The PCI code handles
the adaptive loop and the interrupt affinity spreading mechanism invokes
the driver callback to adapt the interrupt set configuration to the
current loop value. This replaces the adaptive loop in the driver
completely.

Implement the NVME specific callback which adjusts the interrupt sets
configuration and remove the adaptive allocation loop.

[ tglx: Simplify the callback further and restore the dropped adjustment of
  	number of sets ]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 drivers/nvme/host/pci.c |  116 ++++++++++++++++--------------------------------
 1 file changed, 39 insertions(+), 77 deletions(-)

Index: b/drivers/nvme/host/pci.c
===================================================================
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2041,41 +2041,42 @@ static int nvme_setup_host_mem(struct nv
 	return ret;
 }
 
-/* irq_queues covers admin queue */
-static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
+/*
+ * nirqs is the number of interrupts available for write and read
+ * queues. The core already reserved an interrupt for the admin queue.
+ */
+static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 {
-	unsigned int this_w_queues = write_queues;
-
-	WARN_ON(!irq_queues);
-
-	/*
-	 * Setup read/write queue split, assign admin queue one independent
-	 * irq vector if irq_queues is > 1.
-	 */
-	if (irq_queues <= 2) {
-		dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
-		dev->io_queues[HCTX_TYPE_READ] = 0;
-		return;
-	}
-
-	/*
-	 * If 'write_queues' is set, ensure it leaves room for at least
-	 * one read queue and one admin queue
-	 */
-	if (this_w_queues >= irq_queues)
-		this_w_queues = irq_queues - 2;
+	struct nvme_dev *dev = affd->priv;
+	unsigned int nr_read_queues;
 
 	/*
-	 * If 'write_queues' is set to zero, reads and writes will share
-	 * a queue set.
-	 */
-	if (!this_w_queues) {
-		dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues - 1;
-		dev->io_queues[HCTX_TYPE_READ] = 0;
+	 * If there is no interupt available for queues, ensure that
+	 * the default queue is set to 1. The affinity set size is
+	 * also set to one, but the irq core ignores it for this case.
+	 *
+	 * If only one interrupt is available or 'write_queue' == 0, combine
+	 * write and read queues.
+	 *
+	 * If 'write_queues' > 0, ensure it leaves room for at least one read
+	 * queue.
+	 */
+	if (!nrirqs) {
+		nrirqs = 1;
+		nr_read_queues = 0;
+	} else if (nrirqs == 1 || !write_queues) {
+		nr_read_queues = 0;
+	} else if (write_queues >= nrirqs) {
+		nr_read_queues = 1;
 	} else {
-		dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
-		dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues - 1;
+		nr_read_queues = nrirqs - write_queues;
 	}
+
+	dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
+	affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
+	dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
+	affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
+	affd->nr_sets = nr_read_queues ? 2 : 1;
 }
 
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
@@ -2083,10 +2084,9 @@ static int nvme_setup_irqs(struct nvme_d
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	struct irq_affinity affd = {
 		.pre_vectors	= 1,
-		.nr_sets	= 2,
+		.calc_sets	= nvme_calc_irq_sets,
+		.priv		= dev,
 	};
-	unsigned int *irq_sets = affd.set_size;
-	int result = 0;
 	unsigned int irq_queues, this_p_queues;
 
 	/*
@@ -2102,51 +2102,12 @@ static int nvme_setup_irqs(struct nvme_d
 	}
 	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
 
-	/*
-	 * For irq sets, we have to ask for minvec == maxvec. This passes
-	 * any reduction back to us, so we can adjust our queue counts and
-	 * IRQ vector needs.
-	 */
-	do {
-		nvme_calc_io_queues(dev, irq_queues);
-		irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
-		irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
-		if (!irq_sets[1])
-			affd.nr_sets = 1;
-
-		/*
-		 * If we got a failure and we're down to asking for just
-		 * 1 + 1 queues, just ask for a single vector. We'll share
-		 * that between the single IO queue and the admin queue.
-		 * Otherwise, we assign one independent vector to admin queue.
-		 */
-		if (irq_queues > 1)
-			irq_queues = irq_sets[0] + irq_sets[1] + 1;
+	/* Initialize for the single interrupt case */
+	dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
+	dev->io_queues[HCTX_TYPE_READ] = 0;
 
-		result = pci_alloc_irq_vectors_affinity(pdev, irq_queues,
-				irq_queues,
-				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
-
-		/*
-		 * Need to reduce our vec counts. If we get ENOSPC, the
-		 * platform should support mulitple vecs, we just need
-		 * to decrease our ask. If we get EINVAL, the platform
-		 * likely does not. Back down to ask for just one vector.
-		 */
-		if (result == -ENOSPC) {
-			irq_queues--;
-			if (!irq_queues)
-				return result;
-			continue;
-		} else if (result == -EINVAL) {
-			irq_queues = 1;
-			continue;
-		} else if (result <= 0)
-			return -EIO;
-		break;
-	} while (1);
-
-	return result;
+	return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
+			      PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 }
 
 static void nvme_disable_io_queues(struct nvme_dev *dev)
@@ -3021,6 +2982,7 @@ static struct pci_driver nvme_driver = {
 
 static int __init nvme_init(void)
 {
+	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
 	return pci_register_driver(&nvme_driver);
 }
 



  parent reply	other threads:[~2019-02-16 17:26 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-16 17:13 [patch v6 0/7] genirq/affinity: Overhaul the multiple interrupt sets support Thomas Gleixner
2019-02-16 17:13 ` [patch v6 1/7] genirq/affinity: Code consolidation Thomas Gleixner
2019-02-17 13:36   ` Ming Lei
2019-02-18 10:25   ` [tip:irq/core] " tip-bot for Thomas Gleixner
2019-02-16 17:13 ` [patch v6 2/7] genirq/affinity: Store interrupt sets size in struct irq_affinity Thomas Gleixner
2019-02-18 10:26   ` [tip:irq/core] " tip-bot for Ming Lei
2019-02-16 17:13 ` [patch v6 3/7] genirq/affinity: Add new callback for (re)calculating interrupt sets Thomas Gleixner
2019-02-18 10:27   ` [tip:irq/core] " tip-bot for Ming Lei
2021-06-15 19:57   ` [patch v6 3/7] " Bjorn Helgaas
2021-06-15 20:04     ` Christoph Hellwig
2021-06-16  0:40     ` Ming Lei
2021-06-18 19:32       ` Thomas Gleixner
2021-06-18 19:19     ` Thomas Gleixner
2019-02-16 17:13 ` Thomas Gleixner [this message]
2019-02-18 10:27   ` [tip:irq/core] nvme-pci: Simplify interrupt allocation tip-bot for Ming Lei
2019-02-16 17:13 ` [patch v6 5/7] genirq/affinity: Remove the leftovers of the original set support Thomas Gleixner
2019-02-17 13:39   ` Ming Lei
2019-02-18 10:28   ` [tip:irq/core] " tip-bot for Thomas Gleixner
2019-02-16 17:13 ` [patch v6 6/7] PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets Thomas Gleixner
2019-02-17 13:39   ` Ming Lei
2019-02-18 10:29   ` [tip:irq/core] " tip-bot for Thomas Gleixner
2019-02-16 17:13 ` [patch v6 7/7] genirq/affinity: Add support for non-managed affinity sets Thomas Gleixner
2019-02-17 13:45   ` Ming Lei
2019-02-17 19:17     ` Thomas Gleixner
2019-02-18  2:49       ` Ming Lei
2019-02-18  7:25         ` Thomas Gleixner
2019-02-18  8:43 ` [patch v6 0/7] genirq/affinity: Overhaul the multiple interrupt sets support Marc Zyngier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190216172228.602546658@linutronix.de \
    --to=tglx@linutronix.de \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=helgaas@kernel.org \
    --cc=kashyap.desai@broadcom.com \
    --cc=keith.busch@intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=marc.zyngier@arm.com \
    --cc=ming.lei@redhat.com \
    --cc=sagi@grimberg.me \
    --cc=shivasharan.srikanteshwara@broadcom.com \
    --cc=sumit.saxena@broadcom.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).