* [PATCHv2 0/4] NVMe IRQ sets fixups
@ 2019-01-03 22:50 Keith Busch
  2019-01-03 22:50 ` [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once Keith Busch
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-03 22:50 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci
  Cc: Keith Busch

Changes from v1:

  Added documentation to Documentation/PCI/MSI-HOWTO.txt

  Grammar, spelling, and format fixes to commit logs and code comments.

Keith Busch (4):
  nvme-pci: Set tagset nr_maps just once
  nvme-pci: Distribute io queue types after creation
  PCI/MSI: Handle vector reduce and retry
  nvme-pci: Use PCI to handle IRQ reduce and retry

 Documentation/PCI/MSI-HOWTO.txt |  36 +++++++++++-
 drivers/nvme/host/pci.c         | 126 +++++++++++++++++++---------------------
 drivers/pci/msi.c               |  20 ++-----
 include/linux/interrupt.h       |   5 ++
 4 files changed, 107 insertions(+), 80 deletions(-)

-- 
2.14.4


* [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once
  2019-01-03 22:50 [PATCHv2 0/4] NVMe IRQ sets fixups Keith Busch
@ 2019-01-03 22:50 ` Keith Busch
  2019-01-04  1:46   ` Ming Lei
  2019-01-03 22:50 ` [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2019-01-03 22:50 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci
  Cc: Keith Busch

The driver overwrites the intermediate nr_maps assignments with
HCTX_MAX_TYPES, so remove those unnecessary temporary settings.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5a0bf6a24d50..98332d0a80f0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2291,9 +2291,6 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
-		dev->tagset.nr_maps = 2; /* default + read */
-		if (dev->io_queues[HCTX_TYPE_POLL])
-			dev->tagset.nr_maps++;
 		dev->tagset.nr_maps = HCTX_MAX_TYPES;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
-- 
2.14.4


* [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-03 22:50 [PATCHv2 0/4] NVMe IRQ sets fixups Keith Busch
  2019-01-03 22:50 ` [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once Keith Busch
@ 2019-01-03 22:50 ` Keith Busch
  2019-01-04  2:31   ` Ming Lei
  2019-01-03 22:50 ` [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry Keith Busch
  2019-01-03 22:50 ` [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ " Keith Busch
  3 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2019-01-03 22:50 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci
  Cc: Keith Busch

The dev->io_queues types were set based on the results of the nvme set
feature "number of queues" and the IRQ allocation. This result does not
mean we're going to successfully allocate and create those IO queues,
though. A failure there will cause blk-mq to have NULL hctx's because the
map's nr_hw_queues accounts for more queues than were actually created.

Adjust the io_queue types after we've created them when we have fewer
than originally desired.

Fixes: 3b6592f70ad7b ("nvme: utilize two queue maps, one for reads and one for writes")
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 98332d0a80f0..1481bb6d9c42 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1733,6 +1733,30 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 	return result;
 }
 
+static void nvme_distribute_queues(struct nvme_dev *dev, unsigned int io_queues)
+{
+	unsigned int irq_queues, this_p_queues = dev->io_queues[HCTX_TYPE_POLL],
+		     this_w_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
+
+	if (!io_queues) {
+		dev->io_queues[HCTX_TYPE_POLL] = 0;
+		dev->io_queues[HCTX_TYPE_DEFAULT] = 0;
+		dev->io_queues[HCTX_TYPE_READ] = 0;
+		return;
+	}
+
+	if (this_p_queues >= io_queues)
+		this_p_queues = io_queues - 1;
+	irq_queues = io_queues - this_p_queues;
+
+	if (this_w_queues > irq_queues)
+		this_w_queues = irq_queues;
+
+	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
+	dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
+	dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues;
+}
+
 static int nvme_create_io_queues(struct nvme_dev *dev)
 {
 	unsigned i, max, rw_queues;
@@ -1761,6 +1785,13 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 			break;
 	}
 
+	/*
+	 * If we've created less than expected io queues, redistribute the
+	 * dev->io_queues[] types accordingly.
+	 */
+	if (dev->online_queues - 1 != dev->max_qid)
+		nvme_distribute_queues(dev, dev->online_queues - 1);
+
 	/*
 	 * Ignore failing Create SQ/CQ commands, we can continue with less
 	 * than the desired amount of queues, and even a controller without
@@ -2185,11 +2216,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	result = max(result - 1, 1);
 	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
 
-	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
-					dev->io_queues[HCTX_TYPE_DEFAULT],
-					dev->io_queues[HCTX_TYPE_READ],
-					dev->io_queues[HCTX_TYPE_POLL]);
-
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
@@ -2203,7 +2229,15 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		return result;
 	}
 	set_bit(NVMEQ_ENABLED, &adminq->flags);
-	return nvme_create_io_queues(dev);
+	result = nvme_create_io_queues(dev);
+
+	if (!result)
+		dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
+					dev->io_queues[HCTX_TYPE_DEFAULT],
+					dev->io_queues[HCTX_TYPE_READ],
+					dev->io_queues[HCTX_TYPE_POLL]);
+	return result;
+
 }
 
 static void nvme_del_queue_end(struct request *req, blk_status_t error)
-- 
2.14.4


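A worked example of the redistribution above (the numbers are
illustrative): suppose the IRQ allocation granted 2 poll, 4 default and
2 read queues, but only 5 io queues could actually be created.
nvme_distribute_queues(dev, 5) then computes

  this_p_queues = 2                (2 < 5, so the poll count is kept)
  irq_queues    = 5 - 2 = 3
  this_w_queues = min(4, 3) = 3
  io_queues[]   = 2/3/0 poll/default/read

so the queue maps never account for more hardware contexts than were
actually created.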

* [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
  2019-01-03 22:50 [PATCHv2 0/4] NVMe IRQ sets fixups Keith Busch
  2019-01-03 22:50 ` [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once Keith Busch
  2019-01-03 22:50 ` [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation Keith Busch
@ 2019-01-03 22:50 ` Keith Busch
  2019-01-04  2:45   ` Ming Lei
  2019-01-04 22:35   ` Bjorn Helgaas
  2019-01-03 22:50 ` [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ " Keith Busch
  3 siblings, 2 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-03 22:50 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci
  Cc: Keith Busch

The struct irq_affinity nr_sets forced the driver to handle reducing the
vector count on allocation failures because the set distribution counts
are driver specific. The change to this API requires very different usage
than before, and introduced new error corner cases that weren't being
handled. It is also less efficient: the driver doesn't actually know
what vector count it should use, since it only sees the error code and
can only reduce by one instead of going straight to a possible vector
count like PCI is able to do.

Provide a driver specific callback for managed irq set creation so that
PCI can take a min and max vectors as before to handle the reduce and
retry logic.

The usage is not particularly obvious for this new feature, so append
documentation for driver usage.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
 drivers/pci/msi.c               | 20 ++++++--------------
 include/linux/interrupt.h       |  5 +++++
 3 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 618e13d5e276..391b1f369138 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -98,7 +98,41 @@ The flags argument is used to specify which type of interrupt can be used
 by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
 A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
 any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
-pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
+pci_alloc_irq_vectors() will spread the interrupts around the available
+CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
+managed by the kernel, and are not tunable from user space like other
+vectors.
+
+When your driver requires a more complex vector affinity configuration
+than a default spread of all vectors, the driver may use the following
+function:
+
+  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
+				   unsigned int max_vecs, unsigned int flags,
+				   const struct irq_affinity *affd);
+
+The 'struct irq_affinity *affd' allows a driver to specify additional
+characteristics for how a driver wants the vector management to occur. The
+'pre_vectors' and 'post_vectors' fields define how many vectors the driver
+wants to not participate in kernel managed affinities, and whether those
+special vectors are at the beginning or the end of the vector space.
+
+It may also be the case that a driver wants multiple sets of fully
+affinitized vectors. For example, a single PCI function may provide
+different high performance services that want full CPU affinity for each
+service independent of other services. In this case, the driver may use
+the struct irq_affinity's 'nr_sets' field to specify how many groups of
+vectors need to be spread across all the CPUs, and fill in the 'sets'
+array to say how many vectors the driver wants in each set.
+
+When using multiple affinity 'sets', the error handling for vector
+reduction and retry becomes more complicated since the PCI core
+doesn't know how to redistribute the vector count across the sets. In
+order to provide this error handling, the driver must also provide the
+'recalc_sets()' callback and set the 'priv' data needed for the driver
+specific vector distribution. The driver's callback is responsible for
+ensuring the sum of the vector counts across its sets matches the new
+vector count that PCI can allocate.
 
 To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
 vectors, use the following function:
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 7a1c8a09efa5..b93ac49be18d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 	if (maxvec < minvec)
 		return -ERANGE;
 
-	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * vectors. The caller needs to handle that.
-	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(dev->msi_enabled))
 		return -EINVAL;
 
@@ -1061,6 +1054,9 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 				return -ENOSPC;
 		}
 
+		if (nvec != maxvec && affd && affd->recalc_sets)
+			affd->recalc_sets((struct irq_affinity *)affd, nvec);
+
 		rc = msi_capability_init(dev, nvec, affd);
 		if (rc == 0)
 			return nvec;
@@ -1093,13 +1089,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 	if (maxvec < minvec)
 		return -ERANGE;
 
-	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * supported vectors. The caller needs to handle that.
-	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(dev->msix_enabled))
 		return -EINVAL;
 
@@ -1110,6 +1099,9 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 				return -ENOSPC;
 		}
 
+		if (nvec != maxvec && affd && affd->recalc_sets)
+			affd->recalc_sets((struct irq_affinity *)affd, nvec);
+
 		rc = __pci_enable_msix(dev, entries, nvec, affd);
 		if (rc == 0)
 			return nvec;
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..01c06829ff43 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -249,12 +249,17 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @nr_sets:		Length of passed in *sets array
  * @sets:		Number of affinitized sets
+ * @recalc_sets:	Recalculate sets if the previously requested allocation
+ *			failed
+ * @priv:		Driver private data
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
 	int	nr_sets;
 	int	*sets;
+	void	(*recalc_sets)(struct irq_affinity *, unsigned int);
+	void	*priv;
 };
 
 /**
-- 
2.14.4


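A minimal usage sketch for the callback introduced above (the foo_*
names are illustrative, not part of this series):

  static void foo_recalc_sets(struct irq_affinity *affd, unsigned int nvecs)
  {
          struct foo_dev *foo = affd->priv;

          /* Redistribute however the driver likes; the set sizes must
           * sum to the vector count PCI can actually allocate. */
          affd->sets[0] = max(1U, nvecs / 2);
          affd->sets[1] = nvecs - affd->sets[0];
          if (!affd->sets[1])
                  affd->nr_sets = 1;

          /* remember the final split for request queue setup */
          foo->nr_set0_vecs = affd->sets[0];
          foo->nr_set1_vecs = affd->sets[1];
  }

  static int foo_setup_irqs(struct foo_dev *foo, unsigned int min_vecs,
                            unsigned int max_vecs)
  {
          int irq_sets[2];
          struct irq_affinity affd = {
                  .nr_sets     = ARRAY_SIZE(irq_sets),
                  .sets        = irq_sets,
                  .recalc_sets = foo_recalc_sets,
                  .priv        = foo,
          };

          foo_recalc_sets(&affd, max_vecs);       /* initial distribution */
          return pci_alloc_irq_vectors_affinity(foo->pdev, min_vecs, max_vecs,
                          PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
  }

On a reduce-and-retry, the PCI core calls foo_recalc_sets() with the
vector count it can actually provide before re-attempting the
allocation.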

* [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ reduce and retry
  2019-01-03 22:50 [PATCHv2 0/4] NVMe IRQ sets fixups Keith Busch
                   ` (2 preceding siblings ...)
  2019-01-03 22:50 ` [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry Keith Busch
@ 2019-01-03 22:50 ` Keith Busch
  2019-01-04  2:41   ` Ming Lei
  2019-01-04 18:19   ` Christoph Hellwig
  3 siblings, 2 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-03 22:50 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci
  Cc: Keith Busch

Restore error handling for vector allocation back to the PCI core.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 77 ++++++++++++++-----------------------------------
 1 file changed, 21 insertions(+), 56 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1481bb6d9c42..f3ef09a8e8f9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2059,37 +2059,43 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 	return ret;
 }
 
-static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
+static void nvme_calc_io_queues(struct irq_affinity *affd, unsigned int nvecs)
 {
+	struct nvme_dev *dev = affd->priv;
 	unsigned int this_w_queues = write_queues;
 
 	/*
 	 * Setup read/write queue split
 	 */
-	if (irq_queues == 1) {
+	if (nvecs == 1) {
 		dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
 		dev->io_queues[HCTX_TYPE_READ] = 0;
-		return;
+		goto set_sets;
 	}
 
 	/*
 	 * If 'write_queues' is set, ensure it leaves room for at least
 	 * one read queue
 	 */
-	if (this_w_queues >= irq_queues)
-		this_w_queues = irq_queues - 1;
+	if (this_w_queues >= nvecs - 1)
+		this_w_queues = nvecs - 1;
 
 	/*
 	 * If 'write_queues' is set to zero, reads and writes will share
 	 * a queue set.
 	 */
 	if (!this_w_queues) {
-		dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues;
+		dev->io_queues[HCTX_TYPE_DEFAULT] = nvecs - 1;
 		dev->io_queues[HCTX_TYPE_READ] = 0;
 	} else {
 		dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
-		dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues;
+		dev->io_queues[HCTX_TYPE_READ] = nvecs - this_w_queues - 1;
 	}
+set_sets:
+	affd->sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
+	affd->sets[1] = dev->io_queues[HCTX_TYPE_READ];
+	if (!affd->sets[1])
+		affd->nr_sets = 1;
 }
 
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
@@ -2100,9 +2106,10 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.pre_vectors = 1,
 		.nr_sets = ARRAY_SIZE(irq_sets),
 		.sets = irq_sets,
+		.recalc_sets = nvme_calc_io_queues,
+		.priv = dev,
 	};
-	int result = 0;
-	unsigned int irq_queues, this_p_queues;
+	unsigned int nvecs, this_p_queues;
 
 	/*
 	 * Poll queues don't need interrupts, but we need at least one IO
@@ -2111,56 +2118,14 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	this_p_queues = poll_queues;
 	if (this_p_queues >= nr_io_queues) {
 		this_p_queues = nr_io_queues - 1;
-		irq_queues = 1;
+		nvecs = 2;
 	} else {
-		irq_queues = nr_io_queues - this_p_queues;
+		nvecs = nr_io_queues - this_p_queues + 1;
 	}
 	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
-
-	/*
-	 * For irq sets, we have to ask for minvec == maxvec. This passes
-	 * any reduction back to us, so we can adjust our queue counts and
-	 * IRQ vector needs.
-	 */
-	do {
-		nvme_calc_io_queues(dev, irq_queues);
-		irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
-		irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
-		if (!irq_sets[1])
-			affd.nr_sets = 1;
-
-		/*
-		 * If we got a failure and we're down to asking for just
-		 * 1 + 1 queues, just ask for a single vector. We'll share
-		 * that between the single IO queue and the admin queue.
-		 */
-		if (result >= 0 && irq_queues > 1)
-			irq_queues = irq_sets[0] + irq_sets[1] + 1;
-
-		result = pci_alloc_irq_vectors_affinity(pdev, irq_queues,
-				irq_queues,
-				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
-
-		/*
-		 * Need to reduce our vec counts. If we get ENOSPC, the
-		 * platform should support mulitple vecs, we just need
-		 * to decrease our ask. If we get EINVAL, the platform
-		 * likely does not. Back down to ask for just one vector.
-		 */
-		if (result == -ENOSPC) {
-			irq_queues--;
-			if (!irq_queues)
-				return result;
-			continue;
-		} else if (result == -EINVAL) {
-			irq_queues = 1;
-			continue;
-		} else if (result <= 0)
-			return -EIO;
-		break;
-	} while (1);
-
-	return result;
+	nvme_calc_io_queues(&affd, nvecs);
+	return pci_alloc_irq_vectors_affinity(pdev, affd.pre_vectors, nvecs,
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)
-- 
2.14.4


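A worked example of the recalculation above (the numbers are
illustrative): with nvecs = 8 and write_queues = 3, this_w_queues stays
3 (3 < 7), so HCTX_TYPE_DEFAULT = 3 and HCTX_TYPE_READ = 8 - 3 - 1 = 4.
The sets become {3, 4}, and the remaining vector is the 'pre_vectors'
admin vector, so the set sizes plus pre_vectors add up to nvecs.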

* Re: [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once
  2019-01-03 22:50 ` [PATCHv2 1/4] nvme-pci: Set tagset nr_maps just once Keith Busch
@ 2019-01-04  1:46   ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-01-04  1:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Thu, Jan 03, 2019 at 03:50:30PM -0700, Keith Busch wrote:
> The driver overwrites the intermediate nr_maps assignments with
> HCTX_MAX_TYPES, so remove those unnecessary temporary settings.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/pci.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5a0bf6a24d50..98332d0a80f0 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2291,9 +2291,6 @@ static int nvme_dev_add(struct nvme_dev *dev)
>  	if (!dev->ctrl.tagset) {
>  		dev->tagset.ops = &nvme_mq_ops;
>  		dev->tagset.nr_hw_queues = dev->online_queues - 1;
> -		dev->tagset.nr_maps = 2; /* default + read */
> -		if (dev->io_queues[HCTX_TYPE_POLL])
> -			dev->tagset.nr_maps++;
>  		dev->tagset.nr_maps = HCTX_MAX_TYPES;
>  		dev->tagset.timeout = NVME_IO_TIMEOUT;
>  		dev->tagset.numa_node = dev_to_node(dev->dev);
> -- 
> 2.14.4
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-03 22:50 ` [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation Keith Busch
@ 2019-01-04  2:31   ` Ming Lei
  2019-01-04  7:21     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-01-04  2:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Thu, Jan 03, 2019 at 03:50:31PM -0700, Keith Busch wrote:
> The dev->io_queues types were set based on the results of the nvme set
> feature "number of queues" and the IRQ allocation. This result does not
> mean we're going to successfully allocate and create those IO queues,
> though. A failure there will cause blk-mq to have NULL hctx's because the
> map's nr_hw_queues accounts for more queues than were actually created.
> 
> Adjust the io_queue types after we've created them when we have fewer
> than originally desired.
> 
> Fixes: 3b6592f70ad7b ("nvme: utilize two queue maps, one for reads and one for writes")
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/pci.c | 46 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 98332d0a80f0..1481bb6d9c42 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1733,6 +1733,30 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
>  	return result;
>  }
>  
> +static void nvme_distribute_queues(struct nvme_dev *dev, unsigned int io_queues)
> +{
> +	unsigned int irq_queues, this_p_queues = dev->io_queues[HCTX_TYPE_POLL],
> +		     this_w_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> +
> +	if (!io_queues) {
> +		dev->io_queues[HCTX_TYPE_POLL] = 0;
> +		dev->io_queues[HCTX_TYPE_DEFAULT] = 0;
> +		dev->io_queues[HCTX_TYPE_READ] = 0;
> +		return;
> +	}
> +
> +	if (this_p_queues >= io_queues)
> +		this_p_queues = io_queues - 1;
> +	irq_queues = io_queues - this_p_queues;
> +
> +	if (this_w_queues > irq_queues)
> +		this_w_queues = irq_queues;
> +
> +	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
> +	dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
> +	dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues;
> +}
> +
>  static int nvme_create_io_queues(struct nvme_dev *dev)
>  {
>  	unsigned i, max, rw_queues;
> @@ -1761,6 +1785,13 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
>  			break;
>  	}
>  
> +	/*
> +	 * If we've created less than expected io queues, redistribute the
> +	 * dev->io_queues[] types accordingly.
> +	 */
> +	if (dev->online_queues - 1 != dev->max_qid)
> +		nvme_distribute_queues(dev, dev->online_queues - 1);
> +
>  	/*
>  	 * Ignore failing Create SQ/CQ commands, we can continue with less
>  	 * than the desired amount of queues, and even a controller without
> @@ -2185,11 +2216,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  	result = max(result - 1, 1);
>  	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
>  
> -	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> -					dev->io_queues[HCTX_TYPE_DEFAULT],
> -					dev->io_queues[HCTX_TYPE_READ],
> -					dev->io_queues[HCTX_TYPE_POLL]);
> -
>  	/*
>  	 * Should investigate if there's a performance win from allocating
>  	 * more queues than interrupt vectors; it might allow the submission
> @@ -2203,7 +2229,15 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  		return result;
>  	}
>  	set_bit(NVMEQ_ENABLED, &adminq->flags);
> -	return nvme_create_io_queues(dev);
> +	result = nvme_create_io_queues(dev);
> +
> +	if (!result)
> +		dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> +					dev->io_queues[HCTX_TYPE_DEFAULT],
> +					dev->io_queues[HCTX_TYPE_READ],
> +					dev->io_queues[HCTX_TYPE_POLL]);
> +	return result;
> +
>  }
>  
>  static void nvme_del_queue_end(struct request *req, blk_status_t error)
> -- 
> 2.14.4
> 

This way should be better given it covers both irq allocation failure
and queue creation/initialization failure.

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

* Re: [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ reduce and retry
  2019-01-03 22:50 ` [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ " Keith Busch
@ 2019-01-04  2:41   ` Ming Lei
  2019-01-04 18:19   ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-01-04  2:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Thu, Jan 03, 2019 at 03:50:33PM -0700, Keith Busch wrote:
> Restore error handling for vector allocation back to the PCI core.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/pci.c | 77 ++++++++++++++-----------------------------------
>  1 file changed, 21 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1481bb6d9c42..f3ef09a8e8f9 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2059,37 +2059,43 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
>  	return ret;
>  }
>  
> -static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
> +static void nvme_calc_io_queues(struct irq_affinity *affd, unsigned int nvecs)
>  {
> +	struct nvme_dev *dev = affd->priv;
>  	unsigned int this_w_queues = write_queues;
>  
>  	/*
>  	 * Setup read/write queue split
>  	 */
> -	if (irq_queues == 1) {
> +	if (nvecs == 1) {

The above line can be 'nvecs <= 2', because when nvecs is 2, one vector
is for the admin queue and the other can be for DEFAULT.

>  		dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
>  		dev->io_queues[HCTX_TYPE_READ] = 0;
> -		return;
> +		goto set_sets;
>  	}
>  
>  	/*
>  	 * If 'write_queues' is set, ensure it leaves room for at least
>  	 * one read queue
>  	 */
> -	if (this_w_queues >= irq_queues)
> -		this_w_queues = irq_queues - 1;
> +	if (this_w_queues >= nvecs - 1)
> +		this_w_queues = nvecs - 1;

If we want to leave room for one read queue, 'this_w_queues' should be
set to 'nvecs - 2', given that nvecs covers the admin queue.

>  
>  	/*
>  	 * If 'write_queues' is set to zero, reads and writes will share
>  	 * a queue set.
>  	 */
>  	if (!this_w_queues) {
> -		dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues;
> +		dev->io_queues[HCTX_TYPE_DEFAULT] = nvecs - 1;
>  		dev->io_queues[HCTX_TYPE_READ] = 0;
>  	} else {
>  		dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
> -		dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues;
> +		dev->io_queues[HCTX_TYPE_READ] = nvecs - this_w_queues - 1;
>  	}

In the above change, it looks like you start to account for the admin
queue vector, which is an obvious issue in the current code.

So I'd suggest fixing nvme_calc_io_queues() in one standalone patch,
just like what I posted, given this patch does more than just "Restore
error handling for vector allocation back to the PCI core".

http://lists.infradead.org/pipermail/linux-nvme/2018-December/021879.html

Thanks,
Ming

* Re: [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
  2019-01-03 22:50 ` [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry Keith Busch
@ 2019-01-04  2:45   ` Ming Lei
  2019-01-04 22:35   ` Bjorn Helgaas
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-01-04  2:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> The struct irq_affinity nr_sets forced the driver to handle reducing the
> vector count on allocation failures because the set distribution counts
> are driver specific. The change to this API requires very different usage
> than before, and introduced new error corner cases that weren't being
> handled. It is also less efficient: the driver doesn't actually know
> what vector count it should use, since it only sees the error code and
> can only reduce by one instead of going straight to a possible vector
> count like PCI is able to do.
> 
> Provide a driver specific callback for managed irq set creation so that
> PCI can take a min and max vectors as before to handle the reduce and
> retry logic.
> 
> The usage is not particularly obvious for this new feature, so append
> documentation for driver usage.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi.c               | 20 ++++++--------------
>  include/linux/interrupt.h       |  5 +++++
>  3 files changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 618e13d5e276..391b1f369138 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -98,7 +98,41 @@ The flags argument is used to specify which type of interrupt can be used
>  by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
>  A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
>  any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
> -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
> +pci_alloc_irq_vectors() will spread the interrupts around the available
> +CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
> +managed by the kernel, and are not tunable from user space like other
> +vectors.
> +
> +When your driver requires a more complex vector affinity configuration
> +than a default spread of all vectors, the driver may use the following
> +function:
> +
> +  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> +				   unsigned int max_vecs, unsigned int flags,
> +				   const struct irq_affinity *affd);
> +
> +The 'struct irq_affinity *affd' allows a driver to specify additional
> +characteristics for how a driver wants the vector management to occur. The
> +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> +wants to not participate in kernel managed affinities, and whether those
> +special vectors are at the beginning or the end of the vector space.
> +
> +It may also be the case that a driver wants multiple sets of fully
> +affinitized vectors. For example, a single PCI function may provide
> +different high performance services that want full CPU affinity for each
> +service independent of other services. In this case, the driver may use
> +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> +vectors need to be spread across all the CPUs, and fill in the 'sets'
> +array to say how many vectors the driver wants in each set.
> +
> +When using multiple affinity 'sets', the error handling for vector
> +reduction and retry becomes more complicated since the PCI core
> +doesn't know how to redistribute the vector count across the sets. In
> +order to provide this error handling, the driver must also provide the
> +'recalc_sets()' callback and set the 'priv' data needed for the driver
> +specific vector distribution. The driver's callback is responsible for
> +ensuring the sum of the vector counts across its sets matches the new
> +vector count that PCI can allocate.
>  
>  To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
>  vectors, use the following function:
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 7a1c8a09efa5..b93ac49be18d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msi_enabled))
>  		return -EINVAL;
>  
> @@ -1061,6 +1054,9 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +

->recalc_sets() should be done only after msi_capability_init() makes
sure at least 'minvec' vectors are available.

>  		rc = msi_capability_init(dev, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> @@ -1093,13 +1089,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * supported vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msix_enabled))
>  		return -EINVAL;
>  
> @@ -1110,6 +1099,9 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +

->recalc_sets() should be done only after __pci_enable_msix() makes
sure at least 'minvec' vectors are available.

Thanks,
Ming

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-04  2:31   ` Ming Lei
@ 2019-01-04  7:21     ` Ming Lei
  2019-01-04 15:53       ` Keith Busch
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-01-04  7:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Fri, Jan 04, 2019 at 10:31:21AM +0800, Ming Lei wrote:
> On Thu, Jan 03, 2019 at 03:50:31PM -0700, Keith Busch wrote:
> > The dev->io_queues types were set based on the results of the nvme set
> > feature "number of queues" and the IRQ allocation. This result does not
> > mean we're going to successfully allocate and create those IO queues,
> > though. A failure there will cause blk-mq to have NULL hctx's because the
> > map's nr_hw_queues accounts for more queues than were actually created.
> > 
> > Adjust the io_queue types after we've created them when we have fewer
> > than originally desired.
> > 
> > Fixes: 3b6592f70ad7b ("nvme: utilize two queue maps, one for reads and one for writes")
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/nvme/host/pci.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 98332d0a80f0..1481bb6d9c42 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1733,6 +1733,30 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
> >  	return result;
> >  }
> >  
> > +static void nvme_distribute_queues(struct nvme_dev *dev, unsigned int io_queues)
> > +{
> > +	unsigned int irq_queues, this_p_queues = dev->io_queues[HCTX_TYPE_POLL],
> > +		     this_w_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> > +
> > +	if (!io_queues) {
> > +		dev->io_queues[HCTX_TYPE_POLL] = 0;
> > +		dev->io_queues[HCTX_TYPE_DEFAULT] = 0;
> > +		dev->io_queues[HCTX_TYPE_READ] = 0;
> > +		return;
> > +	}
> > +
> > +	if (this_p_queues >= io_queues)
> > +		this_p_queues = io_queues - 1;
> > +	irq_queues = io_queues - this_p_queues;
> > +
> > +	if (this_w_queues > irq_queues)
> > +		this_w_queues = irq_queues;
> > +
> > +	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
> > +	dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
> > +	dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues;
> > +}
> > +
> >  static int nvme_create_io_queues(struct nvme_dev *dev)
> >  {
> >  	unsigned i, max, rw_queues;
> > @@ -1761,6 +1785,13 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
> >  			break;
> >  	}
> >  
> > +	/*
> > +	 * If we've created less than expected io queues, redistribute the
> > +	 * dev->io_queues[] types accordingly.
> > +	 */
> > +	if (dev->online_queues - 1 != dev->max_qid)
> > +		nvme_distribute_queues(dev, dev->online_queues - 1);
> > +
> >  	/*
> >  	 * Ignore failing Create SQ/CQ commands, we can continue with less
> >  	 * than the desired amount of queues, and even a controller without
> > @@ -2185,11 +2216,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
> >  	result = max(result - 1, 1);
> >  	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
> >  
> > -	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> > -					dev->io_queues[HCTX_TYPE_DEFAULT],
> > -					dev->io_queues[HCTX_TYPE_READ],
> > -					dev->io_queues[HCTX_TYPE_POLL]);
> > -
> >  	/*
> >  	 * Should investigate if there's a performance win from allocating
> >  	 * more queues than interrupt vectors; it might allow the submission
> > @@ -2203,7 +2229,15 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
> >  		return result;
> >  	}
> >  	set_bit(NVMEQ_ENABLED, &adminq->flags);
> > -	return nvme_create_io_queues(dev);
> > +	result = nvme_create_io_queues(dev);
> > +
> > +	if (!result)
> > +		dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> > +					dev->io_queues[HCTX_TYPE_DEFAULT],
> > +					dev->io_queues[HCTX_TYPE_READ],
> > +					dev->io_queues[HCTX_TYPE_POLL]);
> > +	return result;
> > +
> >  }
> >  
> >  static void nvme_del_queue_end(struct request *req, blk_status_t error)
> > -- 
> > 2.14.4
> > 
> 
> This way should be better given it covers both irq allocation failure
> and queue creation/initialization failure.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thinking about the patch further: after pci_alloc_irq_vectors_affinity()
returns, the queue count for non-polled queues can't be changed at will,
because we have to make sure all CPUs are spread across each queue type,
and the mapping has already been fixed by pci_alloc_irq_vectors_affinity().

So it looks like the approach in this patch may be wrong.

Thanks,
Ming

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-04  7:21     ` Ming Lei
@ 2019-01-04 15:53       ` Keith Busch
  2019-01-04 18:17         ` Christoph Hellwig
  2019-01-06  2:56         ` Ming Lei
  0 siblings, 2 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-04 15:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Fri, Jan 04, 2019 at 03:21:07PM +0800, Ming Lei wrote:
> Thinking about the patch further: after pci_alloc_irq_vectors_affinity()
> returns, the queue count for non-polled queues can't be changed at will,
> because we have to make sure all CPUs are spread across each queue type,
> and the mapping has already been fixed by pci_alloc_irq_vectors_affinity().
>
> So it looks like the approach in this patch may be wrong.

That's a bit of a problem, and not a new one. We always had to allocate
vectors before creating IRQ driven CQ's, but the vector affinity is
created before we know if the queue-pair can be created. Should the
queue creation fail, there may be CPUs that don't have a queue.

Does this mean the pci msi API is wrong? It seems like we'd need to
initially allocate vectors without PCI_IRQ_AFFINITY, then have the
kernel set affinity only after completing the queue-pair setup.

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-04 15:53       ` Keith Busch
@ 2019-01-04 18:17         ` Christoph Hellwig
  2019-01-04 18:35           ` Keith Busch
  2019-01-06  2:56         ` Ming Lei
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2019-01-04 18:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, Bjorn Helgaas, linux-pci

On Fri, Jan 04, 2019 at 08:53:24AM -0700, Keith Busch wrote:
> On Fri, Jan 04, 2019 at 03:21:07PM +0800, Ming Lei wrote:
> > Thinking about the patch further: after pci_alloc_irq_vectors_affinity()
> > returns, the queue count for non-polled queues can't be changed at will,
> > because we have to make sure all CPUs are spread across each queue type,
> > and the mapping has already been fixed by pci_alloc_irq_vectors_affinity().
> >
> > So it looks like the approach in this patch may be wrong.
> 
> That's a bit of a problem, and not a new one. We always had to allocate
> vectors before creating IRQ driven CQ's, but the vector affinity is
> created before we know if the queue-pair can be created. Should the
> queue creation fail, there may be CPUs that don't have a queue.
> 
> Does this mean the pci msi API is wrong? It seems like we'd need to
> initially allocate vectors without PCI_IRQ_AFFINITY, then have the
> kernel set affinity only after completing the queue-pair setup.

We can't just easily do that, as we want to allocate the memory for
the descriptors on the correct node.  But we can just free the
vectors and try again if we have to.

* Re: [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ reduce and retry
  2019-01-03 22:50 ` [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ " Keith Busch
  2019-01-04  2:41   ` Ming Lei
@ 2019-01-04 18:19   ` Christoph Hellwig
  2019-01-04 18:33     ` Keith Busch
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2019-01-04 18:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, Bjorn Helgaas, linux-pci

I can't say I am a huge fan of the complex callback.  If we just made
the number of read vs write queues a factor instead of individual
scalar numbers, we could just handle this in the irq code without
the callback, and the concept might actually be understandable by
mere humans.

* Re: [PATCHv2 4/4] nvme-pci: Use PCI to handle IRQ reduce and retry
  2019-01-04 18:19   ` Christoph Hellwig
@ 2019-01-04 18:33     ` Keith Busch
  0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-04 18:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Ming Lei, linux-nvme, Bjorn Helgaas,
	linux-pci

On Fri, Jan 04, 2019 at 07:19:38PM +0100, Christoph Hellwig wrote:
> I can't say I am a huge fan of the complex callback.  If we just made
> the number of read vs write queues a factor instead of individual
> scalar numbers, we could just handle this in the irq code without
> the callback, and the concept might actually be understandable by
> mere humans.

Okay, we could express this as a ratio. I'll explore that path a bit
more.

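A minimal sketch of the ratio idea (the 'set_ratio' field is
hypothetical and exists nowhere in this series):

  /* Rescale two sets to whatever nvecs is achievable from a fixed
   * read:write factor, so the irq core needs no driver callback. */
  static void irq_affinity_rescale(struct irq_affinity *affd, unsigned int nvecs)
  {
          unsigned int total = affd->set_ratio[0] + affd->set_ratio[1];

          affd->sets[0] = max(1U, nvecs * affd->set_ratio[0] / total);
          affd->sets[1] = nvecs - affd->sets[0];
  }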

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-04 18:17         ` Christoph Hellwig
@ 2019-01-04 18:35           ` Keith Busch
  0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-04 18:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, Sagi Grimberg, linux-nvme, Bjorn Helgaas,
	linux-pci

On Fri, Jan 04, 2019 at 07:17:26PM +0100, Christoph Hellwig wrote:
> On Fri, Jan 04, 2019 at 08:53:24AM -0700, Keith Busch wrote:
> > On Fri, Jan 04, 2019 at 03:21:07PM +0800, Ming Lei wrote:
> > > Thinking about the patch further: after pci_alloc_irq_vectors_affinity()
> > > returns, the queue count for non-polled queues can't be changed at will,
> > > because we have to make sure all CPUs are spread across each queue type,
> > > and the mapping has already been fixed by pci_alloc_irq_vectors_affinity().
> > >
> > > So it looks like the approach in this patch may be wrong.
> > 
> > That's a bit of a problem, and not a new one. We always had to allocate
> > vectors before creating IRQ driven CQ's, but the vector affinity is
> > created before we know if the queue-pair can be created. Should the
> > queue creation fail, there may be CPUs that don't have a queue.
> > 
> > Does this mean the pci msi API is wrong? It seems like we'd need to
> > initially allocate vectors without PCI_IRQ_AFFINITY, then have the
> > kernel set affinity only after completing the queue-pair setup.
> 
> We can't just easily do that, as we want to allocate the memory for
> the descriptors on the correct node.  But we can just free the
> vectors and try again if we have to.

I've come to the same realization that switching modes after allocation
can't be easily accommodated. Teardown and retry with a reduced queue
count looks like the easiest solution.

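A sketch of that teardown-and-retry direction (foo_create_queue_pairs()
and foo_recalc_sets() are hypothetical driver helpers, and reducing by
one is only one possible policy):

  for (;;) {
          foo_recalc_sets(&affd, nr);     /* keep the set sizes summing to nr */
          nvecs = pci_alloc_irq_vectors_affinity(pdev, nr, nr,
                          PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
          if (nvecs < 0)
                  return nvecs;
          if (foo_create_queue_pairs(dev, nvecs) == 0)
                  return nvecs;
          /* queue creation failed: free the managed vectors and retry
           * with a smaller count */
          pci_free_irq_vectors(pdev);
          if (--nr == 0)
                  return -EIO;
  }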

* Re: [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
  2019-01-03 22:50 ` [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry Keith Busch
  2019-01-04  2:45   ` Ming Lei
@ 2019-01-04 22:35   ` Bjorn Helgaas
  2019-01-04 22:56     ` Keith Busch
  1 sibling, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2019-01-04 22:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, linux-pci

On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> The struct irq_affinity nr_sets forced the driver to handle reducing the
> vector count on allocation failures because the set distribution counts
> are driver specific. The change to this API requires very different usage
> than before, and introduced new error corner cases that weren't being
> handled. It is also less efficient: the driver doesn't actually know
> what vector count it should use, since it only sees the error code and
> can only reduce by one instead of going straight to a possible vector
> count like PCI is able to do.
> 
> Provide a driver specific callback for managed irq set creation so that
> PCI can take a min and max vectors as before to handle the reduce and
> retry logic.
> 
> The usage is not particularly obvious for this new feature, so append
> documentation for driver usage.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi.c               | 20 ++++++--------------
>  include/linux/interrupt.h       |  5 +++++
>  3 files changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 618e13d5e276..391b1f369138 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -98,7 +98,41 @@ The flags argument is used to specify which type of interrupt can be used
>  by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
>  A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
>  any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
> -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
> +pci_alloc_irq_vectors() will spread the interrupts around the available
> +CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
> +managed by the kernel, and are not tunable from user space like other
> +vectors.
> +
> +When your driver requires a more complex vector affinity configuration
> +than a default spread of all vectors, the driver may use the following
> +function:
> +
> +  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> +				   unsigned int max_vecs, unsigned int flags,
> +				   const struct irq_affinity *affd);
> +
> +The 'struct irq_affinity *affd' allows a driver to specify additional
> +characteristics for how a driver wants the vector management to occur. The
> +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> +wants to not participate in kernel managed affinities, and whether those
> +special vectors are at the beginning or the end of the vector space.

How are the pre_vectors and post_vectors handled?  Do they get
assigned to random CPUs?  Current CPU?  Are their assignments tunable
from user space?

> +It may also be the case that a driver wants multiple sets of fully
> +affinitized vectors. For example, a single PCI function may provide
> +different high performance services that want full CPU affinity for each
> +service independent of other services. In this case, the driver may use
> +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> +vectors need to be spread across all the CPUs, and fill in the 'sets'
> +array to say how many vectors the driver wants in each set.

I think the issue here is IRQ vectors, and "services" and whether
they're high performance are unnecessary concepts.

What does irq_affinity.sets point to?  I guess it's a table of
integers where the table size is the number of sets and each entry is
the number of vectors in the set?

So we'd have something like this:

  pre_vectors     # vectors [0..pre_vectors) (pre_vectors >= 0)
  set 0           # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
  set 1           # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
  ...
  post_vectors    # vectors [pre_vectors+set0+...+setN..pre_vectors+set0+...+setN+post_vectors)

where the vectors in set0 are spread across all CPUs, those in set1
are independently spread across all CPUs, etc?

I would guess there may be device-specific restrictions on the mapping
of these vectors to sets, so the PCI core probably can't assume the
sets can be of arbitrary size, contiguous, etc.

> +When using multiple affinity 'sets', the error handling for vector
> +reduction and retry becomes more complicated since the PCI core
> +doesn't know how to redistribute the vector count across the sets. In
> +order to provide this error handling, the driver must also provide the
> +'recalc_sets()' callback and set the 'priv' data needed for the driver
> +specific vector distribution. The driver's callback is responsible for
> +ensuring the sum of the vector counts across its sets matches the new
> +vector count that PCI can allocate.
>  
>  To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
>  vectors, use the following function:
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 7a1c8a09efa5..b93ac49be18d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msi_enabled))
>  		return -EINVAL;
>  
> @@ -1061,6 +1054,9 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +
>  		rc = msi_capability_init(dev, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> @@ -1093,13 +1089,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * supported vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msix_enabled))
>  		return -EINVAL;
>  
> @@ -1110,6 +1099,9 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +
>  		rc = __pci_enable_msix(dev, entries, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..01c06829ff43 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -249,12 +249,17 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @nr_sets:		Length of passed in *sets array
>   * @sets:		Number of affinitized sets
> + * @recalc_sets:	Recalculate sets if the previously requested allocation
> + *			failed
> + * @priv:		Driver private data
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
>  	int	nr_sets;
>  	int	*sets;
> +	void	(*recalc_sets)(struct irq_affinity *, unsigned int);
> +	void	*priv;
>  };
>  
>  /**
> -- 
> 2.14.4
> 

* Re: [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
  2019-01-04 22:35   ` Bjorn Helgaas
@ 2019-01-04 22:56     ` Keith Busch
  0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2019-01-04 22:56 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei,
	linux-nvme, linux-pci

On Fri, Jan 04, 2019 at 04:35:31PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> > +The 'struct irq_affinity *affd' allows a driver to specify additional
> > +characteristics for how a driver wants the vector management to occur. The
> > +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> > +wants to not participate in kernel managed affinities, and whether those
> > +special vectors are at the beginning or the end of the vector space.
> 
> How are the pre_vectors and post_vectors handled?  Do they get
> assigned to random CPUs?  Current CPU?  Are their assignments tunable
> from user space?

Point taken. Those do get assigned a default mask, but they are also
user tunable and kernel migratable when CPUs go offline/online.
 
> > +It may also be the case that a driver wants multiple sets of fully
> > +affinitized vectors. For example, a single PCI function may provide
> > +different high performance services that want full CPU affinity for each
> > +service independent of other services. In this case, the driver may use
> > +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> > +vectors need to be spread across all the CPUs, and fill in the 'sets'
> > +array to say how many vectors the driver wants in each set.
> 
> I think the issue here is IRQ vectors, and "services" and whether
> they're high performance are unnecessary concepts.

It's really intended for when your device has resources optimally accessed
in a per-cpu manner. I can rephrase this description to make that clearer.

> What does irq_affinity.sets point to?  I guess it's a table of
> integers where the table size is the number of sets and each entry is
> the number of vectors in the set?
>
> So we'd have something like this:
> 
>   pre_vectors     # vectors [0..pre_vectors) (pre_vectors >= 0)
>   set 0           # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
>   set 1           # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
>   ...
>   post_vectors    # vectors [pre_vectors+set0..pre_vectors+set0+set1+setN+post_vectors)
> 
> where the vectors in set0 are spread across all CPUs, those in set1
> are independently spread across all CPUs, etc?
>
> I would guess there may be device-specific restrictions on the mapping
> of of these vectors to sets, so the PCI core probably can't assume the
> sets can be of arbitrary size, contiguous, etc.

I think it's fair to say the caller wants vectors allocated and each set
affinitized contiguously such that each set starts after the previous
one ends. That works great with how NVMe wants to use it, at least. If
there is really any other way a device driver wants it, I can't see how
that can be easily accommodated.

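A concrete instance of the layout discussed above (the values are
illustrative): with pre_vectors = 1, sets = {3, 4} and post_vectors = 0,
the 8 vectors are laid out as

  vector  0       pre vector, e.g. an admin interrupt
  vectors 1..3    set 0, spread across all CPUs
  vectors 4..7    set 1, independently spread across all CPUs

with each set allocated contiguously, starting where the previous one
ends.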

* Re: [PATCHv2 2/4] nvme-pci: Distribute io queue types after creation
  2019-01-04 15:53       ` Keith Busch
  2019-01-04 18:17         ` Christoph Hellwig
@ 2019-01-06  2:56         ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-01-06  2:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Bjorn Helgaas, linux-pci

On Fri, Jan 04, 2019 at 08:53:24AM -0700, Keith Busch wrote:
> On Fri, Jan 04, 2019 at 03:21:07PM +0800, Ming Lei wrote:
> > Thinking about the patch further: after pci_alloc_irq_vectors_affinity()
> > returns, the queue count for non-polled queues can't be changed at will,
> > because we have to make sure all CPUs are spread across each queue type,
> > and the mapping has already been fixed by pci_alloc_irq_vectors_affinity().
> >
> > So it looks like the approach in this patch may be wrong.
> 
> That's a bit of a problem, and not a new one. We always had to allocate
> vectors before creating IRQ driven CQ's, but the vector affinity is
> created before we know if the queue-pair can be created. Should the
> queue creation fail, there may be CPUs that don't have a queue.
> 
> Does this mean the pci msi API is wrong? It seems like we'd need to
> initially allocate vectors without PCI_IRQ_AFFINITY, then have the
> kernel set affinity only after completing the queue-pair setup.

I think this kind of API style (two stages) is cleaner and more
error-immune.

- pci_alloc_irq_vectors() is only for allocating irq vectors
- pci_set_irq_vectors_affinity() is for spreading affinity at will


Thanks,
Ming

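A sketch of the proposed two-stage flow (pci_set_irq_vectors_affinity()
is the hypothetical second-stage API suggested above, and
foo_create_queue_pairs() is a hypothetical driver helper):

  /* stage 1: only allocate vectors; no managed affinity yet */
  nvecs = pci_alloc_irq_vectors(pdev, min_vecs, max_vecs, PCI_IRQ_ALL_TYPES);
  if (nvecs < 0)
          return nvecs;

  /* create the queue pairs; a failure here only reduces the queue
   * counts, no vectors need to be freed and reallocated */
  foo_create_queue_pairs(dev, nvecs);

  /* stage 2 (hypothetical): spread affinity across the queues once
   * the final counts are known */
  pci_set_irq_vectors_affinity(pdev, &affd);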