* [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread
@ 2019-02-13 10:50 Ming Lei
  2019-02-13 10:50 ` [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Hi,

Currently the driver provides pre-calculated set vectors for allocating
& spreading vectors. This only works when the driver passes the same
'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(), and it
also requires the driver to retry the allocation & spreading.

As Bjorn and Keith mentioned, the current usage & interface for IRQ sets
is a bit awkward because the retrying should have been avoided by
providing one reasonable 'min_vecs'. However, if 'min_vecs' isn't the
same as 'max_vecs', the number of allocated vectors is unknown before
calling pci_alloc_irq_vectors_affinity(), so each set's vectors can't be
pre-calculated.

Add a new callback, .calc_sets, to 'struct irq_affinity' so that the
driver can calculate set vectors after the IRQ vectors are allocated and
before the IRQ vectors are spread.
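
As a rough sketch only (the foo_* names below are made up for
illustration, and the field layout is the one this series introduces),
a driver would use the new callback along these lines:

#include <linux/interrupt.h>
#include <linux/pci.h>

struct foo_dev {
	struct pci_dev *pdev;
	int nr_fast_queues;
	int nr_slow_queues;
};

/* Invoked by the core once the final number of vectors is known */
static void foo_calc_sets(struct irq_affinity *affd, int nvecs)
{
	struct foo_dev *foo = affd->priv;
	int nr_io = nvecs - 1;	/* minus the one pre_vectors vector */

	/* carve the spreadable vectors into two queue groups */
	foo->nr_fast_queues = nr_io - nr_io / 4;
	foo->nr_slow_queues = nr_io - foo->nr_fast_queues;

	affd->set_vectors[0] = foo->nr_fast_queues;
	affd->set_vectors[1] = foo->nr_slow_queues;
	affd->nr_sets = foo->nr_slow_queues ? 2 : 1;
}

static int foo_setup_irqs(struct foo_dev *foo, unsigned int max_vecs)
{
	struct irq_affinity affd = {
		.pre_vectors	= 1,		/* e.g. one admin vector */
		.calc_sets	= foo_calc_sets,
		.priv		= foo,
	};

	/* 'min_vecs' can differ from 'max_vecs' now, no retry loop needed */
	return pci_alloc_irq_vectors_affinity(foo->pdev, 1, max_vecs,
			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
}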

V3:
	- don't mark 'affd' as const in related functions
	- add sanity check on affd->nr_sets
	- remove the local variable of 'nr_sets' in irq_create_affinity_masks
	- set .nr_sets as 2 in nvme
	- update comment in msi.c

V2:
	- add .calc_sets instead of .setup_affinity(), which could easily
	  be abused by drivers

Ming Lei (5):
  genirq/affinity: don't mark 'affd' as const
  genirq/affinity: store irq set vectors in 'struct irq_affinity'
  genirq/affinity: add new callback for caculating set vectors
  nvme-pci: avoid irq allocation retrying via .calc_sets
  genirq/affinity: Document .calc_sets as required in case of multiple
    sets

 drivers/nvme/host/pci.c   | 63 ++++++++++++-----------------------------------
 drivers/pci/msi.c         | 34 ++++++++++++++-----------
 include/linux/interrupt.h | 13 +++++++---
 include/linux/pci.h       |  4 +--
 kernel/irq/affinity.c     | 26 ++++++++++++-------
 5 files changed, 64 insertions(+), 76 deletions(-)

-- 
2.9.5



* [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
@ 2019-02-13 10:50 ` Ming Lei
  2019-02-13 15:04   ` Bjorn Helgaas
  2019-02-13 10:50 ` [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity' Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Currently all parameters in 'affd' are read-only, so 'affd' is marked
as const in both pci_alloc_irq_vectors_affinity() and irq_create_affinity_masks().

We have to ask driver to re-caculate set vectors after the whole IRQ
vectors are allocated later, and the result needs to be stored in 'affd'.
Also both the two interfaces are core APIs, which should be trusted.

So don't mark 'affd' as const both pci_alloc_irq_vectors_affinity() and
irq_create_affinity_masks().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/pci/msi.c         | 18 +++++++++---------
 include/linux/interrupt.h |  2 +-
 include/linux/pci.h       |  4 ++--
 kernel/irq/affinity.c     |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4c0b47867258..96978459e2a0 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -532,7 +532,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 }
 
 static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
 {
 	struct irq_affinity_desc *masks = NULL;
 	struct msi_desc *entry;
@@ -597,7 +597,7 @@ static int msi_verify_entries(struct pci_dev *dev)
  * which could have been allocated.
  */
 static int msi_capability_init(struct pci_dev *dev, int nvec,
-			       const struct irq_affinity *affd)
+			       struct irq_affinity *affd)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -669,7 +669,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
 
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			      struct msix_entry *entries, int nvec,
-			      const struct irq_affinity *affd)
+			      struct irq_affinity *affd)
 {
 	struct irq_affinity_desc *curmsk, *masks = NULL;
 	struct msi_desc *entry;
@@ -736,7 +736,7 @@ static void msix_program_entries(struct pci_dev *dev,
  * requested MSI-X entries with allocated irqs or non-zero for otherwise.
  **/
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
-				int nvec, const struct irq_affinity *affd)
+				int nvec, struct irq_affinity *affd)
 {
 	int ret;
 	u16 control;
@@ -932,7 +932,7 @@ int pci_msix_vec_count(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_msix_vec_count);
 
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-			     int nvec, const struct irq_affinity *affd)
+			     int nvec, struct irq_affinity *affd)
 {
 	int nr_entries;
 	int i, j;
@@ -1018,7 +1018,7 @@ int pci_msi_enabled(void)
 EXPORT_SYMBOL(pci_msi_enabled);
 
 static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
-				  const struct irq_affinity *affd)
+				  struct irq_affinity *affd)
 {
 	int nvec;
 	int rc;
@@ -1086,7 +1086,7 @@ EXPORT_SYMBOL(pci_enable_msi);
 
 static int __pci_enable_msix_range(struct pci_dev *dev,
 				   struct msix_entry *entries, int minvec,
-				   int maxvec, const struct irq_affinity *affd)
+				   int maxvec, struct irq_affinity *affd)
 {
 	int rc, nvec = maxvec;
 
@@ -1165,9 +1165,9 @@ EXPORT_SYMBOL(pci_enable_msix_range);
  */
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
-				   const struct irq_affinity *affd)
+				   struct irq_affinity *affd)
 {
-	static const struct irq_affinity msi_default_affd;
+	struct irq_affinity msi_default_affd = {0};
 	int msix_vecs = -ENOSPC;
 	int msi_vecs = -ENOSPC;
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 7c9434652f36..1ed1014c9684 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -332,7 +332,7 @@ extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
 struct irq_affinity_desc *
-irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
+irq_create_affinity_masks(int nvec, struct irq_affinity *affd);
 
 int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 40b327b814aa..4eca42cf611b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1396,7 +1396,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 }
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
-				   const struct irq_affinity *affd);
+				   struct irq_affinity *affd);
 
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
@@ -1422,7 +1422,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 static inline int
 pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 			       unsigned int max_vecs, unsigned int flags,
-			       const struct irq_affinity *aff_desc)
+			       struct irq_affinity *aff_desc)
 {
 	if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1 && dev->irq)
 		return 1;
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 118b66d64a53..9200d3b26f7d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -239,7 +239,7 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
  * Returns the irq_affinity_desc pointer or NULL if allocation failed.
  */
 struct irq_affinity_desc *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
 {
 	int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
 	int curvec, usedvecs;
-- 
2.9.5



* [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity'
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
  2019-02-13 10:50 ` [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const Ming Lei
@ 2019-02-13 10:50 ` Ming Lei
  2019-02-13 15:07   ` Bjorn Helgaas
  2019-02-13 10:50 ` [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Currently the array of irq set vectors is provided by driver.

irq_create_affinity_masks() can be simplied a bit by treating the
non-irq-set case as single irq set.

So move this array into 'struct irq_affinity', and pre-define the max
set number as 4, which should be enough for normal cases.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c   |  5 ++---
 include/linux/interrupt.h |  6 ++++--
 kernel/irq/affinity.c     | 18 +++++++++++-------
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 022ea1ee63f8..0086bdf80ea1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2081,12 +2081,11 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int irq_sets[2];
 	struct irq_affinity affd = {
 		.pre_vectors = 1,
-		.nr_sets = ARRAY_SIZE(irq_sets),
-		.sets = irq_sets,
+		.nr_sets = 2,
 	};
+	int *irq_sets = affd.set_vectors;
 	int result = 0;
 	unsigned int irq_queues, this_p_queues;
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1ed1014c9684..a20150627a32 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -259,6 +259,8 @@ struct irq_affinity_notify {
 	void (*release)(struct kref *ref);
 };
 
+#define	IRQ_MAX_SETS  4
+
 /**
  * struct irq_affinity - Description for automatic irq affinity assignements
  * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
@@ -266,13 +268,13 @@ struct irq_affinity_notify {
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
  * @nr_sets:		Length of passed in *sets array
- * @sets:		Number of affinitized sets
+ * @set_vectors:	Number of affinitized sets
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
 	int	nr_sets;
-	int	*sets;
+	int	set_vectors[IRQ_MAX_SETS];
 };
 
 /**
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 9200d3b26f7d..b868b9d3df7f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -244,7 +244,7 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
 	int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
 	int curvec, usedvecs;
 	struct irq_affinity_desc *masks = NULL;
-	int i, nr_sets;
+	int i;
 
 	/*
 	 * If there aren't any vectors left after applying the pre/post
@@ -253,6 +253,9 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
 	if (nvecs == affd->pre_vectors + affd->post_vectors)
 		return NULL;
 
+	if (affd->nr_sets > IRQ_MAX_SETS)
+		return NULL;
+
 	masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
 	if (!masks)
 		return NULL;
@@ -264,12 +267,13 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
 	 * Spread on present CPUs starting from affd->pre_vectors. If we
 	 * have multiple sets, build each sets affinity mask separately.
 	 */
-	nr_sets = affd->nr_sets;
-	if (!nr_sets)
-		nr_sets = 1;
+	if (!affd->nr_sets) {
+		affd->nr_sets = 1;
+		affd->set_vectors[0] = affvecs;
+	}
 
-	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
-		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
+	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
+		int this_vecs = affd->set_vectors[i];
 		int ret;
 
 		ret = irq_build_affinity_masks(affd, curvec, this_vecs,
@@ -316,7 +320,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 		int i;
 
 		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-			set_vecs += affd->sets[i];
+			set_vecs += affd->set_vectors[i];
 	} else {
 		get_online_cpus();
 		set_vecs = cpumask_weight(cpu_possible_mask);
-- 
2.9.5



* [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
  2019-02-13 10:50 ` [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const Ming Lei
  2019-02-13 10:50 ` [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity' Ming Lei
@ 2019-02-13 10:50 ` Ming Lei
  2019-02-13 15:11   ` Bjorn Helgaas
  2019-02-13 10:50 ` [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Currently pre-caculated set vectors are provided by driver for
allocating & spread vectors. This way only works when drivers passes
same 'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(),
also requires driver to retry the allocating & spread.

As Bjorn and Keith mentioned, the current usage & interface for irq sets
is a bit awkward because the retrying should have been avoided by providing
one resonable 'min_vecs'. However, if 'min_vecs' isn't same with
'max_vecs', number of the allocated vectors is unknown before calling
pci_alloc_irq_vectors_affinity(), then each set's vectors can't be
pre-caculated.

Add a new callback of .calc_sets into 'struct irq_affinity' so that
driver can caculate set vectors after IRQ vector is allocated and
before spread IRQ vectors. Add 'priv' so that driver may retrieve
its private data via the 'struct irq_affinity'.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/interrupt.h | 4 ++++
 kernel/irq/affinity.c     | 8 ++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a20150627a32..7a27f6ba1f2f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -269,12 +269,16 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @nr_sets:		Length of passed in *sets array
  * @set_vectors:	Number of affinitized sets
+ * @calc_sets:		Callback for caculating set vectors
+ * @priv:		Private data of @calc_sets
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
 	int	nr_sets;
 	int	set_vectors[IRQ_MAX_SETS];
+	void	(*calc_sets)(struct irq_affinity *, int nvecs);
+	void	*priv;
 };
 
 /**
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index b868b9d3df7f..2341b1f005fa 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -267,7 +267,9 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
 	 * Spread on present CPUs starting from affd->pre_vectors. If we
 	 * have multiple sets, build each sets affinity mask separately.
 	 */
-	if (!affd->nr_sets) {
+	if (affd->calc_sets) {
+		affd->calc_sets(affd, nvecs);
+	} else if (!affd->nr_sets) {
 		affd->nr_sets = 1;
 		affd->set_vectors[0] = affvecs;
 	}
@@ -316,7 +318,9 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 	if (resv > minvec)
 		return 0;
 
-	if (affd->nr_sets) {
+	if (affd->calc_sets) {
+		set_vecs = vecs;
+	} else if (affd->nr_sets) {
 		int i;
 
 		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-- 
2.9.5



* [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
                   ` (2 preceding siblings ...)
  2019-02-13 10:50 ` [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors Ming Lei
@ 2019-02-13 10:50 ` Ming Lei
  2019-02-13 15:13   ` Bjorn Helgaas
  2019-02-13 10:50 ` [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets Ming Lei
  2019-02-13 14:36 ` [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Jens Axboe
  5 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Currently pre-caculate each set vectors, and this way requires same
'max_vecs' and 'min_vecs' passed to pci_alloc_irq_vectors_affinity(),
then nvme_setup_irqs() has to retry in case of allocation failure.

This usage & interface is a bit awkward because the retry should have
been avoided by providing one reasonable 'min_vecs'.

Implement the callback of .calc_sets, so that pci_alloc_irq_vectors_affinity()
can calculate each set's vector after IRQ vectors is allocated and
before spread IRQ, then NVMe's retry in case of irq allocation failure
can be removed.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 62 +++++++++++++------------------------------------
 1 file changed, 16 insertions(+), 46 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0086bdf80ea1..8c51252a897e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2078,14 +2078,25 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
 	}
 }
 
+static void nvme_calc_irq_sets(struct irq_affinity *affd, int nvecs)
+{
+	struct nvme_dev *dev = affd->priv;
+
+	nvme_calc_io_queues(dev, nvecs);
+
+	affd->set_vectors[HCTX_TYPE_DEFAULT] = dev->io_queues[HCTX_TYPE_DEFAULT];
+	affd->set_vectors[HCTX_TYPE_READ] = dev->io_queues[HCTX_TYPE_READ];
+	affd->nr_sets = 2;
+}
+
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	struct irq_affinity affd = {
 		.pre_vectors = 1,
-		.nr_sets = 2,
+		.calc_sets = nvme_calc_irq_sets,
+		.priv = dev,
 	};
-	int *irq_sets = affd.set_vectors;
 	int result = 0;
 	unsigned int irq_queues, this_p_queues;
 
@@ -2102,50 +2113,8 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	}
 	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
 
-	/*
-	 * For irq sets, we have to ask for minvec == maxvec. This passes
-	 * any reduction back to us, so we can adjust our queue counts and
-	 * IRQ vector needs.
-	 */
-	do {
-		nvme_calc_io_queues(dev, irq_queues);
-		irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
-		irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
-		if (!irq_sets[1])
-			affd.nr_sets = 1;
-
-		/*
-		 * If we got a failure and we're down to asking for just
-		 * 1 + 1 queues, just ask for a single vector. We'll share
-		 * that between the single IO queue and the admin queue.
-		 * Otherwise, we assign one independent vector to admin queue.
-		 */
-		if (irq_queues > 1)
-			irq_queues = irq_sets[0] + irq_sets[1] + 1;
-
-		result = pci_alloc_irq_vectors_affinity(pdev, irq_queues,
-				irq_queues,
-				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
-
-		/*
-		 * Need to reduce our vec counts. If we get ENOSPC, the
-		 * platform should support mulitple vecs, we just need
-		 * to decrease our ask. If we get EINVAL, the platform
-		 * likely does not. Back down to ask for just one vector.
-		 */
-		if (result == -ENOSPC) {
-			irq_queues--;
-			if (!irq_queues)
-				return result;
-			continue;
-		} else if (result == -EINVAL) {
-			irq_queues = 1;
-			continue;
-		} else if (result <= 0)
-			return -EIO;
-		break;
-	} while (1);
-
+	result = pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 	return result;
 }
 
@@ -3021,6 +2990,7 @@ static struct pci_driver nvme_driver = {
 
 static int __init nvme_init(void)
 {
+	BUILD_BUG_ON(2 > IRQ_MAX_SETS);
 	return pci_register_driver(&nvme_driver);
 }
 
-- 
2.9.5



* [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
                   ` (3 preceding siblings ...)
  2019-02-13 10:50 ` [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets Ming Lei
@ 2019-02-13 10:50 ` Ming Lei
  2019-02-13 15:16   ` Bjorn Helgaas
  2019-02-13 14:36 ` [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Jens Axboe
  5 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-02-13 10:50 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, linux-kernel,
	linux-pci, Keith Busch, Ming Lei

Now NVMe has implemented the .calc_sets callback for caculating each
set's vectors.

For other cases of multiple IRQ sets, pre-caculating each set's vectors
before allocating IRQ vectors can't work because the whole vectors
number is unknow at that time.

So document .calc_sets as required explicitly for multiple sets.
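
For example, a configuration like the following (the name is only for
illustration) is rejected now, because it asks for multiple sets without
supplying the callback:

	struct irq_affinity bad_affd = {
		.pre_vectors	= 1,
		.nr_sets	= 2,	/* more than one set ... */
		/*
		 * ... but no .calc_sets(), so __pci_enable_msi_range() and
		 * __pci_enable_msix_range() return -EINVAL
		 */
	};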

Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/pci/msi.c         | 16 ++++++++++------
 include/linux/interrupt.h |  3 ++-
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 96978459e2a0..199d708b4099 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,10 +1036,12 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 		return -ERANGE;
 
 	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * vectors. The caller needs to handle that.
+	 * If the caller requests multiple sets of IRQs where each set
+	 * requires different affinity, it must also supply a ->calc_sets()
+	 * callback to compute vectors for each set after whole vectors are
+	 * allocated.
 	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
+	if (affd && affd->nr_sets > 1 && !affd->calc_sets)
 		return -EINVAL;
 
 	if (WARN_ON_ONCE(dev->msi_enabled))
@@ -1094,10 +1096,12 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 		return -ERANGE;
 
 	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * supported vectors. The caller needs to handle that.
+	 * If the caller requests multiple sets of IRQs where each set
+	 * requires different affinity, it must also supply a ->calc_sets()
+	 * callback to compute vectors for each set after whole vectors are
+	 * allocated.
 	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
+	if (affd && affd->nr_sets > 1 && !affd->calc_sets)
 		return -EINVAL;
 
 	if (WARN_ON_ONCE(dev->msix_enabled))
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 7a27f6ba1f2f..a053f7fb0ff1 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -269,7 +269,8 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @nr_sets:		Length of passed in *sets array
  * @set_vectors:	Number of affinitized sets
- * @calc_sets:		Callback for caculating set vectors
+ * @calc_sets:		Callback for caculating set vectors, required for
+ * 			multiple irq sets.
  * @priv:		Private data of @calc_sets
  */
 struct irq_affinity {
-- 
2.9.5



* Re: [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread
  2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
                   ` (4 preceding siblings ...)
  2019-02-13 10:50 ` [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets Ming Lei
@ 2019-02-13 14:36 ` Jens Axboe
  5 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2019-02-13 14:36 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Bjorn Helgaas, Thomas Gleixner
  Cc: linux-block, Sagi Grimberg, linux-nvme, linux-kernel, linux-pci,
	Keith Busch

On 2/13/19 3:50 AM, Ming Lei wrote:
> Hi,
> 
> Currently the driver provides pre-calculated set vectors for allocating
> & spreading vectors. This only works when the driver passes the same
> 'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(), and it
> also requires the driver to retry the allocation & spreading.
> 
> As Bjorn and Keith mentioned, the current usage & interface for IRQ sets
> is a bit awkward because the retrying should have been avoided by
> providing one reasonable 'min_vecs'. However, if 'min_vecs' isn't the
> same as 'max_vecs', the number of allocated vectors is unknown before
> calling pci_alloc_irq_vectors_affinity(), so each set's vectors can't be
> pre-calculated.
> 
> Add a new callback, .calc_sets, to 'struct irq_affinity' so that the
> driver can calculate set vectors after the IRQ vectors are allocated and
> before the IRQ vectors are spread.

Much improved, thanks for doing this work. You can add

Reviewed-by: Jens Axboe <axboe@kernel.dk>

if you want.

-- 
Jens Axboe



* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 10:50 ` [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const Ming Lei
@ 2019-02-13 15:04   ` Bjorn Helgaas
  2019-02-13 20:56     ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2019-02-13 15:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> Currently all parameters in 'affd' are read-only, so 'affd' is marked
> as const in both pci_alloc_irq_vectors_affinity() and irq_create_affinity_masks().

s/all parameters in 'affd'/the contents of '*affd'/

> We have to ask driver to re-caculate set vectors after the whole IRQ
> vectors are allocated later, and the result needs to be stored in 'affd'.
> Also both the two interfaces are core APIs, which should be trusted.

s/re-caculate/recalculate/
s/stored in 'affd'/stored in '*affd'/
s/both the two/both/

This is a little confusing because you're talking about both "IRQ
vectors" and these other "set vectors", which I think are different
things.  I assume the "set vectors" are cpumasks showing the affinity
of the IRQ vectors with some CPUs?

AFAICT, *this* patch doesn't add anything that writes to *affd.  I
think the removal of "const" should be in the same patch that makes
the removal necessary.

> So don't mark 'affd' as const both pci_alloc_irq_vectors_affinity() and
> irq_create_affinity_masks().
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/pci/msi.c         | 18 +++++++++---------
>  include/linux/interrupt.h |  2 +-
>  include/linux/pci.h       |  4 ++--
>  kernel/irq/affinity.c     |  2 +-
>  4 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 4c0b47867258..96978459e2a0 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -532,7 +532,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
>  }
>  
>  static struct msi_desc *
> -msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
> +msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
>  {
>  	struct irq_affinity_desc *masks = NULL;
>  	struct msi_desc *entry;
> @@ -597,7 +597,7 @@ static int msi_verify_entries(struct pci_dev *dev)
>   * which could have been allocated.
>   */
>  static int msi_capability_init(struct pci_dev *dev, int nvec,
> -			       const struct irq_affinity *affd)
> +			       struct irq_affinity *affd)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -669,7 +669,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
>  
>  static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			      struct msix_entry *entries, int nvec,
> -			      const struct irq_affinity *affd)
> +			      struct irq_affinity *affd)
>  {
>  	struct irq_affinity_desc *curmsk, *masks = NULL;
>  	struct msi_desc *entry;
> @@ -736,7 +736,7 @@ static void msix_program_entries(struct pci_dev *dev,
>   * requested MSI-X entries with allocated irqs or non-zero for otherwise.
>   **/
>  static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
> -				int nvec, const struct irq_affinity *affd)
> +				int nvec, struct irq_affinity *affd)
>  {
>  	int ret;
>  	u16 control;
> @@ -932,7 +932,7 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  EXPORT_SYMBOL(pci_msix_vec_count);
>  
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> -			     int nvec, const struct irq_affinity *affd)
> +			     int nvec, struct irq_affinity *affd)
>  {
>  	int nr_entries;
>  	int i, j;
> @@ -1018,7 +1018,7 @@ int pci_msi_enabled(void)
>  EXPORT_SYMBOL(pci_msi_enabled);
>  
>  static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
> -				  const struct irq_affinity *affd)
> +				  struct irq_affinity *affd)
>  {
>  	int nvec;
>  	int rc;
> @@ -1086,7 +1086,7 @@ EXPORT_SYMBOL(pci_enable_msi);
>  
>  static int __pci_enable_msix_range(struct pci_dev *dev,
>  				   struct msix_entry *entries, int minvec,
> -				   int maxvec, const struct irq_affinity *affd)
> +				   int maxvec, struct irq_affinity *affd)
>  {
>  	int rc, nvec = maxvec;
>  
> @@ -1165,9 +1165,9 @@ EXPORT_SYMBOL(pci_enable_msix_range);
>   */
>  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   unsigned int max_vecs, unsigned int flags,
> -				   const struct irq_affinity *affd)
> +				   struct irq_affinity *affd)
>  {
> -	static const struct irq_affinity msi_default_affd;
> +	struct irq_affinity msi_default_affd = {0};
>  	int msix_vecs = -ENOSPC;
>  	int msi_vecs = -ENOSPC;
>  
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 7c9434652f36..1ed1014c9684 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -332,7 +332,7 @@ extern int
>  irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
>  
>  struct irq_affinity_desc *
> -irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
> +irq_create_affinity_masks(int nvec, struct irq_affinity *affd);
>  
>  int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd);
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 40b327b814aa..4eca42cf611b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1396,7 +1396,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  }
>  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   unsigned int max_vecs, unsigned int flags,
> -				   const struct irq_affinity *affd);
> +				   struct irq_affinity *affd);
>  
>  void pci_free_irq_vectors(struct pci_dev *dev);
>  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
> @@ -1422,7 +1422,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  static inline int
>  pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  			       unsigned int max_vecs, unsigned int flags,
> -			       const struct irq_affinity *aff_desc)
> +			       struct irq_affinity *aff_desc)
>  {
>  	if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1 && dev->irq)
>  		return 1;
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 118b66d64a53..9200d3b26f7d 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -239,7 +239,7 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
>   * Returns the irq_affinity_desc pointer or NULL if allocation failed.
>   */
>  struct irq_affinity_desc *
> -irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
> +irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
>  {
>  	int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
>  	int curvec, usedvecs;
> -- 
> 2.9.5
> 


* Re: [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity'
  2019-02-13 10:50 ` [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity' Ming Lei
@ 2019-02-13 15:07   ` Bjorn Helgaas
  0 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2019-02-13 15:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, Feb 13, 2019 at 06:50:38PM +0800, Ming Lei wrote:
> Currently the array of irq set vectors is provided by driver.
> 
> irq_create_affinity_masks() can be simplied a bit by treating the
> non-irq-set case as single irq set.

s/simplied a bit/simplified/

> So move this array into 'struct irq_affinity', and pre-define the max
> set number as 4, which should be enough for normal cases.

s/irq/IRQ/
You have a real mixture of capitalization across the changelogs.

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c   |  5 ++---
>  include/linux/interrupt.h |  6 ++++--
>  kernel/irq/affinity.c     | 18 +++++++++++-------
>  3 files changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 022ea1ee63f8..0086bdf80ea1 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2081,12 +2081,11 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
>  static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
> -	int irq_sets[2];
>  	struct irq_affinity affd = {
>  		.pre_vectors = 1,
> -		.nr_sets = ARRAY_SIZE(irq_sets),
> -		.sets = irq_sets,
> +		.nr_sets = 2,
>  	};
> +	int *irq_sets = affd.set_vectors;
>  	int result = 0;
>  	unsigned int irq_queues, this_p_queues;
>  
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 1ed1014c9684..a20150627a32 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -259,6 +259,8 @@ struct irq_affinity_notify {
>  	void (*release)(struct kref *ref);
>  };
>  
> +#define	IRQ_MAX_SETS  4

This is a pretty generic name.  Maybe it should include a hint that it's
related to affinity?

>  /**
>   * struct irq_affinity - Description for automatic irq affinity assignements
>   * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
> @@ -266,13 +268,13 @@ struct irq_affinity_notify {
>   * @post_vectors:	Don't apply affinity to @post_vectors at end of
>   *			the MSI(-X) vector space
>   * @nr_sets:		Length of passed in *sets array
> - * @sets:		Number of affinitized sets
> + * @set_vectors:	Number of affinitized sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
>  	int	nr_sets;
> -	int	*sets;
> +	int	set_vectors[IRQ_MAX_SETS];
>  };
>  
>  /**
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 9200d3b26f7d..b868b9d3df7f 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -244,7 +244,7 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
>  	int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
>  	int curvec, usedvecs;
>  	struct irq_affinity_desc *masks = NULL;
> -	int i, nr_sets;
> +	int i;
>  
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -253,6 +253,9 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
>  	if (nvecs == affd->pre_vectors + affd->post_vectors)
>  		return NULL;
>  
> +	if (affd->nr_sets > IRQ_MAX_SETS)
> +		return NULL;
> +
>  	masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
>  	if (!masks)
>  		return NULL;
> @@ -264,12 +267,13 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
>  	 * Spread on present CPUs starting from affd->pre_vectors. If we
>  	 * have multiple sets, build each sets affinity mask separately.
>  	 */
> -	nr_sets = affd->nr_sets;
> -	if (!nr_sets)
> -		nr_sets = 1;
> +	if (!affd->nr_sets) {
> +		affd->nr_sets = 1;
> +		affd->set_vectors[0] = affvecs;
> +	}
>  
> -	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> -		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> +	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
> +		int this_vecs = affd->set_vectors[i];
>  		int ret;
>  
>  		ret = irq_build_affinity_masks(affd, curvec, this_vecs,
> @@ -316,7 +320,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  		int i;
>  
>  		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> -			set_vecs += affd->sets[i];
> +			set_vecs += affd->set_vectors[i];
>  	} else {
>  		get_online_cpus();
>  		set_vecs = cpumask_weight(cpu_possible_mask);
> -- 
> 2.9.5
> 


* Re: [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors
  2019-02-13 10:50 ` [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors Ming Lei
@ 2019-02-13 15:11   ` Bjorn Helgaas
  2019-02-13 20:58     ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2019-02-13 15:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, Feb 13, 2019 at 06:50:39PM +0800, Ming Lei wrote:
> Currently pre-caculated set vectors are provided by driver for
> allocating & spread vectors. This way only works when drivers passes
> same 'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(),
> also requires driver to retry the allocating & spread.

s/pre-caculated/precalculated/
s/drivers/a driver/
s/also requires/which also requires the/

> As Bjorn and Keith mentioned, the current usage & interface for irq sets
> is a bit awkward because the retrying should have been avoided by providing
> one resonable 'min_vecs'. However, if 'min_vecs' isn't same with
> 'max_vecs', number of the allocated vectors is unknown before calling
> pci_alloc_irq_vectors_affinity(), then each set's vectors can't be
> pre-caculated.

s/resonable/reasonable/
s/isn't same with/isn't the same as/
s/number of/the number of/
s/pre-caculated/precalculated/
s/then each/and the/
s/irq/IRQ/

> Add a new callback of .calc_sets into 'struct irq_affinity' so that
> driver can caculate set vectors after IRQ vector is allocated and
> before spread IRQ vectors. Add 'priv' so that driver may retrieve
> its private data via the 'struct irq_affinity'.

s/caculate/calculate/

I *think* "set vectors" here has something to do with an affinity
cpumask?  "Set vectors" doesn't seem like quite the right terminology.

> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/interrupt.h | 4 ++++
>  kernel/irq/affinity.c     | 8 ++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index a20150627a32..7a27f6ba1f2f 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -269,12 +269,16 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @nr_sets:		Length of passed in *sets array
>   * @set_vectors:	Number of affinitized sets
> + * @calc_sets:		Callback for caculating set vectors

Possible s/set vectors/<something else>/ ?

> + * @priv:		Private data of @calc_sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
>  	int	nr_sets;
>  	int	set_vectors[IRQ_MAX_SETS];
> +	void	(*calc_sets)(struct irq_affinity *, int nvecs);
> +	void	*priv;
>  };
>  
>  /**
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index b868b9d3df7f..2341b1f005fa 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -267,7 +267,9 @@ irq_create_affinity_masks(int nvecs, struct irq_affinity *affd)
>  	 * Spread on present CPUs starting from affd->pre_vectors. If we
>  	 * have multiple sets, build each sets affinity mask separately.
>  	 */
> -	if (!affd->nr_sets) {
> +	if (affd->calc_sets) {
> +		affd->calc_sets(affd, nvecs);
> +	} else if (!affd->nr_sets) {
>  		affd->nr_sets = 1;
>  		affd->set_vectors[0] = affvecs;
>  	}
> @@ -316,7 +318,9 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  	if (resv > minvec)
>  		return 0;
>  
> -	if (affd->nr_sets) {
> +	if (affd->calc_sets) {
> +		set_vecs = vecs;
> +	} else if (affd->nr_sets) {
>  		int i;
>  
>  		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> -- 
> 2.9.5
> 


* Re: [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets
  2019-02-13 10:50 ` [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets Ming Lei
@ 2019-02-13 15:13   ` Bjorn Helgaas
  2019-02-13 21:26     ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2019-02-13 15:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, Feb 13, 2019 at 06:50:40PM +0800, Ming Lei wrote:
> Currently pre-caculate each set vectors, and this way requires same
> 'max_vecs' and 'min_vecs' passed to pci_alloc_irq_vectors_affinity(),
> then nvme_setup_irqs() has to retry in case of allocation failure.

s/pre-caculate/precalculate/
My usual "set vectors" question as on other patches.

> This usage & interface is a bit awkward because the retry should have
> been avoided by providing one reasonable 'min_vecs'.
> 
> Implement the callback of .calc_sets, so that pci_alloc_irq_vectors_affinity()
> can calculate each set's vector after IRQ vectors is allocated and
> before spread IRQ, then NVMe's retry in case of irq allocation failure
> can be removed.

s/irq/IRQ/

> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 62 +++++++++++++------------------------------------
>  1 file changed, 16 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 0086bdf80ea1..8c51252a897e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2078,14 +2078,25 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
>  	}
>  }
>  
> +static void nvme_calc_irq_sets(struct irq_affinity *affd, int nvecs)
> +{
> +	struct nvme_dev *dev = affd->priv;
> +
> +	nvme_calc_io_queues(dev, nvecs);
> +
> +	affd->set_vectors[HCTX_TYPE_DEFAULT] = dev->io_queues[HCTX_TYPE_DEFAULT];
> +	affd->set_vectors[HCTX_TYPE_READ] = dev->io_queues[HCTX_TYPE_READ];
> +	affd->nr_sets = 2;
> +}
> +
>  static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  	struct irq_affinity affd = {
>  		.pre_vectors = 1,
> -		.nr_sets = 2,
> +		.calc_sets = nvme_calc_irq_sets,
> +		.priv = dev,
>  	};
> -	int *irq_sets = affd.set_vectors;
>  	int result = 0;
>  	unsigned int irq_queues, this_p_queues;
>  
> @@ -2102,50 +2113,8 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
>  	}
>  	dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
>  
> -	/*
> -	 * For irq sets, we have to ask for minvec == maxvec. This passes
> -	 * any reduction back to us, so we can adjust our queue counts and
> -	 * IRQ vector needs.
> -	 */
> -	do {
> -		nvme_calc_io_queues(dev, irq_queues);
> -		irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
> -		irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
> -		if (!irq_sets[1])
> -			affd.nr_sets = 1;
> -
> -		/*
> -		 * If we got a failure and we're down to asking for just
> -		 * 1 + 1 queues, just ask for a single vector. We'll share
> -		 * that between the single IO queue and the admin queue.
> -		 * Otherwise, we assign one independent vector to admin queue.
> -		 */
> -		if (irq_queues > 1)
> -			irq_queues = irq_sets[0] + irq_sets[1] + 1;
> -
> -		result = pci_alloc_irq_vectors_affinity(pdev, irq_queues,
> -				irq_queues,
> -				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> -
> -		/*
> -		 * Need to reduce our vec counts. If we get ENOSPC, the
> -		 * platform should support mulitple vecs, we just need
> -		 * to decrease our ask. If we get EINVAL, the platform
> -		 * likely does not. Back down to ask for just one vector.
> -		 */
> -		if (result == -ENOSPC) {
> -			irq_queues--;
> -			if (!irq_queues)
> -				return result;
> -			continue;
> -		} else if (result == -EINVAL) {
> -			irq_queues = 1;
> -			continue;
> -		} else if (result <= 0)
> -			return -EIO;
> -		break;
> -	} while (1);
> -
> +	result = pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
> +			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
>  	return result;
>  }
>  
> @@ -3021,6 +2990,7 @@ static struct pci_driver nvme_driver = {
>  
>  static int __init nvme_init(void)
>  {
> +	BUILD_BUG_ON(2 > IRQ_MAX_SETS);

"IRQ_MAX_SETS < 2" would read more naturally; is there a reason to have it
reversed?

>  	return pci_register_driver(&nvme_driver);
>  }
>  
> -- 
> 2.9.5
> 


* Re: [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets
  2019-02-13 10:50 ` [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets Ming Lei
@ 2019-02-13 15:16   ` Bjorn Helgaas
  0 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2019-02-13 15:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, Feb 13, 2019 at 06:50:41PM +0800, Ming Lei wrote:
> Now NVMe has implemented the .calc_sets callback for caculating each
> set's vectors.

s/caculating/calculating/

> For other cases of multiple IRQ sets, pre-caculating each set's vectors
> before allocating IRQ vectors can't work because the whole vectors
> number is unknow at that time.

s/unknow/unknown/
Maybe spell check could be helpful?

> So document .calc_sets as required explicitly for multiple sets.
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/pci/msi.c         | 16 ++++++++++------
>  include/linux/interrupt.h |  3 ++-
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 96978459e2a0..199d708b4099 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1036,10 +1036,12 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  		return -ERANGE;
>  
>  	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * vectors. The caller needs to handle that.
> +	 * If the caller requests multiple sets of IRQs where each set
> +	 * requires different affinity, it must also supply a ->calc_sets()
> +	 * callback to compute vectors for each set after whole vectors are
> +	 * allocated.

"compute vectors for each set after whole vectors are allocated"
doesn't read smoothly.  The caller is requesting "multiple sets of
IRQs".  So each set contains several IRQ vectors.  IIUC, the vectors
calc_sets() are computing are something other than IRQ vectors.

>  	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> +	if (affd && affd->nr_sets > 1 && !affd->calc_sets)
>  		return -EINVAL;
>  
>  	if (WARN_ON_ONCE(dev->msi_enabled))
> @@ -1094,10 +1096,12 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  		return -ERANGE;
>  
>  	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * supported vectors. The caller needs to handle that.
> +	 * If the caller requests multiple sets of IRQs where each set
> +	 * requires different affinity, it must also supply a ->calc_sets()
> +	 * callback to compute vectors for each set after whole vectors are
> +	 * allocated.
>  	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> +	if (affd && affd->nr_sets > 1 && !affd->calc_sets)
>  		return -EINVAL;
>  
>  	if (WARN_ON_ONCE(dev->msix_enabled))
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 7a27f6ba1f2f..a053f7fb0ff1 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -269,7 +269,8 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @nr_sets:		Length of passed in *sets array
>   * @set_vectors:	Number of affinitized sets
> - * @calc_sets:		Callback for caculating set vectors
> + * @calc_sets:		Callback for caculating set vectors, required for
> + * 			multiple irq sets.
>   * @priv:		Private data of @calc_sets
>   */
>  struct irq_affinity {
> -- 
> 2.9.5
> 


* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 15:04   ` Bjorn Helgaas
@ 2019-02-13 20:56     ` Thomas Gleixner
  2019-02-13 21:31       ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-13 20:56 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, 13 Feb 2019, Bjorn Helgaas wrote:

> On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> > Currently all parameters in 'affd' are read-only, so 'affd' is marked
> > as const in both pci_alloc_irq_vectors_affinity() and irq_create_affinity_masks().
> 
> s/all parameters in 'affd'/the contents of '*affd'/
> 
> > We have to ask driver to re-caculate set vectors after the whole IRQ
> > vectors are allocated later, and the result needs to be stored in 'affd'.
> > Also both the two interfaces are core APIs, which should be trusted.
> 
> s/re-caculate/recalculate/
> s/stored in 'affd'/stored in '*affd'/
> s/both the two/both/
> 
> This is a little confusing because you're talking about both "IRQ
> vectors" and these other "set vectors", which I think are different
> things.  I assume the "set vectors" are cpumasks showing the affinity
> of the IRQ vectors with some CPUs?

I think we should drop the whole vector wording completely.

The driver does not care about vectors, it only cares about a block of
interrupt numbers. These numbers are kernel managed and the interrupts just
happen to have a CPU vector assigned at some point. Depending on the CPU
architecture the underlying mechanism might not even be named vector.

> AFAICT, *this* patch doesn't add anything that writes to *affd.  I
> think the removal of "const" should be in the same patch that makes
> the removal necessary.

So this should be:

   The interrupt affinity spreading mechanism supports spreading out
   affinities for one or more interrupt sets. An interrupt set contains
   one or more interrupts. Each set is mapped to a specific functionality
   of a device, e.g. general I/O queues and read I/O queues of multiqueue
   block devices.

   The number of interrupts per set is defined by the driver. It depends
   on the total number of available interrupts for the device, which is
   determined by the PCI capabilities and the availability of underlying
   CPU resources, and the number of queues which the device provides and
   the driver wants to instantiate.

   The driver passes initial configuration for the interrupt allocation via
   a pointer to struct affinity_desc.

   Right now the allocation mechanism is complex as it requires a loop in
   the driver to determine the maximum number of interrupts which are
   provided by the PCI capabilities and the underlying CPU resources.
   This loop would have to be replicated in every driver which wants to
   utilize this mechanism. That's unwanted code duplication and is error
   prone.

   In order to move this into generic facilities it is required to have a
   mechanism, which allows the recalculation of the interrupt sets and
   their size, in the core code. As the core code does not have any
   knowledge about the underlying device, a driver specific callback will
   be added to struct affinity_desc, which will be invoked by the core
   code. The callback will get the number of available interrupts as an
   argument, so the driver can calculate the corresponding number and size
   of interrupt sets.

   To support this, two modifications for the handling of struct
   affinity_desc are required:

   1) The (optional) interrupt sets size information is contained in a
      separate array of integers and struct affinity_desc contains a
      pointer to it.

      This is cumbersome and as the maximum number of interrupt sets is
      small, there is no reason to have separate storage. Moving the size
      array into struct affinity_desc avoids indirections and makes the code
      simpler.

   2) At the moment the struct affinity_desc pointer which is handed in from
      the driver and passed through to several core functions is marked
      'const'.

      With the upcoming callback to recalculate the number and size of
      interrupt sets, it's necessary to remove the 'const'
      qualifier. Otherwise the callback would not be able to update the
      data.

   Move the set size array into struct affinity_desc as a first preparatory
   step. The removal of the 'const' qualifier will be done when adding the
   callback.

IOW, the first patch moves the set array into the struct itself.

The second patch introduces the callback and removes the 'const'
qualifier. I wouldn't mind to have the same changelog duplicated (+/- the
last two paragraphs which need some update of course).

Thanks,

	tglx


* Re: [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors
  2019-02-13 15:11   ` Bjorn Helgaas
@ 2019-02-13 20:58     ` Thomas Gleixner
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-13 20:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, 13 Feb 2019, Bjorn Helgaas wrote:
> > Add a new callback of .calc_sets into 'struct irq_affinity' so that
> > driver can caculate set vectors after IRQ vector is allocated and
> > before spread IRQ vectors. Add 'priv' so that driver may retrieve
> > its private data via the 'struct irq_affinity'.
> 
> s/caculate/calculate/
> 
> I *think* "set vectors" here has something to do with an affinity
> cpumask?  "Set vectors" doesn't seem like quite the right terminology.

See previous mail. Let's stay with 'interrupt sets' and get rid of the
vector naming completely.

Thanks,

	tglx


* Re: [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets
  2019-02-13 15:13   ` Bjorn Helgaas
@ 2019-02-13 21:26     ` Thomas Gleixner
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-13 21:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-pci, Keith Busch

On Wed, 13 Feb 2019, Bjorn Helgaas wrote:

> On Wed, Feb 13, 2019 at 06:50:40PM +0800, Ming Lei wrote:
> > Currently pre-caculate each set vectors, and this way requires same
> > 'max_vecs' and 'min_vecs' passed to pci_alloc_irq_vectors_affinity(),
> > then nvme_setup_irqs() has to retry in case of allocation failure.
> 
> s/pre-caculate/precalculate/
> My usual "set vectors" question as on other patches.
> 
> > This usage & interface is a bit awkward because the retry should have
> > been avoided by providing one reasonable 'min_vecs'.
> > 
> > Implement the callback of .calc_sets, so that pci_alloc_irq_vectors_affinity()
> > can calculate each set's vector after IRQ vectors is allocated and
> > before spread IRQ, then NVMe's retry in case of irq allocation failure
> > can be removed.
> 
> s/irq/IRQ/

Let me rephrase that thing as well

  Subject: nvme-pci: Simplify interrupt allocation

  The NVME PCI driver contains a tedious mechanism for interrupt
  allocation, which is necessary to adjust the number and size of interrupt
  sets to the maximum available number of interrupts which depends on the
  underlying PCI capabilities and the available CPU resources.

  It works around the former shortcomings of the PCI and core interrupt
  allocation mechanisms in combination with interrupt sets.

  The PCI interrupt allocation function lets the caller provide a maximum
  and a minimum number of interrupts to be allocated and tries to allocate
  as many as possible. This worked without driver interaction as long as
  there was only a single set of interrupts to handle.

  With the addition of support for multiple interrupt sets in the generic
  affinity spreading logic, which is invoked from the PCI interrupt
  allocation, the adaptive loop in the PCI interrupt allocation did not
  work for multiple interrupt sets. The reason is that depending on the
  total number of interrupts which the PCI allocation adaptive loop tries
  to allocate in each step, the number and the size of the interrupt sets
  need to be adapted as well. Due to the way the interrupt sets support was
  implemented there was no way for the PCI interrupt allocation code or the
  core affinity spreading mechanism to invoke a driver specific function
  for adapting the interrupt sets configuration.

  As a consequence the driver had to implement another adaptive loop around
  the PCI interrupt allocation function, calling it with the maximum and
  minimum number of interrupts set to the same value. This ensured that the
  allocation either succeeded or immediately failed without any attempt to
  adjust the number of interrupts in the PCI code.

  The core code now allows drivers to provide a callback to recalculate the
  number and the size of interrupt sets during PCI interrupt allocation,
  which in turn allows the PCI interrupt allocation function to be called
  in the same way as with a single set of interrupts. The PCI code handles
  the adaptive loop and the interrupt affinity spreading mechanism invokes
  the driver callback to adapt the interrupt set configuration to the
  current loop value. This replaces the adaptive loop in the driver
  completely.

  Implement the NVME specific callback which adjusts the interrupt sets
  configuration and remove the adaptive allocation loop.
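
As a sketch only (the split policy and the exact field and variable
names are illustrative assumptions, not the actual driver code), the
callback and its hookup in nvme_setup_irqs() boil down to something like:

  static void nvme_calc_irq_sets(struct irq_affinity *affd,
  				 unsigned int nrirqs)
  {
  	/* made-up policy: give half of the interrupts to read queues */
  	unsigned int nr_read = nrirqs > 1 ? nrirqs / 2 : 0;

  	affd->nr_sets = 2;
  	affd->set_size[0] = nrirqs - nr_read;	/* default (write) queues */
  	affd->set_size[1] = nr_read;		/* read queues */
  }

  	struct irq_affinity affd = {
  		.pre_vectors	= 1,		/* admin queue, not spread */
  		.calc_sets	= nvme_calc_irq_sets,
  		.priv		= dev,		/* available via affd->priv */
  	};

  	/* single call, no driver side retry loop anymore */
  	return pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
  			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);

The PCI code then handles the adaptive loop itself and the affinity
spreading code invokes the callback with the current total, so the
driver side retry loop can go away.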

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 20:56     ` Thomas Gleixner
@ 2019-02-13 21:31       ` Keith Busch
  2019-02-13 21:41         ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2019-02-13 21:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci,
	linux-kernel, linux-nvme, Ming Lei, linux-block,
	Christoph Hellwig

On Wed, Feb 13, 2019 at 09:56:36PM +0100, Thomas Gleixner wrote:
> On Wed, 13 Feb 2019, Bjorn Helgaas wrote:
> > On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> > > We have to ask driver to re-caculate set vectors after the whole IRQ
> > > vectors are allocated later, and the result needs to be stored in 'affd'.
> > > Also both the two interfaces are core APIs, which should be trusted.
> > 
> > s/re-caculate/recalculate/
> > s/stored in 'affd'/stored in '*affd'/
> > s/both the two/both/
> > 
> > This is a little confusing because you're talking about both "IRQ
> > vectors" and these other "set vectors", which I think are different
> > things.  I assume the "set vectors" are cpumasks showing the affinity
> > of the IRQ vectors with some CPUs?
> 
> I think we should drop the whole vector wording completely.
> 
> The driver does not care about vectors, it only cares about a block of
> interrupt numbers. These numbers are kernel managed and the interrupts just
> happen to have a CPU vector assigned at some point. Depending on the CPU
> architecture the underlying mechanism might not even be named vector.

Perhaps longer term we could move affinity mask creation from the irq
subsystem into a more generic library. Interrupts aren't the only
resource that wants to be spread across CPUs. For example, blk-mq has its
own implementation for polled queues, so I think a non-irq specific
implementation would be a nice addition to the kernel lib.
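
For the sake of discussion, a deliberately naive sketch of what such a
lib helper could look like (the name and the plain round-robin policy
are made up here; the real genirq code is NUMA-aware and deals with CPU
hotplug, which this ignores):

  /* needs <linux/slab.h> and <linux/cpumask.h> */
  static struct cpumask *spread_groups_evenly(unsigned int numgrps)
  {
  	struct cpumask *masks;
  	unsigned int cpu, grp = 0;

  	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
  	if (!masks)
  		return NULL;

  	/* hand out the online CPUs round-robin across the groups */
  	for_each_online_cpu(cpu) {
  		cpumask_set_cpu(cpu, &masks[grp]);
  		grp = (grp + 1) % numgrps;
  	}
  	return masks;
  }

A caller like blk-mq could then map its polled queues with the returned
masks instead of rolling its own spreading logic.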

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 21:31       ` Keith Busch
@ 2019-02-13 21:41         ` Thomas Gleixner
  2019-02-13 22:37           ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-13 21:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci,
	linux-kernel, linux-nvme, Ming Lei, linux-block,
	Christoph Hellwig

On Wed, 13 Feb 2019, Keith Busch wrote:
> On Wed, Feb 13, 2019 at 09:56:36PM +0100, Thomas Gleixner wrote:
> > On Wed, 13 Feb 2019, Bjorn Helgaas wrote:
> > > On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> > > > We have to ask driver to re-caculate set vectors after the whole IRQ
> > > > vectors are allocated later, and the result needs to be stored in 'affd'.
> > > > Also both the two interfaces are core APIs, which should be trusted.
> > > 
> > > s/re-caculate/recalculate/
> > > s/stored in 'affd'/stored in '*affd'/
> > > s/both the two/both/
> > > 
> > > This is a little confusing because you're talking about both "IRQ
> > > vectors" and these other "set vectors", which I think are different
> > > things.  I assume the "set vectors" are cpumasks showing the affinity
> > > of the IRQ vectors with some CPUs?
> > 
> > I think we should drop the whole vector wording completely.
> > 
> > The driver does not care about vectors, it only cares about a block of
> > interrupt numbers. These numbers are kernel managed and the interrupts just
> > happen to have a CPU vector assigned at some point. Depending on the CPU
> > architecture the underlying mechanism might not even be named vector.
> 
> Perhaps longer term we could move affinity mask creation from the irq
> subsystem into a more generic library. Interrupts aren't the only
> resource that want to spread across CPUs. For example, blk-mq has it's
> own implementation to for polled queues, so I think a non-irq specific
> implementation would be a nice addition to the kernel lib.

Agreed. There is nothing interrupt specific in that code aside from some
name choices.

Btw, while I have your attention. There popped up an issue recently related
to that affinity logic.

The current implementation fails when:

        /*
         * If there aren't any vectors left after applying the pre/post
         * vectors don't bother with assigning affinity.
	 */
	if (nvecs == affd->pre_vectors + affd->post_vectors)
    		return NULL;

Now the discussion arose that in that case the affinity sets are not
allocated and filled in for the pre/post vectors, but somehow the
underlying device still works and later on triggers the warning in the
blk-mq code because the MSI entries do not have affinity information
attached.

Sure, we could make that work, but there are several issues:

    1) irq_create_affinity_masks() has another reason to return NULL:
       memory allocation fails.

    2) Does it make sense at all.

Right now the PCI allocator ignores the NULL return and proceeds without
setting any affinities. As a consequence nothing is managed and everything
happens to work.

But that it happens to work is more by chance than by design, and the warning
is bogus if this is an expected mode of operation.

We should address these points in some way.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 21:41         ` Thomas Gleixner
@ 2019-02-13 22:37           ` Keith Busch
  2019-02-14  8:50             ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2019-02-13 22:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci,
	linux-kernel, linux-nvme, Ming Lei, linux-block,
	Christoph Hellwig

On Wed, Feb 13, 2019 at 10:41:55PM +0100, Thomas Gleixner wrote:
> Btw, while I have your attention. There popped up an issue recently related
> to that affinity logic.
> 
> The current implementation fails when:
> 
>         /*
>          * If there aren't any vectors left after applying the pre/post
>          * vectors don't bother with assigning affinity.
> 	 */
> 	if (nvecs == affd->pre_vectors + affd->post_vectors)
>     		return NULL;
> 
> Now the discussion arised, that in that case the affinity sets are not
> allocated and filled in for the pre/post vectors, but somehow the
> underlying device still works and later on triggers the warning in the
> blk-mq code because the MSI entries do not have affinity information
> attached.
>
> Sure, we could make that work, but there are several issues:
> 
>     1) irq_create_affinity_masks() has another reason to return NULL:
>        memory allocation fails.
> 
>     2) Does it make sense at all.
> 
> Right now the PCI allocator ignores the NULL return and proceeds without
> setting any affinities. As a consequence nothing is managed and everything
> happens to work.
> 
> But that happens to work is more by chance than by design and the warning
> is bogus if this is an expected mode of operation.
> 
> We should address these points in some way.

Ah, yes, that's a mistake in the nvme driver. It is assuming IO queues are
always on managed interrupts, but that's not true when only 1 vector
could be allocated. This should be an appropriate fix to the warning:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 022ea1ee63f8..f2ccebe1c926 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -506,7 +506,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		 * affinity), so use the regular blk-mq cpu mapping
 		 */
 		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL)
+		if (i != HCTX_TYPE_POLL && dev->num_vecs > 1)
 			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
 		else
 			blk_mq_map_queues(map);
--

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-13 22:37           ` Keith Busch
@ 2019-02-14  8:50             ` Thomas Gleixner
  2019-02-14 13:04               ` 陈华才
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-14  8:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci, LKML,
	linux-nvme, Ming Lei, linux-block, Christoph Hellwig,
	Huacai Chen

On Wed, 13 Feb 2019, Keith Busch wrote:

Cc+ Huacai Chen

> On Wed, Feb 13, 2019 at 10:41:55PM +0100, Thomas Gleixner wrote:
> > Btw, while I have your attention. There popped up an issue recently related
> > to that affinity logic.
> > 
> > The current implementation fails when:
> > 
> >         /*
> >          * If there aren't any vectors left after applying the pre/post
> >          * vectors don't bother with assigning affinity.
> > 	 */
> > 	if (nvecs == affd->pre_vectors + affd->post_vectors)
> >     		return NULL;
> > 
> > Now the discussion arised, that in that case the affinity sets are not
> > allocated and filled in for the pre/post vectors, but somehow the
> > underlying device still works and later on triggers the warning in the
> > blk-mq code because the MSI entries do not have affinity information
> > attached.
> >
> > Sure, we could make that work, but there are several issues:
> > 
> >     1) irq_create_affinity_masks() has another reason to return NULL:
> >        memory allocation fails.
> > 
> >     2) Does it make sense at all.
> > 
> > Right now the PCI allocator ignores the NULL return and proceeds without
> > setting any affinities. As a consequence nothing is managed and everything
> > happens to work.
> > 
> > But that happens to work is more by chance than by design and the warning
> > is bogus if this is an expected mode of operation.
> > 
> > We should address these points in some way.
> 
> Ah, yes, that's a mistake in the nvme driver. It is assuming IO queues are
> always on managed interrupts, but that's not true if when only 1 vector
> could be allocated. This should be an appropriate fix to the warning:

Looks correct. Chen, can you please test that?

> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 022ea1ee63f8..f2ccebe1c926 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -506,7 +506,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		 * affinity), so use the regular blk-mq cpu mapping
>  		 */
>  		map->queue_offset = qoff;
> -		if (i != HCTX_TYPE_POLL)
> +		if (i != HCTX_TYPE_POLL && dev->num_vecs > 1)
>  			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
>  		else
>  			blk_mq_map_queues(map);
> --
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-14  8:50             ` Thomas Gleixner
@ 2019-02-14 13:04               ` 陈华才
  2019-02-14 13:31                 ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: 陈华才 @ 2019-02-14 13:04 UTC (permalink / raw)
  To: Thomas Gleixner, Keith Busch
  Cc: Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci, LKML,
	linux-nvme, Ming Lei, linux-block, Christoph Hellwig

I'll test next week, but 4.19 has the same problem; how do we fix that for 4.19?

Huacai
 
------------------ Original ------------------
From:  "Thomas Gleixner"<tglx@linutronix.de>;
Date:  Thu, Feb 14, 2019 04:50 PM
To:  "Keith Busch"<keith.busch@intel.com>;
Cc:  "Bjorn Helgaas"<helgaas@kernel.org>; "Jens Axboe"<axboe@kernel.dk>; "Sagi Grimberg"<sagi@grimberg.me>; "linux-pci"<linux-pci@vger.kernel.org>; "LKML"<linux-kernel@vger.kernel.org>; "linux-nvme"<linux-nvme@lists.infradead.org>; "Ming Lei"<ming.lei@redhat.com>; "linux-block"<linux-block@vger.kernel.org>; "Christoph Hellwig"<hch@lst.de>; "Huacai Chen"<chenhc@lemote.com>;
Subject:  Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
 
On Wed, 13 Feb 2019, Keith Busch wrote:

Cc+ Huacai Chen

> On Wed, Feb 13, 2019 at 10:41:55PM +0100, Thomas Gleixner wrote:
> > Btw, while I have your attention. There popped up an issue recently related
> > to that affinity logic.
> > 
> > The current implementation fails when:
> > 
> >         /*
> >          * If there aren't any vectors left after applying the pre/post
> >          * vectors don't bother with assigning affinity.
> > 	 */
> > 	if (nvecs == affd->pre_vectors + affd->post_vectors)
> > 		return NULL;
> > 
> > Now the discussion arised, that in that case the affinity sets are not
> > allocated and filled in for the pre/post vectors, but somehow the
> > underlying device still works and later on triggers the warning in the
> > blk-mq code because the MSI entries do not have affinity information
> > attached.
> >
> > Sure, we could make that work, but there are several issues:
> > 
> >     1) irq_create_affinity_masks() has another reason to return NULL:
> >        memory allocation fails.
> > 
> >     2) Does it make sense at all.
> > 
> > Right now the PCI allocator ignores the NULL return and proceeds without
> > setting any affinities. As a consequence nothing is managed and everything
> > happens to work.
> > 
> > But that happens to work is more by chance than by design and the warning
> > is bogus if this is an expected mode of operation.
> > 
> > We should address these points in some way.
> 
> Ah, yes, that's a mistake in the nvme driver. It is assuming IO queues are
> always on managed interrupts, but that's not true if when only 1 vector
> could be allocated. This should be an appropriate fix to the warning:

Looks correct. Chen, can you please test that?

> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 022ea1ee63f8..f2ccebe1c926 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -506,7 +506,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		 * affinity), so use the regular blk-mq cpu mapping
>  		 */
>  		map->queue_offset = qoff;
> -		if (i != HCTX_TYPE_POLL)
> +		if (i != HCTX_TYPE_POLL && dev->num_vecs > 1)
>  			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
>  		else
>  			blk_mq_map_queues(map);
> --
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-14 13:04               ` 陈华才
@ 2019-02-14 13:31                 ` Thomas Gleixner
  2019-02-19  0:42                   ` 陈华才
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-14 13:31 UTC (permalink / raw)
  To: 陈华才
  Cc: Keith Busch, Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci,
	LKML, linux-nvme, Ming Lei, linux-block, Christoph Hellwig

On Thu, 14 Feb 2019, 陈华才 wrote:

Please do not top post....

> I'll test next week, but 4.19 has the same problem, how to fix that for 4.19?

By applying the very same patch perhaps?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-14 13:31                 ` Thomas Gleixner
@ 2019-02-19  0:42                   ` 陈华才
  2019-02-19  6:19                     ` Thomas Gleixner
  2019-02-19 16:12                     ` Keith Busch
  0 siblings, 2 replies; 24+ messages in thread
From: 陈华才 @ 2019-02-19  0:42 UTC (permalink / raw)
  To: Thomas Gleixner, Huacai Chen
  Cc: Keith Busch, Bjorn Helgaas, Jens Axboe, Sagi Grimberg, linux-pci,
	LKML, linux-nvme, Ming Lei, linux-block, Christoph Hellwig

My lemote.com email client is too weak; I'll stop top posting now that I've
cc-ed my gmail.

I've tested it; this patch fixes the nvme problem, but it can't be applied
to 4.19 because of a different context. And I still think my original solution
(genirq/affinity: Assign default affinity to pre/post vectors) is correct.
There may be similar problems besides nvme.
 
Huacai

------------------ Original ------------------
From:  "Thomas Gleixner"<tglx@linutronix.de>;
Date:  Thu, Feb 14, 2019 09:31 PM
To:  "陈华才"<chenhc@lemote.com>;
Cc:  "Keith Busch"<keith.busch@intel.com>; "Bjorn Helgaas"<helgaas@kernel.org>; "Jens Axboe"<axboe@kernel.dk>; "Sagi Grimberg"<sagi@grimberg.me>; "linux-pci"<linux-pci@vger.kernel.org>; "LKML"<linux-kernel@vger.kernel.org>; "linux-nvme"<linux-nvme@lists.infradead.org>; "Ming Lei"<ming.lei@redhat.com>; "linux-block"<linux-block@vger.kernel.org>; "Christoph Hellwig"<hch@lst.de>;
Subject:  Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
 
On Thu, 14 Feb 2019, 陈华才 wrote:

Please do not top post....

> I'll test next week, but 4.19 has the same problem, how to fix that for 4.19?

By applying the very same patch perhaps?

Thanks,

tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-19  0:42                   ` 陈华才
@ 2019-02-19  6:19                     ` Thomas Gleixner
  2019-02-19 16:12                     ` Keith Busch
  1 sibling, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2019-02-19  6:19 UTC (permalink / raw)
  To: 陈华才
  Cc: Huacai Chen, Keith Busch, Bjorn Helgaas, Jens Axboe,
	Sagi Grimberg, linux-pci, LKML, linux-nvme, Ming Lei,
	linux-block, Christoph Hellwig

On Tue, 19 Feb 2019, 陈华才 wrote:
> 
> I've tested, this patch can fix the nvme problem, but it can't be applied
> to 4.19 because of different context. And, I still think my original solution
> (genirq/affinity: Assign default affinity to pre/post vectors) is correct.
> There may be similar problems except nvme.

As I explained to you at great length, it is not correct because it's just
papering over the underlying problem.

Please stop advertising semantically broken duct tape solutions.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const
  2019-02-19  0:42                   ` 陈华才
  2019-02-19  6:19                     ` Thomas Gleixner
@ 2019-02-19 16:12                     ` Keith Busch
  1 sibling, 0 replies; 24+ messages in thread
From: Keith Busch @ 2019-02-19 16:12 UTC (permalink / raw)
  To: 陈华才
  Cc: Thomas Gleixner, Huacai Chen, Bjorn Helgaas, Jens Axboe,
	Sagi Grimberg, linux-pci, LKML, linux-nvme, Ming Lei,
	linux-block, Christoph Hellwig

On Mon, Feb 18, 2019 at 04:42:27PM -0800, 陈华才 wrote:
> I've tested, this patch can fix the nvme problem, but it can't be applied
> to 4.19 because of different context. And, I still think my original solution
> (genirq/affinity: Assign default affinity to pre/post vectors) is correct.
> There may be similar problems except nvme.

I'll send a proper patch to fix the warning.  It's always a driver bug if
they try to extract affinity from a non-managed irq, but I don't think
this particular nvme instance is very urgent as the end result is the
same, just without the warning.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-02-19 16:13 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-13 10:50 [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Ming Lei
2019-02-13 10:50 ` [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const Ming Lei
2019-02-13 15:04   ` Bjorn Helgaas
2019-02-13 20:56     ` Thomas Gleixner
2019-02-13 21:31       ` Keith Busch
2019-02-13 21:41         ` Thomas Gleixner
2019-02-13 22:37           ` Keith Busch
2019-02-14  8:50             ` Thomas Gleixner
2019-02-14 13:04               ` 陈华才
2019-02-14 13:31                 ` Thomas Gleixner
2019-02-19  0:42                   ` 陈华才
2019-02-19  6:19                     ` Thomas Gleixner
2019-02-19 16:12                     ` Keith Busch
2019-02-13 10:50 ` [PATCH V3 2/5] genirq/affinity: store irq set vectors in 'struct irq_affinity' Ming Lei
2019-02-13 15:07   ` Bjorn Helgaas
2019-02-13 10:50 ` [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors Ming Lei
2019-02-13 15:11   ` Bjorn Helgaas
2019-02-13 20:58     ` Thomas Gleixner
2019-02-13 10:50 ` [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets Ming Lei
2019-02-13 15:13   ` Bjorn Helgaas
2019-02-13 21:26     ` Thomas Gleixner
2019-02-13 10:50 ` [PATCH V3 5/5] genirq/affinity: Document .calc_sets as required in case of multiple sets Ming Lei
2019-02-13 15:16   ` Bjorn Helgaas
2019-02-13 14:36 ` [PATCH V3 0/5] genirq/affinity: add .calc_sets for improving IRQ allocation & spread Jens Axboe
