* automatic interrupt affinity for MSI/MSI-X capable devices V3
@ 2016-07-04  8:39 Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

This series enhances the irq and PCI code to allow spreading around MSI and
MSI-X vectors so that they have per-cpu affinity if possible, or at least
per-node.  For that it takes the algorithm from blk-mq, moves it to
a common place, and makes it available through a vastly simplified PCI
interrupt allocation API.  It then switches blk-mq to be able to pick up
the queue mapping from the device if available, and demonstrates all this
using the NVMe driver.
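
For illustration, here is a minimal sketch of the resulting driver-side
flow (hypothetical foo_* names; error unwinding trimmed):

	int i, nvec;

	/* allocate up to one vector per queue, spread across the CPUs */
	nvec = pci_alloc_irq_vectors(pdev, 1, foo->nr_queues, 0);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++)
		request_irq(pci_irq_vector(pdev, i), foo_queue_handler,
			    0, "foo", &foo->queues[i]);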

Compared to the last posting the core IRQ changes are stable and it would
be great to get them merged into the tip tree.  The two PCI patches have
been completely rewritten after feedback from Alexander, while the block
changes have also been stable.

There also is a git tree available at:

   git://git.infradead.org/users/hch/block.git msix-spreading.6

Gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/msix-spreading.6

Changes since V2:
 - improve the description of IRQD_AFFINITY_MANAGED
 - update MSI-HOWTO.txt
 - add a PCI_IRQ_NOMSI flag to avoid using MSI vectors
 - add a PCI_IRQ_NOAFFINITY flag to skip auto affinity
 - change the irq_create_affinity_mask calling convention
 - rewrite pci_alloc_irq_vectors to create the affinity mask only after
   we know the final vector count
 - cleanup pci_free_irq_vectors
 - replace pdev->irqs with pdev->msix_vectors and introduce
   a pci_irq_vector helper to get the Linux IRQ numbers

Changes since V1:
 - irq core improvements to properly assign the affinity before
   request_irq (tglx)
 - better handling of the MSI vs MSI-X differences in the low level
   MSI allocator (hch and tglx)
 - various improvements to pci_alloc_irq_vectors (hch)
 - remove blk-mq hardware queue reassignment on CPU hotplug events (hch)
 - forward ported to Jens' current for-next tree (hch)


* [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:32   ` [tip:irq/core] genirq/msi: " tip-bot for Thomas Gleixner
  2016-07-04  8:39 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

No user and we definitely don't want to grow one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 include/linux/msi.h | 6 ++----
 kernel/irq/msi.c    | 8 ++------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8b425c6..c33abfa 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -264,12 +264,10 @@ enum {
 	 * callbacks.
 	 */
 	MSI_FLAG_USE_DEF_CHIP_OPS	= (1 << 1),
-	/* Build identity map between hwirq and irq */
-	MSI_FLAG_IDENTITY_MAP		= (1 << 2),
 	/* Support multiple PCI MSI interrupts */
-	MSI_FLAG_MULTI_PCI_MSI		= (1 << 3),
+	MSI_FLAG_MULTI_PCI_MSI		= (1 << 2),
 	/* Support PCI MSIX interrupts */
-	MSI_FLAG_PCI_MSIX		= (1 << 4),
+	MSI_FLAG_PCI_MSIX		= (1 << 3),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 38e89ce..eb5bf2b 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -324,7 +324,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_domain_ops *ops = info->ops;
 	msi_alloc_info_t arg;
 	struct msi_desc *desc;
-	int i, ret, virq = -1;
+	int i, ret, virq;
 
 	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
 	if (ret)
@@ -332,12 +332,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	for_each_msi_entry(desc, dev) {
 		ops->set_desc(&arg, desc);
-		if (info->flags & MSI_FLAG_IDENTITY_MAP)
-			virq = (int)ops->get_hwirq(info, &arg);
-		else
-			virq = -1;
 
-		virq = __irq_domain_alloc_irqs(domain, virq, desc->nvec_used,
+		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false);
 		if (virq < 0) {
 			ret = -ENOSPC;
-- 
2.1.4


* [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:32   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
  2016-07-04  8:39 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Interrupts marked with this flag are excluded from user space interrupt
affinity changes. Unlike the IRQ_NO_BALANCING flag, the kernel-internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.
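
The distinction in a sketch (irq_can_set_affinity_usr is added below;
irq_can_set_affinity and irq_set_affinity are the existing interfaces):

	/* user space path, e.g. /proc/irq/N/smp_affinity: now rejected */
	if (!irq_can_set_affinity_usr(irq))
		return -EIO;

	/* kernel-internal path: still allowed to rebalance */
	if (irq_can_set_affinity(irq))
		irq_set_affinity(irq, mask);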

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq.h    |  7 +++++++
 kernel/irq/internals.h |  2 ++
 kernel/irq/manage.c    | 21 ++++++++++++++++++---
 kernel/irq/proc.c      |  2 +-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 4d758a7..f607481 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -197,6 +197,7 @@ struct irq_data {
  * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
  * IRQD_WAKEUP_ARMED		- Wakeup mode armed
  * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
+ * IRQD_AFFINITY_MANAGED	- Affinity is auto-managed by the kernel
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -212,6 +213,7 @@ enum {
 	IRQD_IRQ_INPROGRESS		= (1 << 18),
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
+	IRQD_AFFINITY_MANAGED		= (1 << 21),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -305,6 +307,11 @@ static inline void irqd_clr_forwarded_to_vcpu(struct irq_data *d)
 	__irqd_to_state(d) &= ~IRQD_FORWARDED_TO_VCPU;
 }
 
+static inline bool irqd_affinity_is_managed(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 09be2c9..b15aa3b 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -105,6 +105,8 @@ static inline void unregister_handler_proc(unsigned int irq,
 					   struct irqaction *action) { }
 #endif
 
+extern bool irq_can_set_affinity_usr(unsigned int irq);
+
 extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask);
 
 extern void irq_set_thread_affinity(struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ef0bc02..30658e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -115,12 +115,12 @@ EXPORT_SYMBOL(synchronize_irq);
 #ifdef CONFIG_SMP
 cpumask_var_t irq_default_affinity;
 
-static int __irq_can_set_affinity(struct irq_desc *desc)
+static bool __irq_can_set_affinity(struct irq_desc *desc)
 {
 	if (!desc || !irqd_can_balance(&desc->irq_data) ||
 	    !desc->irq_data.chip || !desc->irq_data.chip->irq_set_affinity)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
 
 /**
@@ -134,6 +134,21 @@ int irq_can_set_affinity(unsigned int irq)
 }
 
 /**
+ * irq_can_set_affinity_usr - Check if affinity of an irq can be set from user space
+ * @irq:	Interrupt to check
+ *
+ * Like irq_can_set_affinity() above, but additionally checks for the
+ * AFFINITY_MANAGED flag.
+ */
+bool irq_can_set_affinity_usr(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	return __irq_can_set_affinity(desc) &&
+		!irqd_affinity_is_managed(&desc->irq_data);
+}
+
+/**
  *	irq_set_thread_affinity - Notify irq threads to adjust affinity
  *	@desc:		irq descriptor which has affitnity changed
  *
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4e1b947..40bdcdc 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -96,7 +96,7 @@ static ssize_t write_irq_affinity(int type, struct file *file,
 	cpumask_var_t new_value;
 	int err;
 
-	if (!irq_can_set_affinity(irq) || no_irq_affinity)
+	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
 		return -EIO;
 
 	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
-- 
2.1.4


* [PATCH 03/13] irq: Add affinity hint to irq allocation
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:33   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
  2016-07-04  8:39 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Add an extra argument to the irq(domain) allocation functions, so we can hand
down affinity hints to the allocator. That's necessary to implement proper
support for multiqueue devices.
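
As a sketch of the extended calling convention (hypothetical caller; a
NULL mask keeps the old behaviour):

	/* hand the spreading hint down to the irqdesc allocator */
	virq = __irq_domain_alloc_irqs(domain, -1, nvec, dev_to_node(dev),
				       &arg, false, mask);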

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/sparc/kernel/irq_64.c     |  2 +-
 arch/x86/kernel/apic/io_apic.c |  5 +++--
 include/linux/irq.h            |  4 ++--
 include/linux/irqdomain.h      |  9 ++++++---
 kernel/irq/ipi.c               |  2 +-
 kernel/irq/irqdesc.c           | 12 ++++++++----
 kernel/irq/irqdomain.c         | 22 ++++++++++++++--------
 kernel/irq/manage.c            |  7 ++++---
 kernel/irq/msi.c               |  3 ++-
 9 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index e22416c..34a7930 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
 {
 	int irq;
 
-	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
+	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, NULL);
 	if (irq <= 0)
 		goto out;
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 84e33ff..bca0c81 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -981,7 +981,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,
 
 	return __irq_domain_alloc_irqs(domain, irq, 1,
 				       ioapic_alloc_attr_node(info),
-				       info, legacy);
+				       info, legacy, NULL);
 }
 
 /*
@@ -1014,7 +1014,8 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain,
 					  info->ioapic_pin))
 			return -ENOMEM;
 	} else {
-		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true);
+		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+					      NULL);
 		if (irq >= 0) {
 			irq_data = irq_domain_get_irq_data(domain, irq);
 			data = irq_data->chip_data;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f607481..39ce46a 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -708,11 +708,11 @@ static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		struct module *owner);
+		      struct module *owner, const struct cpumask *affinity);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
-	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index f1f36e0..1aee0fb 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -39,6 +39,7 @@ struct irq_domain;
 struct of_device_id;
 struct irq_chip;
 struct irq_data;
+struct cpumask;
 
 /* Number of irqs reserved for a legacy isa controller */
 #define NUM_ISA_INTERRUPTS	16
@@ -217,7 +218,8 @@ extern struct irq_domain *irq_find_matching_fwspec(struct irq_fwspec *fwspec,
 						   enum irq_domain_bus_token bus_token);
 extern void irq_set_default_host(struct irq_domain *host);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
-				  irq_hw_number_t hwirq, int node);
+				  irq_hw_number_t hwirq, int node,
+				  const struct cpumask *affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
@@ -389,7 +391,7 @@ static inline struct irq_domain *irq_domain_add_hierarchy(struct irq_domain *par
 
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 				   unsigned int nr_irqs, int node, void *arg,
-				   bool realloc);
+				   bool realloc, const struct cpumask *affinity);
 extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs);
 extern void irq_domain_activate_irq(struct irq_data *irq_data);
 extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
@@ -397,7 +399,8 @@ extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
 			unsigned int nr_irqs, int node, void *arg)
 {
-	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false);
+	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false,
+				       NULL);
 }
 
 extern int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 89b49f6..4fd2351 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -76,7 +76,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 8731e1c..b8df4fc 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -223,7 +223,7 @@ static void free_desc(unsigned int irq)
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
-		       struct module *owner)
+		       const struct cpumask *affinity, struct module *owner)
 {
 	struct irq_desc *desc;
 	int i;
@@ -333,6 +333,7 @@ static void free_desc(unsigned int irq)
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
+			      const struct cpumask *affinity,
 			      struct module *owner)
 {
 	u32 i;
@@ -453,12 +454,15 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @cnt:	Number of consecutive irqs to allocate.
  * @node:	Preferred node on which the irq descriptor should be allocated
  * @owner:	Owning module (can be NULL)
+ * @affinity:	Optional pointer to an affinity mask which hints where the
+ *		irq descriptors should be allocated and which default
+ *		affinities to use
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		  struct module *owner)
+		  struct module *owner, const struct cpumask *affinity)
 {
 	int start, ret;
 
@@ -494,7 +498,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
 	bitmap_set(allocated_irqs, start, cnt);
 	mutex_unlock(&sparse_irq_lock);
-	return alloc_descs(start, cnt, node, owner);
+	return alloc_descs(start, cnt, node, affinity, owner);
 
 err:
 	mutex_unlock(&sparse_irq_lock);
@@ -512,7 +516,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_alloc_hwirqs(int cnt, int node)
 {
-	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL);
+	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL, NULL);
 
 	if (irq < 0)
 		return 0;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 8798b6c..79459b7 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,7 +481,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node));
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -835,19 +835,23 @@ const struct irq_domain_ops irq_domain_simple_ops = {
 EXPORT_SYMBOL_GPL(irq_domain_simple_ops);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node)
+			   int node, const struct cpumask *affinity)
 {
 	unsigned int hint;
 
 	if (virq >= 0) {
-		virq = irq_alloc_descs(virq, virq, cnt, node);
+		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
+					 affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
-		virq = irq_alloc_descs_from(hint, cnt, node);
-		if (virq <= 0 && hint > 1)
-			virq = irq_alloc_descs_from(1, cnt, node);
+		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
+					 affinity);
+		if (virq <= 0 && hint > 1) {
+			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
+						 affinity);
+		}
 	}
 
 	return virq;
@@ -1160,6 +1164,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  * @node:	NUMA node id for memory allocation
  * @arg:	domain specific argument
  * @realloc:	IRQ descriptors have already been allocated if true
+ * @affinity:	Optional irq affinity mask for multiqueue devices
  *
  * Allocate IRQ numbers and initialized all data structures to support
  * hierarchy IRQ domains.
@@ -1175,7 +1180,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  */
 int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 			    unsigned int nr_irqs, int node, void *arg,
-			    bool realloc)
+			    bool realloc, const struct cpumask *affinity)
 {
 	int i, ret, virq;
 
@@ -1193,7 +1198,8 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 	if (realloc && irq_base >= 0) {
 		virq = irq_base;
 	} else {
-		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node);
+		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
+					      affinity);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 30658e9..ad0aac6 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -353,10 +353,11 @@ static int setup_affinity(struct irq_desc *desc, struct cpumask *mask)
 		return 0;
 
 	/*
-	 * Preserve an userspace affinity setup, but make sure that
-	 * one of the targets is online.
+	 * Preserve the managed affinity setting and an userspace affinity
+	 * setup, but make sure that one of the targets is online.
 	 */
-	if (irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
+	if (irqd_affinity_is_managed(&desc->irq_data) ||
+	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_common_data.affinity,
 				       cpu_online_mask))
 			set = desc->irq_common_data.affinity;
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index eb5bf2b..58dbbac 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -334,7 +334,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
-					       dev_to_node(dev), &arg, false);
+					       dev_to_node(dev), &arg, false,
+					       NULL);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)
-- 
2.1.4


* [PATCH 04/13] irq: Use affinity hint in irqdesc allocation
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:33   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
  2016-07-04  8:39 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator iterates
over the cpumask and for each set cpu it allocates the descriptor on that
cpu's node and sets the initial affinity to that cpu.

If a single interrupt is allocated (MSI-X) then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.
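
A worked example of the resulting placement, assuming CPUs 0-3 sit on
node 0 and CPUs 4-7 on node 1:

	/*
	 * affinity = {0,2,4,6}, cnt = 4 (multi-MSI):
	 *	desc 0 -> node of CPU 0, affinity {0}
	 *	desc 1 -> node of CPU 2, affinity {2}
	 *	desc 2 -> node of CPU 4, affinity {4}
	 *	desc 3 -> node of CPU 6, affinity {6}
	 *
	 * affinity = {0,2,4,6}, cnt = 1 (MSI-X, one allocation per vector):
	 *	desc 0 -> node of CPU 0, affinity {0,2,4,6}
	 */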

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/irqdesc.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b8df4fc..a623b44 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -68,9 +68,13 @@ static int alloc_masks(struct irq_desc *desc, gfp_t gfp, int node)
 	return 0;
 }
 
-static void desc_smp_init(struct irq_desc *desc, int node)
+static void desc_smp_init(struct irq_desc *desc, int node,
+			  const struct cpumask *affinity)
 {
-	cpumask_copy(desc->irq_common_data.affinity, irq_default_affinity);
+	if (!affinity)
+		affinity = irq_default_affinity;
+	cpumask_copy(desc->irq_common_data.affinity, affinity);
+
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_clear(desc->pending_mask);
 #endif
@@ -82,11 +86,12 @@ static void desc_smp_init(struct irq_desc *desc, int node)
 #else
 static inline int
 alloc_masks(struct irq_desc *desc, gfp_t gfp, int node) { return 0; }
-static inline void desc_smp_init(struct irq_desc *desc, int node) { }
+static inline void
+desc_smp_init(struct irq_desc *desc, int node, const struct cpumask *affinity) { }
 #endif
 
 static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
-		struct module *owner)
+			      const struct cpumask *affinity, struct module *owner)
 {
 	int cpu;
 
@@ -107,7 +112,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
-	desc_smp_init(desc, node);
+	desc_smp_init(desc, node, affinity);
 }
 
 int nr_irqs = NR_IRQS;
@@ -158,7 +163,9 @@ void irq_unlock_sparse(void)
 	mutex_unlock(&sparse_irq_lock);
 }
 
-static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
+static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
+				   const struct cpumask *affinity,
+				   struct module *owner)
 {
 	struct irq_desc *desc;
 	gfp_t gfp = GFP_KERNEL;
@@ -178,7 +185,8 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
 	lockdep_set_class(&desc->lock, &irq_desc_lock_class);
 	init_rcu_head(&desc->rcu);
 
-	desc_set_defaults(irq, desc, node, owner);
+	desc_set_defaults(irq, desc, node, affinity, owner);
+	irqd_set(&desc->irq_data, flags);
 
 	return desc;
 
@@ -225,11 +233,30 @@ static void free_desc(unsigned int irq)
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		       const struct cpumask *affinity, struct module *owner)
 {
+	const struct cpumask *mask = NULL;
 	struct irq_desc *desc;
-	int i;
+	unsigned int flags;
+	int i, cpu = -1;
+
+	if (affinity && cpumask_empty(affinity))
+		return -EINVAL;
+
+	flags = affinity ? IRQD_AFFINITY_MANAGED : 0;
 
 	for (i = 0; i < cnt; i++) {
-		desc = alloc_desc(start + i, node, owner);
+		if (affinity) {
+			cpu = cpumask_next(cpu, affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(affinity);
+			node = cpu_to_node(cpu);
+
+			/*
+			 * For single allocations we use the caller provided
+			 * mask otherwise we use the mask of the target cpu
+			 */
+			mask = cnt == 1 ? affinity : cpumask_of(cpu);
+		}
+		desc = alloc_desc(start + i, node, flags, mask, owner);
 		if (!desc)
 			goto err;
 		mutex_lock(&sparse_irq_lock);
@@ -277,7 +304,7 @@ int __init early_irq_init(void)
 		nr_irqs = initcnt;
 
 	for (i = 0; i < initcnt; i++) {
-		desc = alloc_desc(i, node, NULL);
+		desc = alloc_desc(i, node, 0, NULL, NULL);
 		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
@@ -311,7 +338,7 @@ int __init early_irq_init(void)
 		alloc_masks(&desc[i], GFP_KERNEL, node);
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
-		desc_set_defaults(i, &desc[i], node, NULL);
+		desc_set_defaults(i, &desc[i], node, NULL, NULL);
 	}
 	return arch_early_irq_init();
 }
@@ -328,7 +355,7 @@ static void free_desc(unsigned int irq)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL);
+	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 
-- 
2.1.4


* [PATCH 05/13] irq/msi: Make use of affinity aware allocations
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:33   ` [tip:irq/core] genirq/msi: " tip-bot for Thomas Gleixner
  2016-07-04  8:39 ` [PATCH 06/13] irq: add a helper to spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Allow the MSI code to provide affinity hints per MSI descriptor.
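
A sketch of how a bus layer can feed per-descriptor hints into this path
(hypothetical helper; the PCI side follows later in the series):

	static int foo_msi_alloc(struct irq_domain *domain, struct device *dev,
				 int nvec, const struct cpumask *hint)
	{
		struct msi_desc *desc;

		for_each_msi_entry(desc, dev)
			desc->affinity = hint;	/* may be NULL */

		return msi_domain_alloc_irqs(domain, dev, nvec);
	}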

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h | 2 ++
 kernel/irq/msi.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index c33abfa..4f0bfe5 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -47,6 +47,7 @@ struct fsl_mc_msi_desc {
  * @nvec_used:	The number of vectors used
  * @dev:	Pointer to the device which uses this descriptor
  * @msg:	The last set MSI message cached for reuse
+ * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
  *
  * @masked:	[PCI MSI/X] Mask bits
  * @is_msix:	[PCI MSI/X] True if MSI-X
@@ -67,6 +68,7 @@ struct msi_desc {
 	unsigned int			nvec_used;
 	struct device			*dev;
 	struct msi_msg			msg;
+	const struct cpumask		*affinity;
 
 	union {
 		/* PCI MSI/X specific data */
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 58dbbac..0e2a736 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -335,7 +335,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false,
-					       NULL);
+					       desc->affinity);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)
-- 
2.1.4


* [PATCH 06/13] irq: add a helper to spread an affinity mask for MSI/MSI-X vectors
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:34   ` [tip:irq/core] genirq: Add a helper to " tip-bot for Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

This is lifted from the blk-mq code and adapted to use the affinity mask
concept just introduced in the irq handling code.  It tries to keep the
algorithm the same as the one currently used by blk-mq, but improvements
like assigning vectors on a per-node basis instead of just per sibling are
possible on top of this simple move and refactoring.
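
A minimal usage sketch of the new helper (the caller owns the returned
mask; a NULL return with *nr_vecs set to 1 means no spreading; the
consumer below is hypothetical):

	unsigned int nvec = 8;
	struct cpumask *mask;

	mask = irq_create_affinity_mask(&nvec);

	/* nvec may have been reduced to the number of spread targets */
	foo_setup_vectors(dev, mask, nvec);

	kfree(mask);	/* kfree(NULL) is a no-op */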

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/interrupt.h |  8 +++++++
 kernel/irq/Makefile       |  1 +
 kernel/irq/affinity.c     | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)
 create mode 100644 kernel/irq/affinity.c

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..b6683f0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -278,6 +278,8 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
+struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs);
+
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -308,6 +310,12 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	return 0;
 }
+
+static inline struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
+{
+	*nr_vecs = 1;
+	return NULL;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 2ee42e9..1d3ee31 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
 obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
 obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
+obj-$(CONFIG_SMP) += affinity.o
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
new file mode 100644
index 0000000..f689593
--- /dev/null
+++ b/kernel/irq/affinity.c
@@ -0,0 +1,61 @@
+
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+
+static int get_first_sibling(unsigned int cpu)
+{
+	unsigned int ret;
+
+	ret = cpumask_first(topology_sibling_cpumask(cpu));
+	if (ret < nr_cpu_ids)
+		return ret;
+	return cpu;
+}
+
+/*
+ * Take a map of online CPUs and the number of available interrupt vectors
+ * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
+ * so that they are distributed as evenly as possible around the CPUs.  If
+ * more vectors than CPUs are available we'll map one to each CPU,
+ * otherwise we map one to the first sibling of each core.
+ *
+ * If there are more vectors than CPUs we will still only have one bit
+ * set per CPU, but interrupt code will keep on assigning the vectors from
+ * the start of the bitmap until we run out of vectors.
+ */
+struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
+{
+	struct cpumask *affinity_mask;
+	unsigned int max_vecs = *nr_vecs;
+
+	if (max_vecs == 1)
+		return NULL;
+
+	affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!affinity_mask) {
+		*nr_vecs = 1;
+		return NULL;
+	}
+
+	if (max_vecs >= num_online_cpus()) {
+		cpumask_copy(affinity_mask, cpu_online_mask);
+		*nr_vecs = num_online_cpus();
+	} else {
+		unsigned int vecs = 0, cpu;
+
+		for_each_online_cpu(cpu) {
+			if (cpu == get_first_sibling(cpu)) {
+				cpumask_set_cpu(cpu, affinity_mask);
+				vecs++;
+			}
+
+			if (--max_vecs == 0)
+				break;
+		}
+		*nr_vecs = vecs;
+	}
+
+	return affinity_mask;
+}
-- 
2.1.4


* [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 06/13] irq: add a helper to spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-06  8:05   ` Alexander Gordeev
  2016-07-04  8:39 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Add a function to allocate a range of interrupt vectors, which will
transparently use MSI-X and MSI if available or fall back to legacy
vectors.  The interrupts are available in a core-managed array in the
pci_dev structure, and can also be released using a similar function.
A new helper, __pci_enable_msix_range, is introduced to let the core
PCI code allocate the array of MSI-X descriptors sized to the exact
number of vectors the platform supports, and to help with the
automatic IRQ affinity assignment that will be added in the next commit.
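
As a sketch of the matching teardown order the new API implies (free_irq
on every vector before releasing the vectors; hypothetical foo names):

	static void foo_teardown_irqs(struct foo_dev *foo)
	{
		int i;

		for (i = 0; i < foo->nvec; i++)
			free_irq(pci_irq_vector(foo->pdev, i), foo);

		pci_free_irq_vectors(foo->pdev);
	}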

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/PCI/MSI-HOWTO.txt | 437 ++++------------------------------------
 drivers/pci/msi.c               |  93 +++++++++
 include/linux/pci.h             |  39 ++++
 3 files changed, 170 insertions(+), 399 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 1179850..35d1326 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -82,418 +82,57 @@ Most of the hard work is done for the driver in the PCI layer.  It simply
 has to request that the PCI layer set up the MSI capability for this
 device.
 
-4.2.1 pci_enable_msi
+To automatically use MSI or MSI-X interrupt vectors use the following
+function:
 
-int pci_enable_msi(struct pci_dev *dev)
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs, unsigned int flags)
 
-A successful call allocates ONE interrupt to the device, regardless
-of how many MSIs the device supports.  The device is switched from
-pin-based interrupt mode to MSI mode.  The dev->irq number is changed
-to a new number which represents the message signaled interrupt;
-consequently, this function should be called before the driver calls
-request_irq(), because an MSI is delivered via a vector that is
-different from the vector of a pin-based interrupt.
+This allocates up to max_vecs interrupt vectors for a PCI device.  It returns
+the number of vectors allocated or a negative error.  If the device has a
+requirement for a minimum number of vectors the driver can pass a min_vecs
+argument set to this limit, and the PCI core will return -ENOSPC if it can't
+meet the minimum number of vectors.
+The flags argument should normally be set to 0, but can be used to
+pass the PCI_IRQ_NOMSI and PCI_IRQ_NOMSIX flags in case a device claims
+to support MSI or MSI-X, but the support is broken.
 
-4.2.2 pci_enable_msi_range
+To get the Linux IRQ number for a given vector, which can then be passed
+to request_irq() and free_irq(), use the following function:
 
-int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
+int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
 
-This function allows a device driver to request any number of MSI
-interrupts within specified range from 'minvec' to 'maxvec'.
+Any allocated resources should be freed before removing the device using the
+following function:
 
-If this function returns a positive number it indicates the number of
-MSI interrupts that have been successfully allocated.  In this case
-the device is switched from pin-based interrupt mode to MSI mode and
-updates dev->irq to be the lowest of the new interrupts assigned to it.
-The other interrupts assigned to the device are in the range dev->irq
-to dev->irq + returned value - 1.  Device driver can use the returned
-number of successfully allocated MSI interrupts to further allocate
-and initialize device resources.
+void pci_free_irq_vectors(struct pci_dev *dev)
 
-If this function returns a negative number, it indicates an error and
-the driver should not attempt to request any more MSI interrupts for
-this device.
+If a device supports both MSI-X and MSI capabilities, this API will use
+the MSI-X facilities in preference to the MSI facilities.   MSI-X
+supports any number of interrupts between 1 and 2048. In contrast, MSI
+is restricted to a maximum of 32 interrupts (and must be a power of two).
+In addition, the MSI interrupt vectors must be allocated consecutively,
+so the system might not be able to allocate as many vectors for MSI as
+it could for MSI-X.  On some platforms, MSI interrupts must all be
+targeted at the same set of CPUs whereas MSI-X interrupts can all be
+targeted at different CPUs.
 
-This function should be called before the driver calls request_irq(),
-because MSI interrupts are delivered via vectors that are different
-from the vector of a pin-based interrupt.
+If a device supports neither MSI-X nor MSI it will fall back to a single
+legacy IRQ vector.
 
-It is ideal if drivers can cope with a variable number of MSI interrupts;
-there are many reasons why the platform may not be able to provide the
-exact number that a driver asks for.
+4.3 Legacy MSI / MSI-X APIs
 
-There could be devices that can not operate with just any number of MSI
-interrupts within a range.  See chapter 4.3.1.3 to get the idea how to
-handle such devices for MSI-X - the same logic applies to MSI.
+The old MSI or MSI-X specific APIs:
 
-4.2.1.1 Maximum possible number of MSI interrupts
+ pci_enable_msi, pci_enable_msi_range, pci_enable_msi_exact, pci_disable_msi,
+ pci_msi_vec_count, pci_enable_msix_range, pci_enable_msix_exact,
+ pci_disable_msix, pci_msix_vec_count
 
-The typical usage of MSI interrupts is to allocate as many vectors as
-possible, likely up to the limit returned by pci_msi_vec_count() function:
+should not be used in new code.
 
-static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
-{
-	return pci_enable_msi_range(pdev, 1, nvec);
-}
+4.4 Considerations when using MSIs
 
-Note the value of 'minvec' parameter is 1.  As 'minvec' is inclusive,
-the value of 0 would be meaningless and could result in error.
-
-Some devices have a minimal limit on number of MSI interrupts.
-In this case the function could look like this:
-
-static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
-{
-	return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
-}
-
-4.2.1.2 Exact number of MSI interrupts
-
-If a driver is unable or unwilling to deal with a variable number of MSI
-interrupts it could request a particular number of interrupts by passing
-that number to pci_enable_msi_range() function as both 'minvec' and 'maxvec'
-parameters:
-
-static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
-{
-	return pci_enable_msi_range(pdev, nvec, nvec);
-}
-
-Note, unlike pci_enable_msi_exact() function, which could be also used to
-enable a particular number of MSI-X interrupts, pci_enable_msi_range()
-returns either a negative errno or 'nvec' (not negative errno or 0 - as
-pci_enable_msi_exact() does).
-
-4.2.1.3 Single MSI mode
-
-The most notorious example of the request type described above is
-enabling the single MSI mode for a device.  It could be done by passing
-two 1s as 'minvec' and 'maxvec':
-
-static int foo_driver_enable_single_msi(struct pci_dev *pdev)
-{
-	return pci_enable_msi_range(pdev, 1, 1);
-}
-
-Note, unlike pci_enable_msi() function, which could be also used to
-enable the single MSI mode, pci_enable_msi_range() returns either a
-negative errno or 1 (not negative errno or 0 - as pci_enable_msi()
-does).
-
-4.2.3 pci_enable_msi_exact
-
-int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
-
-This variation on pci_enable_msi_range() call allows a device driver to
-request exactly 'nvec' MSIs.
-
-If this function returns a negative number, it indicates an error and
-the driver should not attempt to request any more MSI interrupts for
-this device.
-
-By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
-returns zero in case of success, which indicates MSI interrupts have been
-successfully allocated.
-
-4.2.4 pci_disable_msi
-
-void pci_disable_msi(struct pci_dev *dev)
-
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
-
-Before calling this function, a device driver must always call free_irq()
-on any interrupt for which it previously called request_irq().
-Failure to do so results in a BUG_ON(), leaving the device with
-MSI enabled and thus leaking its vector.
-
-4.2.4 pci_msi_vec_count
-
-int pci_msi_vec_count(struct pci_dev *dev)
-
-This function could be used to retrieve the number of MSI vectors the
-device requested (via the Multiple Message Capable register). The MSI
-specification only allows the returned value to be a power of two,
-up to a maximum of 2^5 (32).
-
-If this function returns a negative number, it indicates the device is
-not capable of sending MSIs.
-
-If this function returns a positive number, it indicates the maximum
-number of MSI interrupt vectors that could be allocated.
-
-4.3 Using MSI-X
-
-The MSI-X capability is much more flexible than the MSI capability.
-It supports up to 2048 interrupts, each of which can be controlled
-independently.  To support this flexibility, drivers must use an array of
-`struct msix_entry':
-
-struct msix_entry {
-	u16 	vector; /* kernel uses to write alloc vector */
-	u16	entry; /* driver uses to specify entry */
-};
-
-This allows for the device to use these interrupts in a sparse fashion;
-for example, it could use interrupts 3 and 1027 and yet allocate only a
-two-element array.  The driver is expected to fill in the 'entry' value
-in each element of the array to indicate for which entries the kernel
-should assign interrupts; it is invalid to fill in two entries with the
-same number.
-
-4.3.1 pci_enable_msix_range
-
-int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
-			  int minvec, int maxvec)
-
-Calling this function asks the PCI subsystem to allocate any number of
-MSI-X interrupts within specified range from 'minvec' to 'maxvec'.
-The 'entries' argument is a pointer to an array of msix_entry structs
-which should be at least 'maxvec' entries in size.
-
-On success, the device is switched into MSI-X mode and the function
-returns the number of MSI-X interrupts that have been successfully
-allocated.  In this case the 'vector' member in entries numbered from
-0 to the returned value - 1 is populated with the interrupt number;
-the driver should then call request_irq() for each 'vector' that it
-decides to use.  The device driver is responsible for keeping track of the
-interrupts assigned to the MSI-X vectors so it can free them again later.
-Device driver can use the returned number of successfully allocated MSI-X
-interrupts to further allocate and initialize device resources.
-
-If this function returns a negative number, it indicates an error and
-the driver should not attempt to allocate any more MSI-X interrupts for
-this device.
-
-This function, in contrast with pci_enable_msi_range(), does not adjust
-dev->irq.  The device will not generate interrupts for this interrupt
-number once MSI-X is enabled.
-
-Device drivers should normally call this function once per device
-during the initialization phase.
-
-It is ideal if drivers can cope with a variable number of MSI-X interrupts;
-there are many reasons why the platform may not be able to provide the
-exact number that a driver asks for.
-
-There could be devices that can not operate with just any number of MSI-X
-interrupts within a range.  E.g., an network adapter might need let's say
-four vectors per each queue it provides.  Therefore, a number of MSI-X
-interrupts allocated should be a multiple of four.  In this case interface
-pci_enable_msix_range() can not be used alone to request MSI-X interrupts
-(since it can allocate any number within the range, without any notion of
-the multiple of four) and the device driver should master a custom logic
-to request the required number of MSI-X interrupts.
-
-4.3.1.1 Maximum possible number of MSI-X interrupts
-
-The typical usage of MSI-X interrupts is to allocate as many vectors as
-possible, likely up to the limit returned by pci_msix_vec_count() function:
-
-static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
-{
-	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
-				     1, nvec);
-}
-
-Note the value of 'minvec' parameter is 1.  As 'minvec' is inclusive,
-the value of 0 would be meaningless and could result in error.
-
-Some devices have a minimal limit on number of MSI-X interrupts.
-In this case the function could look like this:
-
-static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
-{
-	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
-				     FOO_DRIVER_MINIMUM_NVEC, nvec);
-}
-
-4.3.1.2 Exact number of MSI-X interrupts
-
-If a driver is unable or unwilling to deal with a variable number of MSI-X
-interrupts it could request a particular number of interrupts by passing
-that number to pci_enable_msix_range() function as both 'minvec' and 'maxvec'
-parameters:
-
-static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
-{
-	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
-				     nvec, nvec);
-}
-
-Note, unlike pci_enable_msix_exact() function, which could be also used to
-enable a particular number of MSI-X interrupts, pci_enable_msix_range()
-returns either a negative errno or 'nvec' (not negative errno or 0 - as
-pci_enable_msix_exact() does).
-
-4.3.1.3 Specific requirements to the number of MSI-X interrupts
-
-As noted above, there could be devices that can not operate with just any
-number of MSI-X interrupts within a range.  E.g., let's assume a device that
-is only capable sending the number of MSI-X interrupts which is a power of
-two.  A routine that enables MSI-X mode for such device might look like this:
-
-/*
- * Assume 'minvec' and 'maxvec' are non-zero
- */
-static int foo_driver_enable_msix(struct foo_adapter *adapter,
-				  int minvec, int maxvec)
-{
-	int rc;
-
-	minvec = roundup_pow_of_two(minvec);
-	maxvec = rounddown_pow_of_two(maxvec);
-
-	if (minvec > maxvec)
-		return -ERANGE;
-
-retry:
-	rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
-				   maxvec, maxvec);
-	/*
-	 * -ENOSPC is the only error code allowed to be analyzed
-	 */
-	if (rc == -ENOSPC) {
-		if (maxvec == 1)
-			return -ENOSPC;
-
-		maxvec /= 2;
-
-		if (minvec > maxvec)
-			return -ENOSPC;
-
-		goto retry;
-	}
-
-	return rc;
-}
-
-Note how pci_enable_msix_range() return value is analyzed for a fallback -
-any error code other than -ENOSPC indicates a fatal error and should not
-be retried.
-
-4.3.2 pci_enable_msix_exact
-
-int pci_enable_msix_exact(struct pci_dev *dev,
-			  struct msix_entry *entries, int nvec)
-
-This variation on pci_enable_msix_range() call allows a device driver to
-request exactly 'nvec' MSI-Xs.
-
-If this function returns a negative number, it indicates an error and
-the driver should not attempt to allocate any more MSI-X interrupts for
-this device.
-
-By contrast with pci_enable_msix_range() function, pci_enable_msix_exact()
-returns zero in case of success, which indicates MSI-X interrupts have been
-successfully allocated.
-
-Another version of a routine that enables MSI-X mode for a device with
-specific requirements described in chapter 4.3.1.3 might look like this:
-
-/*
- * Assume 'minvec' and 'maxvec' are non-zero
- */
-static int foo_driver_enable_msix(struct foo_adapter *adapter,
-				  int minvec, int maxvec)
-{
-	int rc;
-
-	minvec = roundup_pow_of_two(minvec);
-	maxvec = rounddown_pow_of_two(maxvec);
-
-	if (minvec > maxvec)
-		return -ERANGE;
-
-retry:
-	rc = pci_enable_msix_exact(adapter->pdev,
-				   adapter->msix_entries, maxvec);
-
-	/*
-	 * -ENOSPC is the only error code allowed to be analyzed
-	 */
-	if (rc == -ENOSPC) {
-		if (maxvec == 1)
-			return -ENOSPC;
-
-		maxvec /= 2;
-
-		if (minvec > maxvec)
-			return -ENOSPC;
-
-		goto retry;
-	} else if (rc < 0) {
-		return rc;
-	}
-
-	return maxvec;
-}
-
-4.3.3 pci_disable_msix
-
-void pci_disable_msix(struct pci_dev *dev)
-
-This function should be used to undo the effect of pci_enable_msix_range().
-It frees the previously allocated MSI-X interrupts. The interrupts may
-subsequently be assigned to another device, so drivers should not cache
-the value of the 'vector' elements over a call to pci_disable_msix().
-
-Before calling this function, a device driver must always call free_irq()
-on any interrupt for which it previously called request_irq().
-Failure to do so results in a BUG_ON(), leaving the device with
-MSI-X enabled and thus leaking its vector.
-
-4.3.3 The MSI-X Table
-
-The MSI-X capability specifies a BAR and offset within that BAR for the
-MSI-X Table.  This address is mapped by the PCI subsystem, and should not
-be accessed directly by the device driver.  If the driver wishes to
-mask or unmask an interrupt, it should call disable_irq() / enable_irq().
-
-4.3.4 pci_msix_vec_count
-
-int pci_msix_vec_count(struct pci_dev *dev)
-
-This function could be used to retrieve number of entries in the device
-MSI-X table.
-
-If this function returns a negative number, it indicates the device is
-not capable of sending MSI-Xs.
-
-If this function returns a positive number, it indicates the maximum
-number of MSI-X interrupt vectors that could be allocated.
-
-4.4 Handling devices implementing both MSI and MSI-X capabilities
-
-If a device implements both MSI and MSI-X capabilities, it can
-run in either MSI mode or MSI-X mode, but not both simultaneously.
-This is a requirement of the PCI spec, and it is enforced by the
-PCI layer.  Calling pci_enable_msi_range() when MSI-X is already
-enabled or pci_enable_msix_range() when MSI is already enabled
-results in an error.  If a device driver wishes to switch between MSI
-and MSI-X at runtime, it must first quiesce the device, then switch
-it back to pin-interrupt mode, before calling pci_enable_msi_range()
-or pci_enable_msix_range() and resuming operation.  This is not expected
-to be a common operation but may be useful for debugging or testing
-during development.
-
-4.5 Considerations when using MSIs
-
-4.5.1 Choosing between MSI-X and MSI
-
-If your device supports both MSI-X and MSI capabilities, you should use
-the MSI-X facilities in preference to the MSI facilities.  As mentioned
-above, MSI-X supports any number of interrupts between 1 and 2048.
-In contrast, MSI is restricted to a maximum of 32 interrupts (and
-must be a power of two).  In addition, the MSI interrupt vectors must
-be allocated consecutively, so the system might not be able to allocate
-as many vectors for MSI as it could for MSI-X.  On some platforms, MSI
-interrupts must all be targeted at the same set of CPUs whereas MSI-X
-interrupts can all be targeted at different CPUs.
-
-4.5.2 Spinlocks
+4.4.1 Spinlocks
 
 Most device drivers have a per-device spinlock which is taken in the
 interrupt handler.  With pin-based interrupts or a single MSI, it is not
@@ -505,7 +144,7 @@ acquire the spinlock.  Such deadlocks can be avoided by using
 spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
 and acquire the lock (see Documentation/DocBook/kernel-locking).
 
-4.6 How to tell whether MSI/MSI-X is enabled on a device
+4.5 How to tell whether MSI/MSI-X is enabled on a device
 
 Using 'lspci -v' (as root) may show some devices with "MSI", "Message
 Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a080f44..6b0834d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -4,6 +4,7 @@
  *
  * Copyright (C) 2003-2004 Intel
  * Copyright (C) Tom Long Nguyen (tom.l.nguyen@intel.com)
+ * Copyright (C) 2016 Christoph Hellwig.
  */
 
 #include <linux/err.h>
@@ -1120,6 +1121,98 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
+static int __pci_enable_msix_range(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs)
+{
+	int vecs = max_vecs, ret, i;
+
+retry:
+	if (vecs < min_vecs)
+		return -ENOSPC;
+
+	dev->msix_vectors = kmalloc_array(vecs, sizeof(struct msix_entry),
+				GFP_KERNEL);
+	if (!dev->msix_vectors)
+		return -ENOMEM;
+
+	for (i = 0; i < vecs; i++)
+		dev->msix_vectors[i].entry = i;
+
+	ret = pci_enable_msix(dev, dev->msix_vectors, vecs);
+	if (ret)
+		goto out_fail;
+
+	return vecs;
+
+out_fail:
+	kfree(dev->msix_vectors);
+	dev->msix_vectors = NULL;
+
+	if (ret >= 0) {
+		/* retry with the actually supported number of vectors */
+		vecs = ret;
+		goto retry;
+	}
+
+	return ret;
+}
+
+/**
+ * pci_alloc_irq_vectors - allocate multiple IRQs for a device
+ * @dev:		PCI device to operate on
+ * @min_vecs:		minimum number of vectors required (must be >= 1)
+ * @max_vecs:		maximum (desired) number of vectors
+ * @flags:		flags or quirks for the allocation
+ *
+ * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI
+ * vectors if available, and fall back to a single legacy vector
+ * if neither is available.  Return the number of vectors allocated,
+ * (which might be smaller than @max_vecs) if successful, or a negative
+ * error code on error. If less than @min_vecs interrupt vectors are
+ * available for @dev the function will fail with -ENOSPC.
+ *
+ * To get the Linux irq number used for a vector that can be passed to
+ * request_irq use the pci_irq_vector() helper.
+ */
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs, unsigned int flags)
+{
+	int vecs;
+
+	if (!(flags & PCI_IRQ_NOMSIX)) {
+		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs);
+		if (vecs > 0)
+			return vecs;
+	}
+
+	if (!(flags & PCI_IRQ_NOMSI)) {
+		vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
+		if (vecs > 0)
+			return vecs;
+	}
+
+	/* use legacy irq if allowed */
+	if (min_vecs == 1)
+		return 1;
+	return -ENOSPC;
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors);
+
+/**
+ * pci_free_irq_vectors - free previously allocated IRQs for a device
+ * @dev:		PCI device to operate on
+ *
+ * Undoes the allocations and enabling in pci_alloc_irq_vectors().
+ */
+void pci_free_irq_vectors(struct pci_dev *dev)
+{
+	if (dev->msix_enabled)
+		kfree(dev->msix_vectors);
+	pci_disable_msix(dev);
+	pci_disable_msi(dev);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors);
+
 struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
 {
 	return to_pci_dev(desc->dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b67e4df..129871f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -320,6 +320,7 @@ struct pci_dev {
 	 * directly, use the values stored here. They might be different!
 	 */
 	unsigned int	irq;
+	struct msix_entry *msix_vectors;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
 
 	bool match_driver;		/* Skip attaching driver */
@@ -1237,6 +1238,9 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
 int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 		      unsigned int command_bits, u32 flags);
 
+#define PCI_IRQ_NOMSI		(1 << 0) /* don't try to use MSI interrupts */
+#define PCI_IRQ_NOMSIX		(1 << 1) /* don't try to use MSI-X interrupts */
+
 /* kmem_cache style wrapper around pci_alloc_consistent() */
 
 #include <linux/pci-dma.h>
@@ -1284,6 +1288,24 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 		return rc;
 	return 0;
 }
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs, unsigned int flags);
+void pci_free_irq_vectors(struct pci_dev *dev);
+
+
+/**
+ * pci_irq_vector - return Linux IRQ number of a device vector
+ * @dev: PCI device to operate on
+ * @nr: device-relative interrupt vector index (0-based).
+ */
+static inline unsigned int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
+{
+	if (dev->msix_enabled)
+		return dev->msix_vectors[nr].vector;
+
+	WARN_ON_ONCE(!dev->msi_enabled && nr > 0);
+	return dev->irq + nr;
+}
 #else
 static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
 static inline void pci_msi_shutdown(struct pci_dev *dev) { }
@@ -1307,6 +1329,23 @@ static inline int pci_enable_msix_range(struct pci_dev *dev,
 static inline int pci_enable_msix_exact(struct pci_dev *dev,
 		      struct msix_entry *entries, int nvec)
 { return -ENOSYS; }
+static inline int pci_alloc_irq_vectors(struct pci_dev *dev,
+		unsigned int min_vecs, unsigned int max_vecs,
+		unsigned int flags)
+{
+	if (min_vecs > 1)
+		return -ENOSPC;
+	return 1;
+}
+static inline void pci_free_irq_vectors(struct pci_dev *dev)
+{
+}
+
+static inline unsigned int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
+{
+	WARN_ON_ONCE(nr > 0);
+	return dev->irq;
+}
 #endif
 
 #ifdef CONFIG_PCIEPORTBUS
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-07 11:05   ` Alexander Gordeev
  2016-07-04  8:39 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Set the irq_affinity mask in the PCI device before allocating vectors so
that the affinity can be propagated through the MSI descriptor structures
to the core IRQ code.  Add a new helper, __pci_enable_msi_range, which is
similar to __pci_enable_msix_range introduced in the previous patch, so that
we can allocate the affinity mask in a self-contained fashion and for the
right number of vectors.

A new PCI_IRQ_NOAFFINITY flag is added to pci_alloc_irq_vectors so that
this function can also be used by drivers that don't wish to use the
automatic affinity assignment.
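
A minimal usage sketch (illustrative only: the foo_* names and
FOO_MAX_VECS are hypothetical, <linux/pci.h> and <linux/interrupt.h> are
assumed, and teardown via pci_free_irq_vectors() is omitted):

	static int foo_setup_irqs(struct pci_dev *pdev, void *foo_data)
	{
		int i, ret, nvec;

		/* try MSI-X first, then MSI, then a single legacy vector */
		nvec = pci_alloc_irq_vectors(pdev, 1, FOO_MAX_VECS,
					     PCI_IRQ_NOAFFINITY);
		if (nvec < 0)
			return nvec;

		for (i = 0; i < nvec; i++) {
			/* pci_irq_vector() maps vector index to Linux IRQ */
			ret = request_irq(pci_irq_vector(pdev, i), foo_irq,
					  IRQF_SHARED, "foo", foo_data);
			if (ret)
				return ret;
		}
		return nvec;
	}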

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/PCI/MSI-HOWTO.txt |  3 +-
 drivers/pci/msi.c               | 72 ++++++++++++++++++++++++++++++++++++++---
 include/linux/pci.h             |  2 ++
 3 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 35d1326..dcd3f6d 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -95,7 +95,8 @@ argument set to this limit, and the PCI core will return -ENOSPC if it can't
 meet the minimum number of vectors.
 The flags argument should normally be set to 0, but can be used to
 pass the PCI_IRQ_NOMSI and PCI_IRQ_NOMSIX flag in case a device claims
-to support MSI or MSI-X, but the support is broken.
+to support MSI or MSI-X, but the support is broken, or the PCI_IRQ_NOAFFINITY
+flag if the driver does not wish to use automatic affinity assignment.
 
 To get the Linux IRQ numbers passed to request_irq and free_irq
 and the vectors use the following function:
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 6b0834d..7f38e07 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -568,6 +568,7 @@ static struct msi_desc *msi_setup_entry(struct pci_dev *dev, int nvec)
 	entry->msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
 	entry->msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
 	entry->nvec_used		= nvec;
+	entry->affinity			= dev->irq_affinity;
 
 	if (control & PCI_MSI_FLAGS_64BIT)
 		entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
@@ -679,10 +680,18 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			      struct msix_entry *entries, int nvec)
 {
+	const struct cpumask *mask = NULL;
 	struct msi_desc *entry;
-	int i;
+	int cpu = -1, i;
 
 	for (i = 0; i < nvec; i++) {
+		if (dev->irq_affinity) {
+			cpu = cpumask_next(cpu, dev->irq_affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(dev->irq_affinity);
+			mask = cpumask_of(cpu);
+		}
+
 		entry = alloc_msi_entry(&dev->dev);
 		if (!entry) {
 			if (!i)
@@ -699,6 +708,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 		entry->msi_attrib.default_irq	= dev->irq;
 		entry->mask_base		= base;
 		entry->nvec_used		= 1;
+		entry->affinity			= mask;
 
 		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
 	}
@@ -1121,8 +1131,53 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
+static int __pci_enable_msi_range(struct pci_dev *dev, int min_vecs, int max_vecs,
+		unsigned int flags)
+{
+	int vecs, ret;
+
+	if (!pci_msi_supported(dev, min_vecs))
+		return -EINVAL;
+
+	vecs = pci_msi_vec_count(dev);
+	if (vecs < 0)
+		return vecs;
+	if (vecs < min_vecs)
+		return -EINVAL;
+	if (vecs > max_vecs)
+		vecs = max_vecs;
+
+retry:
+	if (vecs < min_vecs)
+		return -ENOSPC;
+
+	if (!(flags & PCI_IRQ_NOAFFINITY)) {
+		dev->irq_affinity = irq_create_affinity_mask(&vecs);
+		if (vecs < min_vecs) {
+			ret = -ERANGE;
+			goto out_fail;
+		}
+	}
+
+	ret = msi_capability_init(dev, vecs);
+	if (ret)
+		goto out_fail;
+
+	return vecs;
+
+out_fail:
+	kfree(dev->irq_affinity);
+	if (ret >= 0) {
+		/* retry with the actually supported number of vectors */
+		vecs = ret;
+		goto retry;
+	}
+
+	return ret;
+}
+
 static int __pci_enable_msix_range(struct pci_dev *dev, unsigned int min_vecs,
-		unsigned int max_vecs)
+		unsigned int max_vecs, unsigned int flags)
 {
 	int vecs = max_vecs, ret, i;
 
@@ -1138,6 +1193,13 @@ retry:
 	for (i = 0; i < vecs; i++)
 		dev->msix_vectors[i].entry = i;
 
+	if (!(flags & PCI_IRQ_NOAFFINITY)) {
+		dev->irq_affinity = irq_create_affinity_mask(&vecs);
+		ret = -ENOSPC;
+		if (vecs < min_vecs)
+			goto out_fail;
+	}
+
 	ret = pci_enable_msix(dev, dev->msix_vectors, vecs);
 	if (ret)
 		goto out_fail;
@@ -1147,6 +1209,8 @@ retry:
 out_fail:
 	kfree(dev->msix_vectors);
 	dev->msix_vectors = NULL;
+	kfree(dev->irq_affinity);
+	dev->irq_affinity = NULL;
 
 	if (ret >= 0) {
 		/* retry with the actually supported number of vectors */
@@ -1180,13 +1244,13 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 	int vecs;
 
 	if (!(flags & PCI_IRQ_NOMSIX)) {
-		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs);
+		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs, flags);
 		if (vecs > 0)
 			return vecs;
 	}
 
 	if (!(flags & PCI_IRQ_NOMSI)) {
-		vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
+		vecs = __pci_enable_msi_range(dev, min_vecs, max_vecs, flags);
 		if (vecs > 0)
 			return vecs;
 	}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 129871f..6a64c54 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	 */
 	unsigned int	irq;
 	struct msix_entry *msix_vectors;
+	struct cpumask	*irq_affinity;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
 
 	bool match_driver;		/* Skip attaching driver */
@@ -1240,6 +1241,7 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 
 #define PCI_IRQ_NOMSI		(1 << 0) /* don't try to use MSI interrupts */
 #define PCI_IRQ_NOMSIX		(1 << 1) /* don't try to use MSI-X interrupts */
+#define PCI_IRQ_NOAFFINITY	(1 << 2) /* don't auto-assign affinity */
 
 /* kmem_cache style wrapper around pci_alloc_consistent() */
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Currently blk-mq will totally remap hardware contexts when a CPU hotplug
event happens, which causes major havoc for drivers, as they are never
told about this remapping.  E.g. any carefully sorted out CPU affinity
will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is.  If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simply retain the queue mapping for
a soft offlined CPU that will work when it comes back online, and will
map any newly onlined CPU to queue 0 until the driver initiates
a rebuild of the queue map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc7166d..f972d32 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2114,8 +2114,6 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
-
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
 	 * we should change hctx numa_node according to new topology (this
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

The mapping is identical for all queues in a tag_set, so stop wasting
memory building multiple copies.  Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering from the additional indirection.  The same
would apply to the mq_ops pointer as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 22 ++++++++++++++--------
 include/linux/blk-mq.h |  1 +
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f972d32..622cb22 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1930,7 +1930,6 @@ void blk_mq_release(struct request_queue *q)
 		kfree(hctx);
 	}
 
-	kfree(q->mq_map);
 	q->mq_map = NULL;
 
 	kfree(q->queue_hw_ctx);
@@ -2029,9 +2028,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = blk_mq_make_queue_map(set);
-	if (!q->mq_map)
-		goto err_map;
+	q->mq_map = set->mq_map;
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
@@ -2081,8 +2078,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	return q;
 
 err_hctxs:
-	kfree(q->mq_map);
-err_map:
 	kfree(q->queue_hw_ctx);
 err_percpu:
 	free_percpu(q->queue_ctx);
@@ -2304,14 +2299,22 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
+	set->mq_map = blk_mq_make_queue_map(set);
+	if (!set->mq_map)
+		goto out_free_tags;
+
 	if (blk_mq_alloc_rq_maps(set))
-		goto enomem;
+		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
-enomem:
+
+out_free_mq_map:
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
 	return -ENOMEM;
@@ -2327,6 +2330,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 			blk_mq_free_rq_map(set, set->tags[i], i);
 	}
 
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+
 	kfree(set->tags);
 	set->tags = NULL;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2498fdf..0a3b138 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -65,6 +65,7 @@ struct blk_mq_hw_ctx {
 };
 
 struct blk_mq_tag_set {
+	unsigned int		*mq_map;
 	struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04  8:39 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask we will create it using the genirq helper.  As this helper was
modelled after the blk-mq algorithm there should be no change in
behavior.
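
A condensed sketch of the driver side (foo_* names are hypothetical and
most tag_set fields are omitted), assuming the PCI core populated
pdev->irq_affinity via pci_alloc_irq_vectors():

	foo->tag_set.ops = &foo_mq_ops;
	foo->tag_set.nr_hw_queues = nvec;
	foo->tag_set.numa_node = dev_to_node(&pdev->dev);
	/* hand the interrupt spreading result to blk-mq */
	foo->tag_set.affinity_mask = pdev->irq_affinity;
	ret = blk_mq_alloc_tag_set(&foo->tag_set);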

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Makefile         |   2 +-
 block/blk-mq-cpumap.c  | 120 -------------------------------------------------
 block/blk-mq.c         |  68 +++++++++++++++++++++++++---
 block/blk-mq.h         |   8 ----
 include/linux/blk-mq.h |   1 +
 5 files changed, 65 insertions(+), 134 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o \
-			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..0000000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include <linux/kernel.h>
-#include <linux/threads.h>
-#include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <linux/blk-mq.h>
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
-{
-	return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-			    const struct cpumask *online_mask)
-{
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return 1;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
-
-		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
-		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
-		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-							queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
-	}
-
-	free_cpumask_var(cpus);
-	return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-	int i;
-
-	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
-			return local_memory_node(cpu_to_node(i));
-	}
-
-	return NUMA_NO_NODE;
-}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 622cb22..fdd8d3a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/interrupt.h>
 
 #include <trace/events/block.h>
 
@@ -1954,6 +1955,22 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+/*
+ * We have no quick way of doing reverse lookups. This is only used at
+ * queue init time, so runtime isn't important.
+ */
+static int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (index == mq_map[i])
+			return local_memory_node(cpu_to_node(i));
+	}
+
+	return NUMA_NO_NODE;
+}
+
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
@@ -2253,6 +2270,30 @@ struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
 }
 EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
 
+static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
+		const struct cpumask *affinity_mask)
+{
+	int queue = -1, cpu = 0;
+
+	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
+			GFP_KERNEL, set->numa_node);
+	if (!set->mq_map)
+		return -ENOMEM;
+
+	if (!affinity_mask)
+		return 0;	/* map all cpus to queue 0 */
+
+	/* If cpus are offline, map them to first hctx */
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, affinity_mask))
+			queue++;
+		if (queue > 0)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2261,6 +2302,8 @@ EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
+	int ret;
+
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
 	if (!set->nr_hw_queues)
@@ -2299,11 +2342,26 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
-	set->mq_map = blk_mq_make_queue_map(set);
-	if (!set->mq_map)
-		goto out_free_tags;
+	/*
+	 * Use the passed in affinity mask if the driver provided one.
+	 */
+	if (set->affinity_mask) {
+		ret = blk_mq_create_mq_map(set, set->affinity_mask);
+		if (!set->mq_map)
+			goto out_free_tags;
+	} else {
+		struct cpumask *affinity_mask;
+
+		affinity_mask = irq_create_affinity_mask(&set->nr_hw_queues);
+		ret = blk_mq_create_mq_map(set, affinity_mask);
+		kfree(affinity_mask);
+
+		if (!set->mq_map)
+			goto out_free_tags;
+	}
 
-	if (blk_mq_alloc_rq_maps(set))
+	ret = blk_mq_alloc_rq_maps(set);
+	if (ret)
 		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
@@ -2317,7 +2375,7 @@ out_free_mq_map:
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
-	return -ENOMEM;
+	return ret;
 }
 EXPORT_SYMBOL(blk_mq_alloc_tag_set);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..fe7e21f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -45,14 +45,6 @@ void blk_mq_enable_hotplug(void);
 void blk_mq_disable_hotplug(void);
 
 /*
- * CPU -> queue mappings
- */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-				   const struct cpumask *online_mask);
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-
-/*
  * sysfs helpers
  */
 extern int blk_mq_sysfs_register(struct request_queue *q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0a3b138..404cc86 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -75,6 +75,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-07 19:30   ` Alexander Gordeev
  2016-07-04  8:39 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
  2016-07-04 10:30 ` automatic interrupt affinity for MSI/MSI-X capable devices V3 Thomas Gleixner
  13 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interrupt affinity assignment.
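
The old MSI-X -> MSI -> legacy fallback ladder thus collapses into a
single call; condensed from the diff below (error handling trimmed):

	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues, 0);
	if (nr_io_queues <= 0)
		return -EIO;

	/* the queue's cq_vector doubles as the device vector index */
	ret = request_irq(pci_irq_vector(pdev, nvmeq->cq_vector), nvme_irq,
			  IRQF_SHARED, nvmeq->irqname, nvmeq);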

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 98 ++++++++++++++-----------------------------------
 1 file changed, 27 insertions(+), 71 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index dc39924..336a346 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -88,7 +88,6 @@ struct nvme_dev {
 	unsigned max_qid;
 	int q_depth;
 	u32 db_stride;
-	struct msix_entry *entry;
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct remove_work;
@@ -201,6 +200,11 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
 		nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
+static int nvmeq_irq(struct nvme_queue *nvmeq)
+{
+	return pci_irq_vector(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector);
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 				unsigned int hctx_idx)
 {
@@ -954,7 +958,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 		spin_unlock_irq(&nvmeq->q_lock);
 		return 1;
 	}
-	vector = nvmeq->dev->entry[nvmeq->cq_vector].vector;
+	vector = nvmeq_irq(nvmeq);
 	nvmeq->dev->online_queues--;
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -962,7 +966,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_stop_hw_queues(nvmeq->dev->ctrl.admin_q);
 
-	irq_set_affinity_hint(vector, NULL);
 	free_irq(vector, nvmeq);
 
 	return 0;
@@ -1069,15 +1072,14 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	return NULL;
 }
 
-static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-							const char *name)
+static int queue_request_irq(struct nvme_queue *nvmeq)
 {
 	if (use_threaded_interrupts)
-		return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-					nvme_irq_check, nvme_irq, IRQF_SHARED,
-					name, nvmeq);
-	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
-				IRQF_SHARED, name, nvmeq);
+		return request_threaded_irq(nvmeq_irq(nvmeq), nvme_irq_check,
+				nvme_irq, IRQF_SHARED, nvmeq->irqname, nvmeq);
+	else
+		return request_irq(nvmeq_irq(nvmeq), nvme_irq, IRQF_SHARED,
+				nvmeq->irqname, nvmeq);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
@@ -1108,7 +1110,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	if (result < 0)
 		goto release_cq;
 
-	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
+	result = queue_request_irq(nvmeq);
 	if (result < 0)
 		goto release_sq;
 
@@ -1228,7 +1230,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 		goto free_nvmeq;
 
 	nvmeq->cq_vector = 0;
-	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
+	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
 		goto free_nvmeq;
@@ -1376,7 +1378,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int result, i, vecs, nr_io_queues, size;
+	int result, nr_io_queues, size;
 
 	nr_io_queues = num_online_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1411,29 +1413,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	}
 
 	/* Deregister the admin queue's interrupt */
-	free_irq(dev->entry[0].vector, adminq);
+	free_irq(pci_irq_vector(pdev, 0), adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
 	 */
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
-
-	for (i = 0; i < nr_io_queues; i++)
-		dev->entry[i].entry = i;
-	vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
-	if (vecs < 0) {
-		vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
-		if (vecs < 0) {
-			vecs = 1;
-		} else {
-			for (i = 0; i < vecs; i++)
-				dev->entry[i].vector = i + pdev->irq;
-		}
-	}
+	pci_free_irq_vectors(pdev);
+	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues, 0);
+	if (nr_io_queues <= 0)
+		return -EIO;
+	dev->max_qid = nr_io_queues;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -1441,10 +1431,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * path to scale better, even if the receive path is limited by the
 	 * number of interrupts.
 	 */
-	nr_io_queues = vecs;
-	dev->max_qid = nr_io_queues;
 
-	result = queue_request_irq(dev, adminq, adminq->irqname);
+	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
 		goto free_queues;
@@ -1456,23 +1444,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	return result;
 }
 
-static void nvme_pci_post_scan(struct nvme_ctrl *ctrl)
-{
-	struct nvme_dev *dev = to_nvme_dev(ctrl);
-	struct nvme_queue *nvmeq;
-	int i;
-
-	for (i = 0; i < dev->online_queues; i++) {
-		nvmeq = dev->queues[i];
-
-		if (!nvmeq->tags || !(*nvmeq->tags))
-			continue;
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-					blk_mq_tags_cpumask(*nvmeq->tags));
-	}
-}
-
 static void nvme_del_queue_end(struct request *req, int error)
 {
 	struct nvme_queue *nvmeq = req->end_io_data;
@@ -1575,6 +1546,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.cmd_size = nvme_cmd_size(dev);
 		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 		dev->tagset.driver_data = dev;
+		dev->tagset.affinity_mask = to_pci_dev(dev->dev)->irq_affinity;
 
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
@@ -1614,15 +1586,9 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	 * interrupts. Pre-enable a single MSIX or MSI vec for setup. We'll
 	 * adjust this later.
 	 */
-	if (pci_enable_msix(pdev, dev->entry, 1)) {
-		pci_enable_msi(pdev);
-		dev->entry[0].vector = pdev->irq;
-	}
-
-	if (!dev->entry[0].vector) {
-		result = -ENODEV;
-		goto disable;
-	}
+	result = pci_alloc_irq_vectors(pdev, 1, 1, 0);
+	if (result < 0)
+		return result;
 
 	cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
 
@@ -1664,10 +1630,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
+	pci_free_irq_vectors(pdev);
 
 	if (pci_is_enabled(pdev)) {
 		pci_disable_pcie_error_reporting(pdev);
@@ -1736,7 +1699,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	if (dev->ctrl.admin_q)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 }
 
@@ -1879,7 +1841,6 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read64		= nvme_pci_reg_read64,
 	.reset_ctrl		= nvme_pci_reset_ctrl,
 	.free_ctrl		= nvme_pci_free_ctrl,
-	.post_scan		= nvme_pci_post_scan,
 	.submit_async_event	= nvme_pci_submit_async_event,
 };
 
@@ -1916,10 +1877,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
-							GFP_KERNEL, node);
-	if (!dev->entry)
-		goto free;
 	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
 							GFP_KERNEL, node);
 	if (!dev->queues)
@@ -1960,7 +1917,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_unmap(dev);
  free:
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 	return result;
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 13/13] nvme: remove the post_scan callout
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
@ 2016-07-04  8:39 ` Christoph Hellwig
  2016-07-04 10:30 ` automatic interrupt affinity for MSI/MSI-X capable devices V3 Thomas Gleixner
  13 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:39 UTC (permalink / raw)
  To: tglx, axboe; +Cc: agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

No need now that we don't have to reverse engineer the irq affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 3 ---
 drivers/nvme/host/nvme.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9d7cee4..b100cce 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1627,9 +1627,6 @@ static void nvme_scan_work(struct work_struct *work)
 	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 	mutex_unlock(&ctrl->namespaces_mutex);
 	kfree(id);
-
-	if (ctrl->ops->post_scan)
-		ctrl->ops->post_scan(ctrl);
 }
 
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 282421f..bb343b3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -150,7 +150,6 @@ struct nvme_ctrl_ops {
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 	int (*reset_ctrl)(struct nvme_ctrl *ctrl);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
-	void (*post_scan)(struct nvme_ctrl *ctrl);
 	void (*submit_async_event)(struct nvme_ctrl *ctrl, int aer_idx);
 };
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: automatic interrupt affinity for MSI/MSI-X capable devices V3
  2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2016-07-04  8:39 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
@ 2016-07-04 10:30 ` Thomas Gleixner
  13 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2016-07-04 10:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, agordeev, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, 4 Jul 2016, Christoph Hellwig wrote:

> This series enhances the irq and PCI code to allow spreading around MSI and
> MSI-X vectors so that they have per-cpu affinity if possible, or at least
> per-node.  For that it takes the algorithm from blk-mq, moves it to
> a common place, and makes it available through a vastly simplified PCI
> interrupt allocation API.  It then switches blk-mq to be able to pick up
> the queue mapping from the device if available, and demonstrates all this
> using the NVMe driver.
> 
> Compared to the last posting the core IRQ changes are stable and it would
> be great to get them merged int the tip tree.  The two PCI patches have
> been completely rewritten after feedback from Alexander, while the block
> changes have also been stable.

I applied the first 6 patches. They are available in a separate branch:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/for-block

and integrated into the irq/core branch. irq/for-block can be pulled from the
block folks if required.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
  2016-07-04  8:39 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
@ 2016-07-04 10:32   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-07-04 10:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, bart.vanassche, mingo, tglx, linux-kernel, hch

Commit-ID:  b6140914fd079e43ea75a53429b47128584f033a
Gitweb:     http://git.kernel.org/tip/b6140914fd079e43ea75a53429b47128584f033a
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 4 Jul 2016 17:39:22 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:12 +0200

genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP

No user and we definitely don't want to grow one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-2-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/msi.h | 6 ++----
 kernel/irq/msi.c    | 8 ++------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8b425c6..c33abfa 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -264,12 +264,10 @@ enum {
 	 * callbacks.
 	 */
 	MSI_FLAG_USE_DEF_CHIP_OPS	= (1 << 1),
-	/* Build identity map between hwirq and irq */
-	MSI_FLAG_IDENTITY_MAP		= (1 << 2),
 	/* Support multiple PCI MSI interrupts */
-	MSI_FLAG_MULTI_PCI_MSI		= (1 << 3),
+	MSI_FLAG_MULTI_PCI_MSI		= (1 << 2),
 	/* Support PCI MSIX interrupts */
-	MSI_FLAG_PCI_MSIX		= (1 << 4),
+	MSI_FLAG_PCI_MSIX		= (1 << 3),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 38e89ce..eb5bf2b 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -324,7 +324,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_domain_ops *ops = info->ops;
 	msi_alloc_info_t arg;
 	struct msi_desc *desc;
-	int i, ret, virq = -1;
+	int i, ret, virq;
 
 	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
 	if (ret)
@@ -332,12 +332,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	for_each_msi_entry(desc, dev) {
 		ops->set_desc(&arg, desc);
-		if (info->flags & MSI_FLAG_IDENTITY_MAP)
-			virq = (int)ops->get_hwirq(info, &arg);
-		else
-			virq = -1;
 
-		virq = __irq_domain_alloc_irqs(domain, virq, desc->nvec_used,
+		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false);
 		if (virq < 0) {
 			ret = -ENOSPC;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-07-04  8:39 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
@ 2016-07-04 10:32   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-07-04 10:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, mingo, tglx, linux-kernel, hch

Commit-ID:  9c2555835bb3d34dfac52a0be943dcc4bedd650f
Gitweb:     http://git.kernel.org/tip/9c2555835bb3d34dfac52a0be943dcc4bedd650f
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 4 Jul 2016 17:39:23 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:13 +0200

genirq: Introduce IRQD_AFFINITY_MANAGED flag

Interrupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.
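
In practice a write to /proc/irq/<N>/smp_affinity for a managed interrupt
now fails with -EIO; the new check boils down to (condensed from the
patch below):

	if (!__irq_can_set_affinity(desc) ||
	    irqd_affinity_is_managed(&desc->irq_data))
		return -EIO;	/* userspace must not override managed affinity */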

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/irq.h    |  7 +++++++
 kernel/irq/internals.h |  2 ++
 kernel/irq/manage.c    | 21 ++++++++++++++++++---
 kernel/irq/proc.c      |  2 +-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 4d758a7..f607481 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -197,6 +197,7 @@ struct irq_data {
  * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
  * IRQD_WAKEUP_ARMED		- Wakeup mode armed
  * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
+ * IRQD_AFFINITY_MANAGED	- Affinity is auto-managed by the kernel
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -212,6 +213,7 @@ enum {
 	IRQD_IRQ_INPROGRESS		= (1 << 18),
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
+	IRQD_AFFINITY_MANAGED		= (1 << 21),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -305,6 +307,11 @@ static inline void irqd_clr_forwarded_to_vcpu(struct irq_data *d)
 	__irqd_to_state(d) &= ~IRQD_FORWARDED_TO_VCPU;
 }
 
+static inline bool irqd_affinity_is_managed(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 09be2c9..b15aa3b 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -105,6 +105,8 @@ static inline void unregister_handler_proc(unsigned int irq,
 					   struct irqaction *action) { }
 #endif
 
+extern bool irq_can_set_affinity_usr(unsigned int irq);
+
 extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask);
 
 extern void irq_set_thread_affinity(struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ef0bc02..30658e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -115,12 +115,12 @@ EXPORT_SYMBOL(synchronize_irq);
 #ifdef CONFIG_SMP
 cpumask_var_t irq_default_affinity;
 
-static int __irq_can_set_affinity(struct irq_desc *desc)
+static bool __irq_can_set_affinity(struct irq_desc *desc)
 {
 	if (!desc || !irqd_can_balance(&desc->irq_data) ||
 	    !desc->irq_data.chip || !desc->irq_data.chip->irq_set_affinity)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
 
 /**
@@ -134,6 +134,21 @@ int irq_can_set_affinity(unsigned int irq)
 }
 
 /**
+ * irq_can_set_affinity_usr - Check if affinity of an irq can be set from user space
+ * @irq:	Interrupt to check
+ *
+ * Like irq_can_set_affinity() above, but additionally checks for the
+ * AFFINITY_MANAGED flag.
+ */
+bool irq_can_set_affinity_usr(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	return __irq_can_set_affinity(desc) &&
+		!irqd_affinity_is_managed(&desc->irq_data);
+}
+
+/**
  *	irq_set_thread_affinity - Notify irq threads to adjust affinity
  *	@desc:		irq descriptor which has affitnity changed
  *
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4e1b947..40bdcdc 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -96,7 +96,7 @@ static ssize_t write_irq_affinity(int type, struct file *file,
 	cpumask_var_t new_value;
 	int err;
 
-	if (!irq_can_set_affinity(irq) || no_irq_affinity)
+	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
 		return -EIO;
 
 	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq: Add affinity hint to irq allocation
  2016-07-04  8:39 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
@ 2016-07-04 10:33   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-07-04 10:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hch, hpa, tglx, mingo

Commit-ID:  06ee6d571f0e350253a8fc3492316b2be007fae2
Gitweb:     http://git.kernel.org/tip/06ee6d571f0e350253a8fc3492316b2be007fae2
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 4 Jul 2016 17:39:24 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:13 +0200

genirq: Add affinity hint to irq allocation

Add an extra argument to the irq(domain) allocation functions, so we can hand
down affinity hints to the allocator. That's necessary to implement proper
support for multiqueue devices.
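
All existing callers pass NULL for the new argument, as the diff below
does throughout; a later patch can then hand down a real mask, e.g. a
sketch for the MSI path:

	virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
				       dev_to_node(dev), &arg, false,
				       desc->affinity);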

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/sparc/kernel/irq_64.c     |  2 +-
 arch/x86/kernel/apic/io_apic.c |  5 +++--
 include/linux/irq.h            |  4 ++--
 include/linux/irqdomain.h      |  9 ++++++---
 kernel/irq/ipi.c               |  2 +-
 kernel/irq/irqdesc.c           | 12 ++++++++----
 kernel/irq/irqdomain.c         | 22 ++++++++++++++--------
 kernel/irq/manage.c            |  7 ++++---
 kernel/irq/msi.c               |  3 ++-
 9 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index e22416c..34a7930 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
 {
 	int irq;
 
-	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
+	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, NULL);
 	if (irq <= 0)
 		goto out;
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 446702e..7c4f90d 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -981,7 +981,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,
 
 	return __irq_domain_alloc_irqs(domain, irq, 1,
 				       ioapic_alloc_attr_node(info),
-				       info, legacy);
+				       info, legacy, NULL);
 }
 
 /*
@@ -1014,7 +1014,8 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain,
 					  info->ioapic_pin))
 			return -ENOMEM;
 	} else {
-		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true);
+		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+					      NULL);
 		if (irq >= 0) {
 			irq_data = irq_domain_get_irq_data(domain, irq);
 			data = irq_data->chip_data;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f607481..39ce46a 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -708,11 +708,11 @@ static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		struct module *owner);
+		      struct module *owner, const struct cpumask *affinity);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
-	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index f1f36e0..1aee0fb 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -39,6 +39,7 @@ struct irq_domain;
 struct of_device_id;
 struct irq_chip;
 struct irq_data;
+struct cpumask;
 
 /* Number of irqs reserved for a legacy isa controller */
 #define NUM_ISA_INTERRUPTS	16
@@ -217,7 +218,8 @@ extern struct irq_domain *irq_find_matching_fwspec(struct irq_fwspec *fwspec,
 						   enum irq_domain_bus_token bus_token);
 extern void irq_set_default_host(struct irq_domain *host);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
-				  irq_hw_number_t hwirq, int node);
+				  irq_hw_number_t hwirq, int node,
+				  const struct cpumask *affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
@@ -389,7 +391,7 @@ static inline struct irq_domain *irq_domain_add_hierarchy(struct irq_domain *par
 
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 				   unsigned int nr_irqs, int node, void *arg,
-				   bool realloc);
+				   bool realloc, const struct cpumask *affinity);
 extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs);
 extern void irq_domain_activate_irq(struct irq_data *irq_data);
 extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
@@ -397,7 +399,8 @@ extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
 			unsigned int nr_irqs, int node, void *arg)
 {
-	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false);
+	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false,
+				       NULL);
 }
 
 extern int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 89b49f6..4fd2351 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -76,7 +76,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 8731e1c..b8df4fc 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -223,7 +223,7 @@ static void free_desc(unsigned int irq)
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
-		       struct module *owner)
+		       const struct cpumask *affinity, struct module *owner)
 {
 	struct irq_desc *desc;
 	int i;
@@ -333,6 +333,7 @@ static void free_desc(unsigned int irq)
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
+			      const struct cpumask *affinity,
 			      struct module *owner)
 {
 	u32 i;
@@ -453,12 +454,15 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @cnt:	Number of consecutive irqs to allocate.
  * @node:	Preferred node on which the irq descriptor should be allocated
  * @owner:	Owning module (can be NULL)
+ * @affinity:	Optional pointer to an affinity mask which hints where the
+ *		irq descriptors should be allocated and which default
+ *		affinities to use
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		  struct module *owner)
+		  struct module *owner, const struct cpumask *affinity)
 {
 	int start, ret;
 
@@ -494,7 +498,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
 	bitmap_set(allocated_irqs, start, cnt);
 	mutex_unlock(&sparse_irq_lock);
-	return alloc_descs(start, cnt, node, owner);
+	return alloc_descs(start, cnt, node, affinity, owner);
 
 err:
 	mutex_unlock(&sparse_irq_lock);
@@ -512,7 +516,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_alloc_hwirqs(int cnt, int node)
 {
-	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL);
+	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL, NULL);
 
 	if (irq < 0)
 		return 0;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 8798b6c..79459b7 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,7 +481,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node));
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -835,19 +835,23 @@ const struct irq_domain_ops irq_domain_simple_ops = {
 EXPORT_SYMBOL_GPL(irq_domain_simple_ops);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node)
+			   int node, const struct cpumask *affinity)
 {
 	unsigned int hint;
 
 	if (virq >= 0) {
-		virq = irq_alloc_descs(virq, virq, cnt, node);
+		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
+					 affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
-		virq = irq_alloc_descs_from(hint, cnt, node);
-		if (virq <= 0 && hint > 1)
-			virq = irq_alloc_descs_from(1, cnt, node);
+		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
+					 affinity);
+		if (virq <= 0 && hint > 1) {
+			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
+						 affinity);
+		}
 	}
 
 	return virq;
@@ -1160,6 +1164,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  * @node:	NUMA node id for memory allocation
  * @arg:	domain specific argument
  * @realloc:	IRQ descriptors have already been allocated if true
+ * @affinity:	Optional irq affinity mask for multiqueue devices
  *
  * Allocate IRQ numbers and initialized all data structures to support
  * hierarchy IRQ domains.
@@ -1175,7 +1180,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  */
 int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 			    unsigned int nr_irqs, int node, void *arg,
-			    bool realloc)
+			    bool realloc, const struct cpumask *affinity)
 {
 	int i, ret, virq;
 
@@ -1193,7 +1198,8 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 	if (realloc && irq_base >= 0) {
 		virq = irq_base;
 	} else {
-		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node);
+		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
+					      affinity);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 30658e9..ad0aac6 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -353,10 +353,11 @@ static int setup_affinity(struct irq_desc *desc, struct cpumask *mask)
 		return 0;
 
 	/*
-	 * Preserve an userspace affinity setup, but make sure that
-	 * one of the targets is online.
+	 * Preserve the managed affinity setting and an userspace affinity
+	 * setup, but make sure that one of the targets is online.
 	 */
-	if (irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
+	if (irqd_affinity_is_managed(&desc->irq_data) ||
+	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_common_data.affinity,
 				       cpu_online_mask))
 			set = desc->irq_common_data.affinity;
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index eb5bf2b..58dbbac 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -334,7 +334,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
-					       dev_to_node(dev), &arg, false);
+					       dev_to_node(dev), &arg, false,
+					       NULL);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq: Use affinity hint in irqdesc allocation
  2016-07-04  8:39 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
@ 2016-07-04 10:33   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-07-04 10:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hch, mingo, tglx, hpa

Commit-ID:  45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Gitweb:     http://git.kernel.org/tip/45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 4 Jul 2016 17:39:25 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:13 +0200

genirq: Use affinity hint in irqdesc allocation

Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator iterates
over the cpumask and, for each set CPU, allocates the descriptor on that
CPU's node and sets the initial affinity to that CPU.

If a single interrupt is allocated (MSI-X) then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.
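
The per-descriptor selection reduces to the loop below (condensed from
the patch; e.g. with affinity = {2,10} and cnt = 2, descriptor 0 is
allocated on CPU 2's node and descriptor 1 on CPU 10's node):

	int i, cpu = -1;

	for (i = 0; i < cnt; i++) {
		cpu = cpumask_next(cpu, affinity);
		if (cpu >= nr_cpu_ids)		/* wrap around */
			cpu = cpumask_first(affinity);
		node = cpu_to_node(cpu);
		/* multi-MSI: pin each desc to one CPU; single: whole mask */
		mask = cnt == 1 ? affinity : cpumask_of(cpu);
	}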

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/irq/irqdesc.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b8df4fc..a623b44 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -68,9 +68,13 @@ static int alloc_masks(struct irq_desc *desc, gfp_t gfp, int node)
 	return 0;
 }
 
-static void desc_smp_init(struct irq_desc *desc, int node)
+static void desc_smp_init(struct irq_desc *desc, int node,
+			  const struct cpumask *affinity)
 {
-	cpumask_copy(desc->irq_common_data.affinity, irq_default_affinity);
+	if (!affinity)
+		affinity = irq_default_affinity;
+	cpumask_copy(desc->irq_common_data.affinity, affinity);
+
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_clear(desc->pending_mask);
 #endif
@@ -82,11 +86,12 @@ static void desc_smp_init(struct irq_desc *desc, int node)
 #else
 static inline int
 alloc_masks(struct irq_desc *desc, gfp_t gfp, int node) { return 0; }
-static inline void desc_smp_init(struct irq_desc *desc, int node) { }
+static inline void
+desc_smp_init(struct irq_desc *desc, int node, const struct cpumask *affinity) { }
 #endif
 
 static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
-		struct module *owner)
+			      const struct cpumask *affinity, struct module *owner)
 {
 	int cpu;
 
@@ -107,7 +112,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
-	desc_smp_init(desc, node);
+	desc_smp_init(desc, node, affinity);
 }
 
 int nr_irqs = NR_IRQS;
@@ -158,7 +163,9 @@ void irq_unlock_sparse(void)
 	mutex_unlock(&sparse_irq_lock);
 }
 
-static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
+static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
+				   const struct cpumask *affinity,
+				   struct module *owner)
 {
 	struct irq_desc *desc;
 	gfp_t gfp = GFP_KERNEL;
@@ -178,7 +185,8 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
 	lockdep_set_class(&desc->lock, &irq_desc_lock_class);
 	init_rcu_head(&desc->rcu);
 
-	desc_set_defaults(irq, desc, node, owner);
+	desc_set_defaults(irq, desc, node, affinity, owner);
+	irqd_set(&desc->irq_data, flags);
 
 	return desc;
 
@@ -225,11 +233,30 @@ static void free_desc(unsigned int irq)
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		       const struct cpumask *affinity, struct module *owner)
 {
+	const struct cpumask *mask = NULL;
 	struct irq_desc *desc;
-	int i;
+	unsigned int flags;
+	int i, cpu = -1;
+
+	if (affinity && cpumask_empty(affinity))
+		return -EINVAL;
+
+	flags = affinity ? IRQD_AFFINITY_MANAGED : 0;
 
 	for (i = 0; i < cnt; i++) {
-		desc = alloc_desc(start + i, node, owner);
+		if (affinity) {
+			cpu = cpumask_next(cpu, affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(affinity);
+			node = cpu_to_node(cpu);
+
+			/*
+			 * For single allocations we use the caller provided
+			 * mask otherwise we use the mask of the target cpu
+			 */
+			mask = cnt == 1 ? affinity : cpumask_of(cpu);
+		}
+		desc = alloc_desc(start + i, node, flags, mask, owner);
 		if (!desc)
 			goto err;
 		mutex_lock(&sparse_irq_lock);
@@ -277,7 +304,7 @@ int __init early_irq_init(void)
 		nr_irqs = initcnt;
 
 	for (i = 0; i < initcnt; i++) {
-		desc = alloc_desc(i, node, NULL);
+		desc = alloc_desc(i, node, 0, NULL, NULL);
 		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
@@ -311,7 +338,7 @@ int __init early_irq_init(void)
 		alloc_masks(&desc[i], GFP_KERNEL, node);
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
-		desc_set_defaults(i, &desc[i], node, NULL);
+		desc_set_defaults(i, &desc[i], node, NULL, NULL);
 	}
 	return arch_early_irq_init();
 }
@@ -328,7 +355,7 @@ static void free_desc(unsigned int irq)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL);
+	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq/msi: Make use of affinity aware allocations
  2016-07-04  8:39 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
@ 2016-07-04 10:33   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-07-04 10:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hch, linux-kernel, tglx, hpa, mingo

Commit-ID:  0972fa57f53525ffa6ced12d703750fc2791e3ce
Gitweb:     http://git.kernel.org/tip/0972fa57f53525ffa6ced12d703750fc2791e3ce
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 4 Jul 2016 17:39:26 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:14 +0200

genirq/msi: Make use of affinity aware allocations

Allow the MSI code to provide affinity hints per MSI descriptor.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-6-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/msi.h | 2 ++
 kernel/irq/msi.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index c33abfa..4f0bfe5 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -47,6 +47,7 @@ struct fsl_mc_msi_desc {
  * @nvec_used:	The number of vectors used
  * @dev:	Pointer to the device which uses this descriptor
  * @msg:	The last set MSI message cached for reuse
+ * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
  *
  * @masked:	[PCI MSI/X] Mask bits
  * @is_msix:	[PCI MSI/X] True if MSI-X
@@ -67,6 +68,7 @@ struct msi_desc {
 	unsigned int			nvec_used;
 	struct device			*dev;
 	struct msi_msg			msg;
+	const struct cpumask		*affinity;
 
 	union {
 		/* PCI MSI/X specific data */
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 58dbbac..0e2a736 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -335,7 +335,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false,
-					       NULL);
+					       desc->affinity);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip:irq/core] genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
  2016-07-04  8:39 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
@ 2016-07-04 10:34   ` tip-bot for Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Christoph Hellwig @ 2016-07-04 10:34 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hch, tglx, linux-kernel, mingo, hpa

Commit-ID:  5e385a6ef31fbbf2acbda770aecc2bc2ff933d17
Gitweb:     http://git.kernel.org/tip/5e385a6ef31fbbf2acbda770aecc2bc2ff933d17
Author:     Christoph Hellwig <hch@lst.de>
AuthorDate: Mon, 4 Jul 2016 17:39:27 +0900
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 4 Jul 2016 12:25:14 +0200

genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors

This is lifted from the blk-mq code and adapted to use the affinity mask
concept just introduced in the irq handling code.  It tries to keep the
algorithm the same as the one currently used by blk-mq, but improvements
like assigning vectors on a per-node basis instead of just per sibling
are possible with this simple move and refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-7-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/interrupt.h |  8 +++++++
 kernel/irq/Makefile       |  1 +
 kernel/irq/affinity.c     | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..b6683f0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -278,6 +278,8 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
+struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs);
+
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -308,6 +310,12 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	return 0;
 }
+
+static inline struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
+{
+	*nr_vecs = 1;
+	return NULL;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 2ee42e9..1d3ee31 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
 obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
 obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
+obj-$(CONFIG_SMP) += affinity.o
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
new file mode 100644
index 0000000..f689593
--- /dev/null
+++ b/kernel/irq/affinity.c
@@ -0,0 +1,61 @@
+
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+
+static int get_first_sibling(unsigned int cpu)
+{
+	unsigned int ret;
+
+	ret = cpumask_first(topology_sibling_cpumask(cpu));
+	if (ret < nr_cpu_ids)
+		return ret;
+	return cpu;
+}
+
+/*
+ * Take a map of online CPUs and the number of available interrupt vectors
+ * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
+ * so that they are distributed as good as possible around the CPUs.  If
+ * more vectors than CPUs are available we'll map one to each CPU,
+ * otherwise we map one to the first sibling of each socket.
+ *
+ * If there are more vectors than CPUs we will still only have one bit
+ * set per CPU, but interrupt code will keep on assigning the vectors from
+ * the start of the bitmap until we run out of vectors.
+ */
+struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
+{
+	struct cpumask *affinity_mask;
+	unsigned int max_vecs = *nr_vecs;
+
+	if (max_vecs == 1)
+		return NULL;
+
+	affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!affinity_mask) {
+		*nr_vecs = 1;
+		return NULL;
+	}
+
+	if (max_vecs >= num_online_cpus()) {
+		cpumask_copy(affinity_mask, cpu_online_mask);
+		*nr_vecs = num_online_cpus();
+	} else {
+		unsigned int vecs = 0, cpu;
+
+		for_each_online_cpu(cpu) {
+			if (cpu == get_first_sibling(cpu)) {
+				cpumask_set_cpu(cpu, affinity_mask);
+				vecs++;
+			}
+
+			if (--max_vecs == 0)
+				break;
+		}
+		*nr_vecs = vecs;
+	}
+
+	return affinity_mask;
+}
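
As a usage illustration, a hypothetical caller might look like this
(sketch only; the subsystem-specific vector setup is elided):

#include <linux/interrupt.h>
#include <linux/slab.h>

static int foo_spread_vectors(unsigned int want)
{
	unsigned int nvecs = want;
	struct cpumask *mask;

	mask = irq_create_affinity_mask(&nvecs);
	/* nvecs may have been reduced, e.g. to the number of online CPUs */

	/* ... use mask to set up the MSI descriptors here ... */

	kfree(mask);	/* mask is NULL for the single-vector case */
	return nvecs;
}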

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-04  8:39 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-07-06  8:05   ` Alexander Gordeev
  2016-07-10  3:47     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-06  8:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 05:39:28PM +0900, Christoph Hellwig wrote:
> Add a function to allocate a range of interrupt vectors, which will
> transparently use MSI-X and MSI if available or fallback to legacy
> vectors.  The interrupts are available in a core managed array in the
> pci_dev structure, and can also be released using a similar function.
> A new helper, __pci_enable_msix_range, is introduced to allow allocating
> the array of msix descriptors in the core PCIe code at the exact number
> of vectors supported by the PCI host complex, and to also help with the
> automatic IRQ affinity assignment that will be added in the next commit.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  Documentation/PCI/MSI-HOWTO.txt | 437 ++++------------------------------------

[...]

> + pci_enable_msi, pci_enable_msi_range, pci_enable_msi_exact, pci_disable_msi,
> + pci_msi_vec_count, pci_enable_msix_range, pci_enable_msix_exact,
> + pci_disable_msix, pci_msix_vec_count

Descriptions of these functions can be removed once all drivers have
migrated to the new API. Also implementation descriptions + examples
would still be needed AFAICT.

[...]

> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -4,6 +4,7 @@
>   *
>   * Copyright (C) 2003-2004 Intel
>   * Copyright (C) Tom Long Nguyen (tom.l.nguyen@intel.com)
> + * Copyright (C) 2016 Christoph Hellwig.
>   */
>  
>  #include <linux/err.h>
> @@ -1120,6 +1121,98 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
>  }
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
> +static int __pci_enable_msix_range(struct pci_dev *dev, unsigned int min_vecs,
> +		unsigned int max_vecs)
> +{
> +	int vecs = max_vecs, ret, i;
> +
> +retry:
> +	if (vecs < min_vecs)
> +		return -ENOSPC;
> +
> +	dev->msix_vectors = kmalloc_array(vecs, sizeof(struct msix_entry),
> +				GFP_KERNEL);
> +	if (!dev->msix_vectors)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < vecs; i++)
> +		dev->msix_vectors[i].entry = i;
> +
> +	ret = pci_enable_msix(dev, dev->msix_vectors, vecs);
> +	if (ret)
> +		goto out_fail;
> +
> +	return vecs;
> +
> +out_fail:
> +	kfree(dev->msix_vectors);
> +	dev->msix_vectors = NULL;
> +
> +	if (ret >= 0) {
> +		/* retry with the actually supported number of vectors */
> +		vecs = ret;
> +		goto retry;
> +	}
> +
> +	return ret;
> +}

This function's code almost matches the existing pci_enable_msix_range()
so pci_enable_msix_range() should be reworked instead IMHO.

I took a look at the existing code and the rework does not appear
necessary as __pci_enable_msix_range() could look like this:

static int __pci_enable_msix_range(struct pci_dev *dev, unsigned int min_vecs,
		unsigned int max_vecs)
{
	int ret, i;
	struct msix_entry *entries;

	entries = kmalloc_array(max_vecs, sizeof(entries[0]), GFP_KERNEL);
	if (!entries)
		return -ENOMEM;

	for (i = 0; i < max_vecs; i++)
		entries[i].entry = i;

	ret = pci_enable_msix_range(dev, entries, min_vecs, max_vecs);

	kfree(entries);

	return ret;
}


We do not need to keep the msix_entry array, since it is only needed for
the pci_irq_vector() function. But the same info could be retrieved from
msi_desc::irq.

> +/**
> + * pci_alloc_irq_vectors - allocate multiple IRQs for a device
> + * @dev:		PCI device to operate on
> + * @min_vecs:		minimum number of vectors required (must be >= 1)
> + * @max_vecs:		maximum (desired) number of vectors
> + * @flags:		flags or quirks for the allocation
> + *
> + * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI
> + * vectors if available, and fall back to a single legacy vector
> + * if neither is available.  Return the number of vectors allocated,
> + * (which might be smaller than @max_vecs) if successful, or a negative
> + * error code on error. If less than @min_vecs interrupt vectors are
> + * available for @dev the function will fail with -ENOSPC.
> + *
> + * To get the Linux irq number used for a vector that can be passed to
> + * request_irq use the pci_irq_vector() helper.
> + */
> +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> +		unsigned int max_vecs, unsigned int flags)
> +{
> +	int vecs;
> +
> +	if (!(flags & PCI_IRQ_NOMSIX)) {
> +		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs);
> +		if (vecs > 0)
> +			return vecs;
> +	}
> +
> +	if (!(flags & PCI_IRQ_NOMSI)) {
> +		vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> +		if (vecs > 0)
> +			return vecs;
> +	}
> +
> +	/* use legacy irq if allowed */
> +	if (min_vecs == 1)
> +		return 1;
> +	return -ENOSPC;

The original error code (in vecs) would be overridden with -ENOSPC here.

> +}
> +EXPORT_SYMBOL(pci_alloc_irq_vectors);
> +
> +/**
> + * pci_free_irq_vectors - free previously allocated IRQs for a device
> + * @dev:		PCI device to operate on
> + *
> + * Undoes the allocations and enabling in pci_alloc_irq_vectors().
> + */
> +void pci_free_irq_vectors(struct pci_dev *dev)
> +{
> +	if (dev->msix_enabled)
> +		kfree(dev->msix_vectors);
> +	pci_disable_msix(dev);
> +	pci_disable_msi(dev);
> +}
> +EXPORT_SYMBOL(pci_free_irq_vectors);
> +
>  struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
>  {
>  	return to_pci_dev(desc->dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b67e4df..129871f 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -320,6 +320,7 @@ struct pci_dev {
>  	 * directly, use the values stored here. They might be different!
>  	 */
>  	unsigned int	irq;
> +	struct msix_entry *msix_vectors;

As suggested above, it is not needed.

>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
>  
>  	bool match_driver;		/* Skip attaching driver */
> @@ -1237,6 +1238,9 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
>  int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  		      unsigned int command_bits, u32 flags);
>  
> +#define PCI_IRQ_NOMSI		(1 << 0) /* don't try to use MSI interrupts */
> +#define PCI_IRQ_NOMSIX		(1 << 1) /* don't try to use MSI-X interrupts */
> +
>  /* kmem_cache style wrapper around pci_alloc_consistent() */
>  
>  #include <linux/pci-dma.h>
> @@ -1284,6 +1288,24 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  		return rc;
>  	return 0;
>  }
> +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> +		unsigned int max_vecs, unsigned int flags);
> +void pci_free_irq_vectors(struct pci_dev *dev);
> +
> +
> +/**
> + * pci_irq_vector - return Linux IRQ number of a device vector
> + * @dev: PCI device to operate on
> + * @nr: device-relative interrupt vector index (0-based).
> + */
> +static inline unsigned int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
> +{
> +	if (dev->msix_enabled)
> +		return dev->msix_vectors[nr].vector;

So here you could search with for_each_pci_msi_entry() instead.
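
Roughly like this, I mean (untested sketch; the msix descriptors are
added to the device's msi list in entry order, so the nth one holds the
nth Linux irq):

	if (dev->msix_enabled) {
		struct msi_desc *entry;

		for_each_pci_msi_entry(entry, dev) {
			if (nr-- == 0)
				return entry->irq;
		}
	}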


> +	WARN_ON_ONCE(!dev->msi_enabled && nr > 0);
> +	return dev->irq + nr;

I think this function should check that the irq number exists and return
the vector number or -EINVAL;

> +}
>  #else
>  static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
>  static inline void pci_msi_shutdown(struct pci_dev *dev) { }
> @@ -1307,6 +1329,23 @@ static inline int pci_enable_msix_range(struct pci_dev *dev,
>  static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  		      struct msix_entry *entries, int nvec)
>  { return -ENOSYS; }
> +static inline int pci_alloc_irq_vectors(struct pci_dev *dev,
> +		unsigned int min_vecs, unsigned int max_vecs,
> +		unsigned int flags)
> +{
> +	if (min_vecs > 1)
> +		return -ENOSPC;

In case CONFIG_PCI_MSI is unset, min_vecs > 1 should return -EINVAL;

> +	return 1;
> +}
> +static inline void pci_free_irq_vectors(struct pci_dev *dev)
> +{
> +}
> +
> +static inline unsigned int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
> +{
> +	WARN_ON_ONCE(nr > 0);

As suggested above, it should rather return -EINVAL;

> +	return dev->irq;
> +}
>  #endif
>  
>  #ifdef CONFIG_PCIEPORTBUS
> -- 
> 2.1.4
> 
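
For illustration, driver-side usage of the proposed interface might look
like this (hypothetical device data and handler names; error unwinding
abbreviated):

	int i, nvecs, ret;

	nvecs = pci_alloc_irq_vectors(pdev, 1, 8, 0);
	if (nvecs < 0)
		return nvecs;

	for (i = 0; i < nvecs; i++) {
		ret = request_irq(pci_irq_vector(pdev, i), foo_handler,
				  0, "foo", foo);
		if (ret)
			goto out_free_vectors;
	}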

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-07-04  8:39 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-07-07 11:05   ` Alexander Gordeev
  2016-07-10  3:57     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-07 11:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 05:39:29PM +0900, Christoph Hellwig wrote:
> Set the affinity_mask in the PCI device before allocating vectors so that
> the affinity can be propagated through the MSI descriptor structures to
> the core IRQ code.  Add a new helper __pci_enable_msi_range which is
> similar to __pci_enable_msix_range introduced in the last patch so that
> we can allocate the affinity mask in a self-contained fashion and for the
> right number of vectors.
> 
> A new PCI_IRQ_NOAFFINITY flag is added to pci_alloc_irq_vectors so that
> this function can also be used by drivers that don't wish to use the
> automatic affinity assignment.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  Documentation/PCI/MSI-HOWTO.txt |  3 +-
>  drivers/pci/msi.c               | 72 ++++++++++++++++++++++++++++++++++++++---
>  include/linux/pci.h             |  2 ++
>  3 files changed, 72 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 35d1326..dcd3f6d 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -95,7 +95,8 @@ argument set to this limit, and the PCI core will return -ENOSPC if it can't
>  meet the minimum number of vectors.
>  The flags argument should normally be set to 0, but can be used to
>  pass the PCI_IRQ_NOMSI and PCI_IRQ_NOMSIX flag in case a device claims
> -to support MSI or MSI-X, but the support is broken.
> +to support MSI or MSI-X, but the support is broken, or to PCI_IRQ_NOAFFINITY
> +if the driver does not wish to use the automatic affinity assignment feature.
>  
>  To get the Linux IRQ numbers passed to request_irq and free_irq
>  and the vectors use the following function:
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 6b0834d..7f38e07 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -568,6 +568,7 @@ static struct msi_desc *msi_setup_entry(struct pci_dev *dev, int nvec)
>  	entry->msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
>  	entry->msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
>  	entry->nvec_used		= nvec;
> +	entry->affinity			= dev->irq_affinity;

irq_create_affinity_mask() bails out with no affinity in case of single
vector, but alloc_descs() (see below (*)) assigns the whole affinity
mask. It should be consistent instead.

Actually, I just realized pci_alloc_irq_vectors() should probably call
irq_create_affinity_mask() and handle it in a consistent way for all four
cases: MSI-X, multi-MSI, MSI and legacy.

Optionally, the three latter could be dropped for now so you could proceed
with NVMe.

>  	if (control & PCI_MSI_FLAGS_64BIT)
>  		entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
> @@ -679,10 +680,18 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
>  static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			      struct msix_entry *entries, int nvec)
>  {
> +	const struct cpumask *mask = NULL;
>  	struct msi_desc *entry;
> -	int i;
> +	int cpu = -1, i;
>  
>  	for (i = 0; i < nvec; i++) {
> +		if (dev->irq_affinity) {
> +			cpu = cpumask_next(cpu, dev->irq_affinity);
> +			if (cpu >= nr_cpu_ids)
> +				cpu = cpumask_first(dev->irq_affinity);
> +			mask = cpumask_of(cpu);
> +		}
> +

(*) In the future a 1:N IRQ vs CPU mapping is possible/desirable, so I suppose
this piece of code is worth a comment or, better, a separate function. In fact,
this algorithm already exists in alloc_descs(), which makes it even more
sensible to factor out:

	for (i = 0; i < cnt; i++) {
		if (affinity) {
			cpu = cpumask_next(cpu, affinity);
			if (cpu >= nr_cpu_ids)
				cpu = cpumask_first(affinity);
			node = cpu_to_node(cpu);

			/*
			 * For single allocations we use the caller provided
			 * mask otherwise we use the mask of the target cpu
			 */
			mask = cnt == 1 ? affinity : cpumask_of(cpu);
		}

		[...]

	}
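
Factored out, the shared part might look roughly like this (sketch only,
helper name invented):

static const struct cpumask *
irq_alloc_next_affinity(const struct cpumask *affinity, unsigned int cnt,
			int *cpu)
{
	*cpu = cpumask_next(*cpu, affinity);
	if (*cpu >= nr_cpu_ids)
		*cpu = cpumask_first(affinity);

	/* single allocation: use the caller's mask, else the target cpu's */
	return cnt == 1 ? affinity : cpumask_of(*cpu);
}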

>  		entry = alloc_msi_entry(&dev->dev);
>  		if (!entry) {
>  			if (!i)
> @@ -699,6 +708,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  		entry->msi_attrib.default_irq	= dev->irq;
>  		entry->mask_base		= base;
>  		entry->nvec_used		= 1;
> +		entry->affinity			= mask;
>  
>  		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
>  	}
> @@ -1121,8 +1131,53 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
>  }
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
> +static int __pci_enable_msi_range(struct pci_dev *dev, int min_vecs, int max_vecs,

So I won't comment on this function before the above is decided.

> +		unsigned int flags)
> +{
> +	int vecs, ret;
> +
> +	if (!pci_msi_supported(dev, min_vecs))
> +		return -EINVAL;
> +
> +	vecs = pci_msi_vec_count(dev);
> +	if (vecs < 0)
> +		return vecs;
> +	if (vecs < min_vecs)
> +		return -EINVAL;
> +	if (vecs > max_vecs)
> +		vecs = max_vecs;
> +
> +retry:
> +	if (vecs < min_vecs)
> +		return -ENOSPC;
> +
> +	if (!(flags & PCI_IRQ_NOAFFINITY)) {
> +		dev->irq_affinity = irq_create_affinity_mask(&vecs);
> +		if (vecs < min_vecs) {
> +			ret = -ERANGE;
> +			goto out_fail;
> +		}
> +	}
> +
> +	ret = msi_capability_init(dev, vecs);
> +	if (ret)
> +		goto out_fail;
> +
> +	return vecs;
> +
> +out_fail:
> +	kfree(dev->irq_affinity);
> +	if (ret >= 0) {
> +		/* retry with the actually supported number of vectors */
> +		vecs = ret;
> +		goto retry;
> +	}
> +
> +	return ret;
> +}
> +
>  static int __pci_enable_msix_range(struct pci_dev *dev, unsigned int min_vecs,
> -		unsigned int max_vecs)
> +		unsigned int max_vecs, unsigned int flags)
>  {
>  	int vecs = max_vecs, ret, i;
>  
> @@ -1138,6 +1193,13 @@ retry:
>  	for (i = 0; i < vecs; i++)
>  		dev->msix_vectors[i].entry = i;
>  
> +	if (!(flags & PCI_IRQ_NOAFFINITY)) {
> +		dev->irq_affinity = irq_create_affinity_mask(&vecs);
> +		ret = -ENOSPC;
> +		if (vecs < min_vecs)
> +			goto out_fail;
> +	}
> +
>  	ret = pci_enable_msix(dev, dev->msix_vectors, vecs);
>  	if (ret)
>  		goto out_fail;
> @@ -1147,6 +1209,8 @@ retry:
>  out_fail:
>  	kfree(dev->msix_vectors);
>  	dev->msix_vectors = NULL;
> +	kfree(dev->irq_affinity);
> +	dev->irq_affinity = NULL;
>  
>  	if (ret >= 0) {
>  		/* retry with the actually supported number of vectors */
> @@ -1180,13 +1244,13 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>  	int vecs;
>  
>  	if (!(flags & PCI_IRQ_NOMSIX)) {
> -		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs);
> +		vecs = __pci_enable_msix_range(dev, min_vecs, max_vecs, flags);
>  		if (vecs > 0)
>  			return vecs;
>  	}
>  
>  	if (!(flags & PCI_IRQ_NOMSI)) {
> -		vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> +		vecs = __pci_enable_msi_range(dev, min_vecs, max_vecs, flags);
>  		if (vecs > 0)
>  			return vecs;
>  	}
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 129871f..6a64c54 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>  	 */
>  	unsigned int	irq;
>  	struct msix_entry *msix_vectors;
> +	struct cpumask	*irq_affinity;
>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
>  
>  	bool match_driver;		/* Skip attaching driver */
> @@ -1240,6 +1241,7 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  
>  #define PCI_IRQ_NOMSI		(1 << 0) /* don't try to use MSI interrupts */
>  #define PCI_IRQ_NOMSIX		(1 << 1) /* don't try to use MSI-X interrupts */
> +#define PCI_IRQ_NOAFFINITY	(1 << 2) /* don't auto-assign affinity */
>  
>  /* kmem_cache style wrapper around pci_alloc_consistent() */
>  
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors
  2016-07-04  8:39 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
@ 2016-07-07 19:30   ` Alexander Gordeev
  2016-07-10  3:59     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-07 19:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 05:39:33PM +0900, Christoph Hellwig wrote:
> @@ -1575,6 +1546,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
>  		dev->tagset.cmd_size = nvme_cmd_size(dev);
>  		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>  		dev->tagset.driver_data = dev;
> +		dev->tagset.affinity_mask = to_pci_dev(dev->dev)->irq_affinity;
>  
>  		if (blk_mq_alloc_tag_set(&dev->tagset))
>  			return 0;

Are there any post-init uses of blk_mq_tag_set::affinity_mask other than
calling to blk_mq_alloc_tag_set()? If no, blk_mq_tag_set::affinity_mask
is redundant, since the mask could be passed as a parameter.
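
I.e. something along the lines of (just a signature sketch):

	int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set,
				 const struct cpumask *affinity_mask);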

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-06  8:05   ` Alexander Gordeev
@ 2016-07-10  3:47     ` Christoph Hellwig
  2016-07-11 10:43       ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-10  3:47 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jul 06, 2016 at 10:05:45AM +0200, Alexander Gordeev wrote:
> > + pci_enable_msi, pci_enable_msi_range, pci_enable_msi_exact, pci_disable_msi,
> > + pci_msi_vec_count, pci_enable_msix_range, pci_enable_msix_exact,
> > + pci_disable_msix, pci_msix_vec_count
> 
> Descriptions of these functions can be removed once all drivers have
> migrated to the new API. Also implementation descriptions + examples
> would still be needed AFAICT.

I disagree - if we deprecate functions the only thing that should
be mentioned is "don't use these".

> This function's code almost matches the existing pci_enable_msix_range()
> so pci_enable_msix_range() should be reworked instead IMHO.

That's what earlier versions of the code did.  However, due to the
fact that we want to avoid over-allocating the msix_vectors array
(minor) and get the vector count for the affinity mask right (major,
as pointed out by you last time) I had to move the allocations inside
the helpers that loop around the actual enablement.  I didn't want
to change the functions to a different version of the algorithm just
before removing them relatively soon.  But given the strong preference
for changing these simple functions instead of duplicating them I've
changed that patch to do that now.

> We do not need to keep the msix_entry array, since it is only needed for
> the pci_irq_vector() function. But the same info could be retrieved from
> msi_desc::irq.

Indeed.  Avoiding this allocation makes these interfaces quite a bit
simpler.  It requires a few prep patches, but I think it's definitely
worth it, so the next version will avoid the need for the msix_entry array.

> > +	/* use legacy irq if allowed */
> > +	if (min_vecs == 1)
> > +		return 1;
> > +	return -ENOSPC;
> 
> The original error code (in vecs) would be overridden with -ENOSPC here.

Ok, fixed.

> > +	WARN_ON_ONCE(!dev->msi_enabled && nr > 0);
> > +	return dev->irq + nr;
> 
> I think this function should check that the irq number exists and return
> the vector number or -EINVAL;

Ok, fixed.

> > +		unsigned int flags)
> > +{
> > +	if (min_vecs > 1)
> > +		return -ENOSPC;
> 
> In case CONFIG_PCI_MSI is unset, min_vecs > 1 should return -EINVAL;

Ok, fixed.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-07-07 11:05   ` Alexander Gordeev
@ 2016-07-10  3:57     ` Christoph Hellwig
  2016-07-12  6:49       ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-10  3:57 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Thu, Jul 07, 2016 at 01:05:01PM +0200, Alexander Gordeev wrote:
> irq_create_affinity_mask() bails out with no affinity in case of single
> vector, but alloc_descs() (see below (*)) assigns the whole affinity
> mask. It should be consistent instead.

I don't understand the comment.  If we only have one vector (of any
kind) there is no need to create an affinity mask, we'll leave the
interrupt to the existing irq balancing code.

> Actually, I just realized pci_alloc_irq_vectors() should probably call
> irq_create_affinity_mask() and handle it in a consistent way for all four
> cases: MSI-X, multi-MSI, MSI and legacy.

That's what the earlier versions did, but you correctly pointed out
that we should call irq_create_affinity_mask only after we have reduced
the number of vectors to the number that the bridges can route, i.e.
that we have to move it into the pci_enable_msi(x)_range main loop.

> Optionally, the three latter could be dropped for now so you could proceed
> with NVMe.

NVMe cares for all these cases at least in theory.

> (*) In the future a 1:N IRQ vs CPU mapping is possible/desirable, so I suppose
> this piece of code is worth a comment or, better, a separate function. In fact,
> this algorithm already exists in alloc_descs(), which makes it even more
> sensible to factor out:
> 
> 	for (i = 0; i < cnt; i++) {
> 		if (affinity) {
> 			cpu = cpumask_next(cpu, affinity);
> 			if (cpu >= nr_cpu_ids)
> 				cpu = cpumask_first(affinity);
> 			node = cpu_to_node(cpu);
> 
> 			/*
> 			 * For single allocations we use the caller provided
> 			 * mask otherwise we use the mask of the target cpu
> 			 */
> 			mask = cnt == 1 ? affinity : cpumask_of(cpu);
> 		}
> 
> 		[...]

While these two pieces of code look very similar there is an important
difference in why and how the mask is calculated.  In alloc_descs()
the cnt == 1 case is the MSI-X case, where the passed-in affinity is
that of the MSI-X descriptor, which covers a single vector.  In the MSI
case, where we have multiple vectors per descriptor, a different
affinity is assigned to each vector based on the single passed-in mask.
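
Concretely, with two MSI vectors in one descriptor and a spread mask of
{0,1}, alloc_descs() runs with cnt == 2 and assigns cpumask_of(0) and
cpumask_of(1) to the two vectors, while an MSI-X descriptor always has
cnt == 1 and keeps the per-descriptor mask the PCI code computed for it.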

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors
  2016-07-07 19:30   ` Alexander Gordeev
@ 2016-07-10  3:59     ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-10  3:59 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Thu, Jul 07, 2016 at 09:30:19PM +0200, Alexander Gordeev wrote:
> On Mon, Jul 04, 2016 at 05:39:33PM +0900, Christoph Hellwig wrote:
> > @@ -1575,6 +1546,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
> >  		dev->tagset.cmd_size = nvme_cmd_size(dev);
> >  		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> >  		dev->tagset.driver_data = dev;
> > +		dev->tagset.affinity_mask = to_pci_dev(dev->dev)->irq_affinity;
> >  
> >  		if (blk_mq_alloc_tag_set(&dev->tagset))
> >  			return 0;
> 
> Are there any post-init uses of blk_mq_tag_set::affinity_mask other than
> calling to blk_mq_alloc_tag_set()? If no, blk_mq_tag_set::affinity_mask
> is redundant, since the mask could be passed as a parameter.

We'll have to look at it in the block code when rebuilding the queue
topology.  This isn't currently done, but we'll need it rather
soon.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-10  3:47     ` Christoph Hellwig
@ 2016-07-11 10:43       ` Alexander Gordeev
  2016-07-12  9:13         ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-11 10:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Sun, Jul 10, 2016 at 05:47:37AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 06, 2016 at 10:05:45AM +0200, Alexander Gordeev wrote:
> > > + pci_enable_msi, pci_enable_msi_range, pci_enable_msi_exact, pci_disable_msi,
> > > + pci_msi_vec_count, pci_enable_msix_range, pci_enable_msix_exact,
> > > + pci_disable_msix, pci_msix_vec_count
> > 
> > Descriptions of these functions can be removed once all drivers have
> > migrated to the new API. Also implementation descriptions + examples
> > would still be needed AFAICT.
> 
> I disagree - if we deprecate functions the only thing that should
> be mentioned is "don't use these".

I will try to paraphrase myself. The new API deprecates pci_enable_msi*_range
functions, but I am not that sure about others. Certainly, pci_msi*_vec_count
and pci_enable_msi*_exact could have (and AFAIR do have) uses that can not be
covered by automatic initialization of pci_alloc_irq_vectors().

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-07-10  3:57     ` Christoph Hellwig
@ 2016-07-12  6:49       ` Alexander Gordeev
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-12  6:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Sun, Jul 10, 2016 at 05:57:51AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 07, 2016 at 01:05:01PM +0200, Alexander Gordeev wrote:
> > irq_create_affinity_mask() bails out with no affinity in case of single
> > vector, but alloc_descs() (see below (*)) assigns the whole affinity
> > mask. It should be consistent instead.
> 
> I don't understand the comment.  If we only have one vector (of any
> kind) there is no need to create an affinity mask, we'll leave the
> interrupt to the existing irq balancing code.

I got the impression the affinity mask is assigned differently, i.e. for
single MSI mode vs legacy mode. But I might be wrong here - I have to
look into the code again. Anyway, if that is the case it could be
addressed afterwards IMHO.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-11 10:43       ` Alexander Gordeev
@ 2016-07-12  9:13         ` Christoph Hellwig
  2016-07-12 12:46           ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-12  9:13 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Mon, Jul 11, 2016 at 12:43:41PM +0200, Alexander Gordeev wrote:
> > I disagree - if we deprecate functions the only thing that should
> > be mentioned is "don't use these".
> 
> I will try to paraphrase myself. The new API deprecates pci_enable_msi*_range
> functions, but I am not that sure about others. Certainly, pci_msi*_vec_count
> and pci_enable_msi*_exact could have (and AFAIR do have) uses that can not be
> covered by automatic initialization of pci_alloc_irq_vectors().

pci_enable_msi*_exact is the equivalent of pci_enable_msi*_range
with minvecs == maxvecs and treating any return value >= 0 as 0.

I've updated the documentation so that the old usage examples are kept
around, but now use pci_alloc_irq_vectors.  I've also added a more detailed
blurb on pci_msi*_vec_count - I think there is no need for them, but
if I'm proven wrong we'll have to add a pci_irq_vector_count that handles
all interrupt types later.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-07-12  9:13         ` Christoph Hellwig
@ 2016-07-12 12:46           ` Alexander Gordeev
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-12 12:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jul 12, 2016 at 11:13:00AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 11, 2016 at 12:43:41PM +0200, Alexander Gordeev wrote:
> > > I disagree - if we deprecate functions the only thing that should
> > > be mentioned is "don't use these".
> > 
> > I will try to paraphrase myself. The new API deprecates pci_enable_msi*_range
> > functions, but I am not that sure about others. Certainly, pci_msi*_vec_count
> > and pci_enable_msi*_exact could have (and AFAIR do have) uses that can not be
> > covered by automatic initialization of pci_alloc_irq_vectors().
> 
> pci_enable_msi*_exact is the equivalent of pci_enable_msi*_range
> with minvecs == maxvecs and treating any return value >= 0 as 0.

Right. And people asked explicitly to introduce these helpers when
range functions were introduced in the first place. Since there is a
handful of drivers that do use pci_enable_msi*_exact() I suppose a
need for them persists.

> I've updated the documentation so that the old usage examples are kept
> around, but now use pci_alloc_irq_vectors.  I've also added a more detailed
> blurb on pci_msi*_vec_count - I think there is no need for them, but
> if I'm proven wrong we'll have to add a pci_irq_vector_count that handles
> all interrupt types later.

I guess, it is up to Bjorn. But.

Your proposed pci_nr_irq_vectors() function (a) is not a replacement for
pci_msi*_vec_count() and (b) would be useless if I read its description
properly:

(a) The pci_msi*_vec_count() functions return the number of vectors reported
by a PCI device. It is a constant for the device and a driver may make
assumptions based on this number;

(b) A number returned by pci_nr_irq_vectors() is not guaranteed to match what
a following call to pci_alloc_irq_vectors() will return (since the number
of actually allocated vectors might change between the two calls).
Therefore, a value returned by pci_nr_irq_vectors() cannot be relied upon
for anything.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-10  3:41         ` Christoph Hellwig
@ 2016-07-12  6:42           ` Alexander Gordeev
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-12  6:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Sun, Jul 10, 2016 at 05:41:44AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 04, 2016 at 11:35:28AM +0200, Alexander Gordeev wrote:
> > > mq_map is initialized to zero already, so we don't really need the
> > > assignment for queue 0.  The reason why this check exists is because
> > > we start with queue = -1 and we never want to assignment -1 to mq_map.
> > 
> > Would this read better then?
> > 
> > 	int queue = 0;
> > 
> > 	...
> > 
> > 	/* If cpus are offline, map them to first hctx */
> > 	for_each_online_cpu(cpu) {
> > 		set->mq_map[cpu] = queue;
> > 		if (cpumask_test_cpu(cpu, affinity_mask))
> > 			queue++;
> 
> It would read better, but I don't think it's actually correct.
> We'd still assign the 'old' queue to the cpu that is set in the affinity
> mask.

To be honest, I fail to see a functional difference, but it is just
a nit anyway.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  9:35       ` Alexander Gordeev
@ 2016-07-10  3:41         ` Christoph Hellwig
  2016-07-12  6:42           ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-10  3:41 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 11:35:28AM +0200, Alexander Gordeev wrote:
> > mq_map is initialized to zero already, so we don't really need the
> > assignment for queue 0.  The reason why this check exists is because
> > we start with queue = -1 and we never want to assign -1 to mq_map.
> 
> Would this read better then?
> 
> 	int queue = 0;
> 
> 	...
> 
> 	/* If cpus are offline, map them to first hctx */
> 	for_each_online_cpu(cpu) {
> 		set->mq_map[cpu] = queue;
> 		if (cpumask_test_cpu(cpu, affinity_mask))
> 			queue++;

It would read better, but I don't think it's actually correct.
We'd still assign the 'old' queue to the cpu that is set in the affinity
mask.
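
For example, assuming online CPUs 0-3 and affinity_mask = {0,2} (i.e.
two hardware queues), the two loops diverge:

	current loop:   mq_map = [0, 0, 1, 1]
	suggested loop: mq_map = [0, 1, 1, 2]

i.e. the suggested variant would map CPU 3 to a nonexistent third queue.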

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  8:38     ` Christoph Hellwig
@ 2016-07-04  9:35       ` Alexander Gordeev
  2016-07-10  3:41         ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-04  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 10:38:49AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 04, 2016 at 10:15:41AM +0200, Alexander Gordeev wrote:
> > On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> > > +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> > > +		const struct cpumask *affinity_mask)
> > > +{
> > > +	int queue = -1, cpu = 0;
> > > +
> > > +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> > > +			GFP_KERNEL, set->numa_node);
> > > +	if (!set->mq_map)
> > > +		return -ENOMEM;
> > > +
> > > +	if (!affinity_mask)
> > > +		return 0;	/* map all cpus to queue 0 */
> > > +
> > > +	/* If cpus are offline, map them to first hctx */
> > > +	for_each_online_cpu(cpu) {
> > > +		if (cpumask_test_cpu(cpu, affinity_mask))
> > > +			queue++;
> > 
> > CPUs missing in an affinity mask are mapped to hctxs. Is that intended?
> 
> Yes - each CPU needs to be mapped to some hctx, otherwise we can't
> submit I/O from that CPU.
> 
> > > +		if (queue > 0)
> > 
> > Why this check?
> > 
> > > +			set->mq_map[cpu] = queue;
> 
> mq_map is initialized to zero already, so we don't really need the
> assignment for queue 0.  The reason why this check exists is because
> we start with queue = -1 and we never want to assign -1 to mq_map.

Would this read better then?

	int queue = 0;

	...

	/* If cpus are offline, map them to first hctx */
	for_each_online_cpu(cpu) {
		set->mq_map[cpu] = queue;
		if (cpumask_test_cpu(cpu, affinity_mask))
			queue++;
	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  8:15   ` Alexander Gordeev
@ 2016-07-04  8:38     ` Christoph Hellwig
  2016-07-04  9:35       ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:38 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 10:15:41AM +0200, Alexander Gordeev wrote:
> On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> > +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> > +		const struct cpumask *affinity_mask)
> > +{
> > +	int queue = -1, cpu = 0;
> > +
> > +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> > +			GFP_KERNEL, set->numa_node);
> > +	if (!set->mq_map)
> > +		return -ENOMEM;
> > +
> > +	if (!affinity_mask)
> > +		return 0;	/* map all cpus to queue 0 */
> > +
> > +	/* If cpus are offline, map them to first hctx */
> > +	for_each_online_cpu(cpu) {
> > +		if (cpumask_test_cpu(cpu, affinity_mask))
> > +			queue++;
> 
> CPUs missing in an affinity mask are mapped to hctxs. Is that intended?

Yes - each CPU needs to be mapped to some hctx, otherwise we can't
submit I/O from that CPU.

> > +		if (queue > 0)
> 
> Why this check?
> 
> > +			set->mq_map[cpu] = queue;

mq_map is initialized to zero already, so we don't really need the
assignment for queue 0.  The reason why this check exists is because
we start with queue = -1 and we never want to assign -1 to mq_map.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
@ 2016-07-04  8:15   ` Alexander Gordeev
  2016-07-04  8:38     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Gordeev @ 2016-07-04  8:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> +		const struct cpumask *affinity_mask)
> +{
> +	int queue = -1, cpu = 0;
> +
> +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> +			GFP_KERNEL, set->numa_node);
> +	if (!set->mq_map)
> +		return -ENOMEM;
> +
> +	if (!affinity_mask)
> +		return 0;	/* map all cpus to queue 0 */
> +
> +	/* If cpus are offline, map them to first hctx */
> +	for_each_online_cpu(cpu) {
> +		if (cpumask_test_cpu(cpu, affinity_mask))
> +			queue++;

CPUs missing in an affinity mask are mapped to hctxs. Is that intended?

> +		if (queue > 0)

Why this check?

> +			set->mq_map[cpu] = queue;
> +	}
> +
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-07-04  8:15   ` Alexander Gordeev
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask we will create it using the genirq helper.  As this helper was
modelled after the blk-mq algorithm there should be no change in
behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Makefile         |   2 +-
 block/blk-mq-cpumap.c  | 120 -------------------------------------------------
 block/blk-mq.c         |  72 ++++++++++++++++++++++++++---
 block/blk-mq.h         |   8 ----
 include/linux/blk-mq.h |   1 +
 5 files changed, 69 insertions(+), 134 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o \
-			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..0000000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include <linux/kernel.h>
-#include <linux/threads.h>
-#include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <linux/blk-mq.h>
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
-{
-	return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-			    const struct cpumask *online_mask)
-{
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return 1;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
-
-		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
-		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
-		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-							queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
-	}
-
-	free_cpumask_var(cpus);
-	return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-	int i;
-
-	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
-			return local_memory_node(cpu_to_node(i));
-	}
-
-	return NUMA_NO_NODE;
-}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 622cb22..6027a49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/interrupt.h>
 
 #include <trace/events/block.h>
 
@@ -1954,6 +1955,22 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+/*
+ * We have no quick way of doing reverse lookups. This is only used at
+ * queue init time, so runtime isn't important.
+ */
+static int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (index == mq_map[i])
+			return local_memory_node(cpu_to_node(i));
+	}
+
+	return NUMA_NO_NODE;
+}
+
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
@@ -2253,6 +2270,30 @@ struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
 }
 EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
 
+static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
+		const struct cpumask *affinity_mask)
+{
+	int queue = -1, cpu = 0;
+
+	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
+			GFP_KERNEL, set->numa_node);
+	if (!set->mq_map)
+		return -ENOMEM;
+
+	if (!affinity_mask)
+		return 0;	/* map all cpus to queue 0 */
+
+	/* If cpus are offline, map them to first hctx */
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, affinity_mask))
+			queue++;
+		if (queue > 0)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2261,6 +2302,8 @@ EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
+	int ret;
+
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
 	if (!set->nr_hw_queues)
@@ -2299,11 +2342,30 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
-	set->mq_map = blk_mq_make_queue_map(set);
-	if (!set->mq_map)
-		goto out_free_tags;
+	/*
+	 * Use the passed in affinity mask if the driver provided one.
+	 */
+	if (set->affinity_mask) {
+		ret = blk_mq_create_mq_map(set, set->affinity_mask);
+		if (!set->mq_map)
+			goto out_free_tags;
+	} else {
+		struct cpumask *affinity_mask;
 
-	if (blk_mq_alloc_rq_maps(set))
+		ret = irq_create_affinity_mask(&affinity_mask,
+					       &set->nr_hw_queues);
+		if (ret)
+			goto out_free_tags;
+
+		ret = blk_mq_create_mq_map(set, affinity_mask);
+		kfree(affinity_mask);
+
+		if (!set->mq_map)
+			goto out_free_tags;
+	}
+
+	ret = blk_mq_alloc_rq_maps(set);
+	if (ret)
 		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
@@ -2317,7 +2379,7 @@ out_free_mq_map:
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
-	return -ENOMEM;
+	return ret;
 }
 EXPORT_SYMBOL(blk_mq_alloc_tag_set);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..fe7e21f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -45,14 +45,6 @@ void blk_mq_enable_hotplug(void);
 void blk_mq_disable_hotplug(void);
 
 /*
- * CPU -> queue mappings
- */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-				   const struct cpumask *online_mask);
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-
-/*
  * sysfs helpers
  */
 extern int blk_mq_sysfs_register(struct request_queue *q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0a3b138..404cc86 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -75,6 +75,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread
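
For illustration, here is a minimal userspace sketch of the mapping rule
that blk_mq_create_mq_map() implements: each online CPU set in the
affinity mask starts a new hardware queue, the CPUs that follow it share
that queue, and CPUs before the first masked CPU (as well as offline
CPUs) keep the zero-initialized mapping to queue 0.  The 8-CPU layout
and the example mask below are made-up values; the kernel version walks
the real online-CPU mask instead.

#include <stdio.h>

#define NR_CPUS 8

int main(void)
{
	/* made-up affinity mask: vectors anchored at CPUs 0, 2, 4, 6 */
	const int affinity_mask[NR_CPUS] = { 1, 0, 1, 0, 1, 0, 1, 0 };
	unsigned int mq_map[NR_CPUS] = { 0 };	/* stands in for kzalloc */
	int queue = -1, cpu;

	/* same loop body as blk_mq_create_mq_map() above */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (affinity_mask[cpu])
			queue++;
		if (queue > 0)
			mq_map[cpu] = queue;
	}

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> hctx %u\n", cpu, mq_map[cpu]);
	return 0;
}

This prints cpu 0/1 -> hctx 0, cpu 2/3 -> hctx 1, cpu 4/5 -> hctx 2 and
cpu 6/7 -> hctx 3, i.e. each vector ends up serving its own CPU pair.
blk_mq_hw_queue_to_node() is just the reverse scan over the same table:
it finds the first CPU mapped to a given hctx and returns that CPU's
memory node, which is why it can afford to be a linear search at queue
init time.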
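
And a hypothetical driver-side sketch of the new hook: only the
affinity_mask field and blk_mq_alloc_tag_set() come from this patch;
the mydrv names, the queue depth and the source of the cpumask are
assumptions for illustration (in this series the mask would come from
the PCI vector spreading added in the earlier patches).

static int mydrv_init_tagset(struct mydrv_dev *dev,
			     struct cpumask *irq_affinity)
{
	struct blk_mq_tag_set *set = &dev->tagset;

	memset(set, 0, sizeof(*set));
	set->ops		= &mydrv_mq_ops;
	set->nr_hw_queues	= dev->nr_queues;
	set->queue_depth	= 64;
	set->numa_node		= dev_to_node(dev->dev);

	/*
	 * New in this patch: hand blk-mq the mask the IRQ layer used to
	 * spread the vectors, so mq_map is derived from it.  Leaving
	 * this NULL makes blk_mq_alloc_tag_set() fall back to
	 * irq_create_affinity_mask().
	 */
	set->affinity_mask = irq_affinity;

	return blk_mq_alloc_tag_set(set);
}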

end of thread, other threads:[~2016-07-12 12:41 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-04  8:39 automatic interrupt affinity for MSI/MSI-X capable devices V3 Christoph Hellwig
2016-07-04  8:39 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
2016-07-04 10:32   ` [tip:irq/core] genirq/msi: " tip-bot for Thomas Gleixner
2016-07-04  8:39 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
2016-07-04 10:32   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
2016-07-04  8:39 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
2016-07-04 10:33   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
2016-07-04  8:39 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
2016-07-04 10:33   ` [tip:irq/core] genirq: " tip-bot for Thomas Gleixner
2016-07-04  8:39 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
2016-07-04 10:33   ` [tip:irq/core] genirq/msi: " tip-bot for Thomas Gleixner
2016-07-04  8:39 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
2016-07-04 10:34   ` [tip:irq/core] genirq: Add a helper to " tip-bot for Christoph Hellwig
2016-07-04  8:39 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
2016-07-06  8:05   ` Alexander Gordeev
2016-07-10  3:47     ` Christoph Hellwig
2016-07-11 10:43       ` Alexander Gordeev
2016-07-12  9:13         ` Christoph Hellwig
2016-07-12 12:46           ` Alexander Gordeev
2016-07-04  8:39 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
2016-07-07 11:05   ` Alexander Gordeev
2016-07-10  3:57     ` Christoph Hellwig
2016-07-12  6:49       ` Alexander Gordeev
2016-07-04  8:39 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
2016-07-04  8:39 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
2016-07-04  8:39 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
2016-07-04  8:39 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
2016-07-07 19:30   ` Alexander Gordeev
2016-07-10  3:59     ` Christoph Hellwig
2016-07-04  8:39 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
2016-07-04 10:30 ` automatic interrupt affinity for MSI/MSI-X capable devices V3 Thomas Gleixner
  -- strict thread matches above, loose matches on Subject: below --
2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
2016-07-04  8:15   ` Alexander Gordeev
2016-07-04  8:38     ` Christoph Hellwig
2016-07-04  9:35       ` Alexander Gordeev
2016-07-10  3:41         ` Christoph Hellwig
2016-07-12  6:42           ` Alexander Gordeev

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).