linux-kernel.vger.kernel.org archive mirror
* RFC: automatic interrupt affinity for MSI/MSI-X capable devices
@ 2016-04-16  1:35 Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 1/8] device: Add irq affinity hint cpumask pointer Christoph Hellwig
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

This series enhances the irq and PCI code to allow spreading around MSI and
MSI-X vectors so that they have per-cpu affinity if possible, or at least
per-node.  For that it takes the algorithm from blk-mq, moves it to
a common place, and makes it available through a vastly simplified PCI
interrupt allocation API.  It then switches blk-mq to be able to pick up
the queue mapping from the device if available, and demonstrates all this
using the NVMe driver.

There is still some work to do, mostly related to handling PCI hotplug;
more details are in the individual patches.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/8] device: Add irq affinity hint cpumask pointer
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 2/8] genirq: Make use of dev->irq_affinity Christoph Hellwig
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

This optional cpumask will be used by the irq core code to optimize interrupt
allocation and affinity setup for multiqueue devices.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/device.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 002c597..0270103 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -741,6 +741,8 @@ struct device_dma_parameters {
  * @msi_list:	Hosts MSI descriptors
  * @msi_domain: The generic MSI domain this device is using.
  * @numa_node:	NUMA node this device is close to.
+ * @irq_affinity: Hint for irq affinities and descriptor allocation
+ *		  (optional).
  * @dma_mask:	Dma mask (if dma'ble device).
  * @coherent_dma_mask: Like dma_mask, but for alloc_coherent mapping as not all
  * 		hardware supports 64-bit addresses for consistent allocations
@@ -813,6 +815,8 @@ struct device {
 #ifdef CONFIG_NUMA
 	int		numa_node;	/* NUMA node this device is close to */
 #endif
+
+	struct cpumask	*irq_affinity;
 	u64		*dma_mask;	/* dma mask (if dma'able device) */
 	u64		coherent_dma_mask;/* Like dma_mask, but for
 					     alloc_coherent mappings as
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/8] genirq: Make use of dev->irq_affinity
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 1/8] device: Add irq affinity hint cpumask pointer Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 3/8] genirq: add a helper to spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

This allows optimized interrupt allocation and affinity settings for the
MSI-X interrupts of multiqueue devices.

If the device holds a pointer to a cpumask, then this mask is used to:
   - allocate the interrupt descriptor on the proper nodes
   - set the default interrupt affinity for the interrupt

The interrupts are excluded from balancing so the user space balancer cannot
screw with the settings which have been requested by the multiqueue driver.

It's not yet clear how we are going to deal with cpu offlining/onlining. Right
now the affinity will simply break during offline. One option to handle this is:

  If the cpu goes offline, then move the irq to a different cpu on the same
  node. If it's the last cpu on the node, or all remaining cpus on that node
  already have a queue, we "park" it and reuse it when cpus come online again.

XXX: currently this only works properly for MSI-X, not for MSI, because MSI
allocates a single msi_desc covering more than one vector.

Requested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reworked-by: Christoph Hellwig <hch@lst.de>
---
 arch/sparc/kernel/irq_64.c     |  2 +-
 arch/x86/kernel/apic/io_apic.c |  5 +++--
 include/linux/irq.h            |  4 ++--
 include/linux/irqdomain.h      |  8 ++++---
 kernel/irq/ipi.c               |  4 ++--
 kernel/irq/irqdesc.c           | 48 +++++++++++++++++++++++++++---------------
 kernel/irq/irqdomain.c         | 22 ++++++++++++-------
 kernel/irq/msi.c               | 11 ++++++++--
 8 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index e22416c..437d0f7 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
 {
 	int irq;
 
-	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
+	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, -1);
 	if (irq <= 0)
 		goto out;
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index fdb0fbf..54267ea 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -981,7 +981,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,
 
 	return __irq_domain_alloc_irqs(domain, irq, 1,
 				       ioapic_alloc_attr_node(info),
-				       info, legacy);
+				       info, legacy, -1);
 }
 
 /*
@@ -1014,7 +1014,8 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain,
 					  info->ioapic_pin))
 			return -ENOMEM;
 	} else {
-		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true);
+		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+					      -1);
 		if (irq >= 0) {
 			irq_data = irq_domain_get_irq_data(domain, irq);
 			data = irq_data->chip_data;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c4de623..27779d0 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -697,11 +697,11 @@ static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		struct module *owner);
+		      struct module *owner, int targetcpu);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
-	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, -1)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 2aed043..fa24663 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -215,7 +215,8 @@ extern struct irq_domain *irq_find_matching_fwnode(struct fwnode_handle *fwnode,
 						   enum irq_domain_bus_token bus_token);
 extern void irq_set_default_host(struct irq_domain *host);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
-				  irq_hw_number_t hwirq, int node);
+				  irq_hw_number_t hwirq, int node,
+				  int targetcpu);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
@@ -377,7 +378,7 @@ static inline struct irq_domain *irq_domain_add_hierarchy(struct irq_domain *par
 
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 				   unsigned int nr_irqs, int node, void *arg,
-				   bool realloc);
+				   bool realloc, int targetcpu);
 extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs);
 extern void irq_domain_activate_irq(struct irq_data *irq_data);
 extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
@@ -385,7 +386,8 @@ extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
 			unsigned int nr_irqs, int node, void *arg)
 {
-	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false);
+	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false,
+				       -1);
 }
 
 extern int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index c37f34b..c635e61 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -76,14 +76,14 @@ unsigned int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, -1);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return 0;
 	}
 
 	virq = __irq_domain_alloc_irqs(domain, virq, nr_irqs, NUMA_NO_NODE,
-				       (void *) dest, true);
+				       (void *) dest, true, -1);
 
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc hw irqs\n");
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 0ccd028..26d92fa 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -68,9 +68,17 @@ static int alloc_masks(struct irq_desc *desc, gfp_t gfp, int node)
 	return 0;
 }
 
-static void desc_smp_init(struct irq_desc *desc, int node)
+static void desc_smp_init(struct irq_desc *desc, int node, int targetcpu)
 {
-	cpumask_copy(desc->irq_common_data.affinity, irq_default_affinity);
+	if (targetcpu < 0) {
+		cpumask_copy(desc->irq_common_data.affinity,
+			     irq_default_affinity);
+	} else {
+		cpumask_copy(desc->irq_common_data.affinity,
+			     cpumask_of(targetcpu));
+		irq_settings_set_no_balancing(desc);
+		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
+	}
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_clear(desc->pending_mask);
 #endif
@@ -82,11 +90,11 @@ static void desc_smp_init(struct irq_desc *desc, int node)
 #else
 static inline int
 alloc_masks(struct irq_desc *desc, gfp_t gfp, int node) { return 0; }
-static inline void desc_smp_init(struct irq_desc *desc, int node) { }
+static inline void desc_smp_init(struct irq_desc *desc, int node, int targetcpu) { }
 #endif
 
-static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
-		struct module *owner)
+static void desc_set_defaults(unsigned int irq, struct irq_desc *desc,
+			      int node, struct module *owner, int targetcpu)
 {
 	int cpu;
 
@@ -107,7 +115,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
-	desc_smp_init(desc, node);
+	desc_smp_init(desc, node, targetcpu);
 }
 
 int nr_irqs = NR_IRQS;
@@ -158,7 +166,8 @@ void irq_unlock_sparse(void)
 	mutex_unlock(&sparse_irq_lock);
 }
 
-static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
+static struct irq_desc *alloc_desc(int irq, int node, struct module *owner,
+		int targetcpu)
 {
 	struct irq_desc *desc;
 	gfp_t gfp = GFP_KERNEL;
@@ -178,7 +187,7 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
 	lockdep_set_class(&desc->lock, &irq_desc_lock_class);
 	init_rcu_head(&desc->rcu);
 
-	desc_set_defaults(irq, desc, node, owner);
+	desc_set_defaults(irq, desc, node, owner, targetcpu);
 
 	return desc;
 
@@ -223,13 +232,16 @@ static void free_desc(unsigned int irq)
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
-		       struct module *owner)
+		       struct module *owner, int targetcpu)
 {
 	struct irq_desc *desc;
 	int i;
 
+	if (targetcpu != -1)
+		node = cpu_to_node(targetcpu);
+
 	for (i = 0; i < cnt; i++) {
-		desc = alloc_desc(start + i, node, owner);
+		desc = alloc_desc(start + i, node, owner, targetcpu);
 		if (!desc)
 			goto err;
 		mutex_lock(&sparse_irq_lock);
@@ -277,7 +289,7 @@ int __init early_irq_init(void)
 		nr_irqs = initcnt;
 
 	for (i = 0; i < initcnt; i++) {
-		desc = alloc_desc(i, node, NULL);
+		desc = alloc_desc(i, node, NULL, -1);
 		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
@@ -311,7 +323,7 @@ int __init early_irq_init(void)
 		alloc_masks(&desc[i], GFP_KERNEL, node);
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
-		desc_set_defaults(i, &desc[i], node, NULL);
+		desc_set_defaults(i, &desc[i], node, NULL, -1);
 	}
 	return arch_early_irq_init();
 }
@@ -328,12 +340,12 @@ static void free_desc(unsigned int irq)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL);
+	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, -1);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
-			      struct module *owner)
+			      struct module *owner, int targetcpu)
 {
 	u32 i;
 
@@ -453,12 +465,14 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @cnt:	Number of consecutive irqs to allocate.
  * @node:	Preferred node on which the irq descriptor should be allocated
  * @owner:	Owning module (can be NULL)
+ * @targetcpu:	CPU number where the irq descriptors should be allocated and
+ *		which should be used for the default affinity.  -1 if not used.
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		  struct module *owner)
+		  struct module *owner, int targetcpu)
 {
 	int start, ret;
 
@@ -494,7 +508,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
 	bitmap_set(allocated_irqs, start, cnt);
 	mutex_unlock(&sparse_irq_lock);
-	return alloc_descs(start, cnt, node, owner);
+	return alloc_descs(start, cnt, node, owner, targetcpu);
 
 err:
 	mutex_unlock(&sparse_irq_lock);
@@ -512,7 +526,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_alloc_hwirqs(int cnt, int node)
 {
-	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL);
+	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL, -1);
 
 	if (irq < 0)
 		return 0;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 3a519a0..af7a39f 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -483,7 +483,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node));
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), -1);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -839,19 +839,23 @@ const struct irq_domain_ops irq_domain_simple_ops = {
 EXPORT_SYMBOL_GPL(irq_domain_simple_ops);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node)
+			   int node, int targetcpu)
 {
 	unsigned int hint;
 
 	if (virq >= 0) {
-		virq = irq_alloc_descs(virq, virq, cnt, node);
+		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
+					 targetcpu);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
-		virq = irq_alloc_descs_from(hint, cnt, node);
-		if (virq <= 0 && hint > 1)
-			virq = irq_alloc_descs_from(1, cnt, node);
+		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
+					 targetcpu);
+		if (virq <= 0 && hint > 1) {
+			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
+						 targetcpu);
+		}
 	}
 
 	return virq;
@@ -1163,6 +1167,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  * @node:	NUMA node id for memory allocation
  * @arg:	domain specific argument
  * @realloc:	IRQ descriptors have already been allocated if true
+ * @targetcpu:	Optional target cpu for multiqueue devices (otherwise -1)
  *
  * Allocate IRQ numbers and initialized all data structures to support
  * hierarchy IRQ domains.
@@ -1178,7 +1183,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  */
 int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 			    unsigned int nr_irqs, int node, void *arg,
-			    bool realloc)
+			    bool realloc, int targetcpu)
 {
 	int i, ret, virq;
 
@@ -1196,7 +1201,8 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 	if (realloc && irq_base >= 0) {
 		virq = irq_base;
 	} else {
-		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node);
+		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
+					      targetcpu);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 38e89ce..3385729 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -324,7 +324,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_domain_ops *ops = info->ops;
 	msi_alloc_info_t arg;
 	struct msi_desc *desc;
-	int i, ret, virq = -1;
+	int i, ret, virq = -1, cpu = -1;
 
 	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
 	if (ret)
@@ -337,8 +337,15 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 		else
 			virq = -1;
 
+		if (dev->irq_affinity) {
+			cpu = cpumask_next(cpu, dev->irq_affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(dev->irq_affinity);
+		}
+
 		virq = __irq_domain_alloc_irqs(domain, virq, desc->nvec_used,
-					       dev_to_node(dev), &arg, false);
+					       dev_to_node(dev), &arg, false,
+					       cpu);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/8] genirq: add a helper to spread an affinity mask for MSI/MSI-X vectors
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 1/8] device: Add irq affinity hint cpumask pointer Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 2/8] genirq: Make use of dev->irq_affinity Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 4/8] genirq: add a helper to program the pre-set affinity mask into the controller Christoph Hellwig
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/interrupt.h | 10 +++++++++
 kernel/irq/Makefile       |  1 +
 kernel/irq/affinity.c     | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+)
 create mode 100644 kernel/irq/affinity.c

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..67bc1e1f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -278,6 +278,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int nr_vecs);
+
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -308,6 +311,13 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	return 0;
 }
+
+static inline int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int nr_vecs)
+{
+	*affinity_mask = NULL;
+	return 0;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 2ee42e9..1d3ee31 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
 obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
 obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
+obj-$(CONFIG_SMP) += affinity.o
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
new file mode 100644
index 0000000..ecb8915
--- /dev/null
+++ b/kernel/irq/affinity.c
@@ -0,0 +1,54 @@
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+
+static int get_first_sibling(unsigned int cpu)
+{
+	unsigned int ret;
+
+	ret = cpumask_first(topology_sibling_cpumask(cpu));
+	if (ret < nr_cpu_ids)
+		return ret;
+	return cpu;
+}
+
+/*
+ * Take a map of online CPUs and the number of available interrupt vectors
+ * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
+ * so that they are distributed as well as possible around the CPUs.  If
+ * more vectors than CPUs are available we'll map one to each CPU,
+ * otherwise we map one to the first sibling of each core.
+ *
+ * If there are more vectors than CPUs we will still only have one bit
+ * set per CPU, but the interrupt code will keep on assigning the vectors
+ * from the start of the bitmap until we run out of vectors.
+ */
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int nr_vecs)
+{
+	if (nr_vecs == 1) {
+		*affinity_mask = NULL;
+		return 0;
+	}
+
+	*affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!*affinity_mask)
+		return -ENOMEM;
+
+	if (nr_vecs >= num_online_cpus()) {
+		cpumask_copy(*affinity_mask, cpu_online_mask);
+	} else {
+		unsigned int cpu;
+
+		for_each_online_cpu(cpu) {
+			if (cpu == get_first_sibling(cpu))
+				cpumask_set_cpu(cpu, *affinity_mask);
+
+			if (--nr_vecs == 0)
+				break;
+		}
+	}
+
+	return 0;
+}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/8] genirq: add a helper to program the pre-set affinity mask into the controller
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-04-16  1:35 ` [PATCH 3/8] genirq: add a helper to spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 5/8] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/interrupt.h |  2 ++
 kernel/irq/manage.c       | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 67bc1e1f..ae345da 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -713,4 +713,6 @@ extern char __softirqentry_text_end[];
 #define __softirq_entry
 #endif
 
+void irq_program_affinity(unsigned int irq);
+
 #endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index cc1cc64..02552a4 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -240,6 +240,20 @@ int __irq_set_affinity(unsigned int irq, const struct cpumask *mask, bool force)
 	return ret;
 }
 
+void irq_program_affinity(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	unsigned long flags;
+
+	if (WARN_ON_ONCE(!desc))
+		return;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	irq_set_affinity_locked(data, desc->irq_common_data.affinity, false);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
 int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 {
 	unsigned long flags;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/8] blk-mq: allow the driver to pass in an affinity mask
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
                   ` (3 preceding siblings ...)
  2016-04-16  1:35 ` [PATCH 4/8] genirq: add a helper to program the pre-set affinity mask into the controller Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 6/8] pci: provide sensible irq vector alloc/free routines Christoph Hellwig
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask we will create it using the genirq helper.  As this helper was
modelled after the blk-mq algorithm, there should be no change in behavior.

XXX: Just as with the core IRQ spreading code this doesn't handle CPU
hotplug yet.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Makefile         |   2 +-
 block/blk-mq-cpumap.c  | 120 -------------------------------------------------
 block/blk-mq.c         |  60 ++++++++++++++++++++++++-
 block/blk-mq.h         |   8 ----
 include/linux/blk-mq.h |   1 +
 5 files changed, 60 insertions(+), 131 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o \
-			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..0000000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include <linux/kernel.h>
-#include <linux/threads.h>
-#include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <linux/blk-mq.h>
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
-{
-	return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-			    const struct cpumask *online_mask)
-{
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return 1;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
-
-		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
-		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
-		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-							queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
-	}
-
-	free_cpumask_var(cpus);
-	return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-	int i;
-
-	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
-			return local_memory_node(cpu_to_node(i));
-	}
-
-	return NUMA_NO_NODE;
-}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1699baf..ab156bb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/interrupt.h>
 
 #include <trace/events/block.h>
 
@@ -1954,6 +1955,22 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+/*
+ * We have no quick way of doing reverse lookups. This is only used at
+ * queue init time, so runtime isn't important.
+ */
+static int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (index == mq_map[i])
+			return local_memory_node(cpu_to_node(i));
+	}
+
+	return NUMA_NO_NODE;
+}
+
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
@@ -2013,6 +2030,30 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	blk_mq_sysfs_register(q);
 }
 
+static unsigned int *
+affinity_mask_to_queue_map(const struct cpumask *affinity_mask, int node)
+{
+	unsigned int *map;
+	int queue = -1, cpu = 0;
+
+	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL, node);
+	if (!map)
+		return NULL;
+
+	if (!affinity_mask)
+		return map;	/* map all cpus to queue 0 */
+
+	/* If cpus are offline, map them to first hctx */
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, affinity_mask))
+			queue++;
+		if (queue > 0)
+			map[cpu] = queue;
+	}
+
+	return map;
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
@@ -2028,7 +2069,21 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = blk_mq_make_queue_map(set);
+	if (set->affinity_mask) {
+		q->mq_map = affinity_mask_to_queue_map(set->affinity_mask,
+				set->numa_node);
+	} else {
+		struct cpumask *affinity_mask;
+		int ret;
+
+		ret = irq_create_affinity_mask(&affinity_mask, set->nr_hw_queues);
+		if (ret)
+			goto err_map;
+
+		q->mq_map = affinity_mask_to_queue_map(affinity_mask, set->numa_node);
+		kfree(affinity_mask);
+	}
+
 	if (!q->mq_map)
 		goto err_map;
 
@@ -2111,7 +2166,8 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
+//	XXX: figure out what to do about cpu hotplug in the new world order
+//	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
 
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..fe7e21f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -45,14 +45,6 @@ void blk_mq_enable_hotplug(void);
 void blk_mq_disable_hotplug(void);
 
 /*
- * CPU -> queue mappings
- */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-				   const struct cpumask *online_mask);
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-
-/*
  * sysfs helpers
  */
 extern int blk_mq_sysfs_register(struct request_queue *q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2498fdf..21ffe10 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -74,6 +74,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
                   ` (4 preceding siblings ...)
  2016-04-16  1:35 ` [PATCH 5/8] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-29 21:16   ` Bjorn Helgaas
  2016-04-16  1:35 ` [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
  2016-04-16  1:35 ` [PATCH 8/8] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
  7 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Hide all the MSI-X vs MSI vs legacy bullshit, and provide an array of
interrupt vectors in the pci_dev structure, and ensure we get proper
interrupt affinity by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/pci/irq.c   | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/msi.c   |  2 +-
 drivers/pci/pci.h   |  5 +++
 include/linux/pci.h |  5 +++
 4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index 6684f15..b683465 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -1,7 +1,8 @@
 /*
- * PCI IRQ failure handing code
+ * PCI IRQ handing code
  *
  * Copyright (c) 2008 James Bottomley <James.Bottomley@HansenPartnership.com>
+ * Copyright (c) 2016 Christoph Hellwig.
  */
 
 #include <linux/acpi.h>
@@ -9,6 +10,92 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/pci.h>
+#include <linux/interrupt.h>
+#include "pci.h"
+
+static int pci_nr_irq_vectors(struct pci_dev *pdev)
+{
+	int nr_entries;
+
+	nr_entries = pci_msix_vec_count(pdev);
+	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
+		nr_entries = pci_msi_vec_count(pdev);
+	if (nr_entries <= 0)
+		nr_entries = 1;
+	return nr_entries;
+}
+
+static int pci_enable_msix_range_wrapper(struct pci_dev *pdev, u32 *irqs,
+		int nr_vecs)
+{
+	struct msix_entry *msix_entries;
+	int vecs, i;
+
+	msix_entries = kcalloc(nr_vecs, sizeof(struct msix_entry), GFP_KERNEL);
+	if (!msix_entries)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_vecs; i++)
+		msix_entries[i].entry = i;
+
+	vecs = pci_enable_msix_range(pdev, msix_entries, 1, nr_vecs);
+	if (vecs > 0) {
+		for (i = 0; i < vecs; i++)
+			irqs[i] = msix_entries[i].vector;
+	}
+
+	kfree(msix_entries);
+	return vecs;
+}
+
+int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
+{
+	int vecs, ret, i;
+	u32 *irqs;
+
+	nr_vecs = min(nr_vecs, pci_nr_irq_vectors(pdev));
+
+	irqs = kcalloc(nr_vecs, sizeof(u32), GFP_KERNEL);
+	if (!irqs)
+		return -ENOMEM;
+
+	vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
+	if (vecs <= 0) {
+		vecs = pci_enable_msi_range(pdev, 1, min(nr_vecs, 32));
+		if (vecs <= 0) {
+			ret = -EIO;
+			if (!pdev->irq)
+				goto out_free_irqs;
+
+			/* use legacy irq */
+			vecs = 1;
+		}
+
+		for (i = 0; i < vecs; i++)
+			irqs[i] = pdev->irq + i;
+	}
+
+	pdev->irqs = irqs;
+	return vecs;
+
+out_free_irqs:
+	kfree(irqs);
+	return ret;
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors);
+
+void pci_free_irq_vectors(struct pci_dev *pdev)
+{
+	if (pdev->msi_enabled)
+		pci_disable_msi(pdev);
+	else if (pdev->msix_enabled)
+		pci_disable_msix(pdev);
+
+	kfree(pdev->dev.irq_affinity);
+	pdev->dev.irq_affinity = NULL;
+	kfree(pdev->irqs);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors);
 
 static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
 {
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a080f44..544d306 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -815,7 +815,7 @@ out_free:
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 1, else return 0.
  **/
-static int pci_msi_supported(struct pci_dev *dev, int nvec)
+int pci_msi_supported(struct pci_dev *dev, int nvec)
 {
 	struct pci_bus *bus;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d0fb934..263422c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -144,8 +144,13 @@ extern unsigned int pci_pm_d3_delay;
 
 #ifdef CONFIG_PCI_MSI
 void pci_no_msi(void);
+int pci_msi_supported(struct pci_dev *dev, int nvec);
 #else
 static inline void pci_no_msi(void) { }
+static inline int pci_msi_supported(struct pci_dev *dev, int nvec)
+{
+	return 0;
+}
 #endif
 
 static inline void pci_msi_set_enable(struct pci_dev *dev, int enable)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 004b813..4fbc14f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -322,6 +322,7 @@ struct pci_dev {
 	 * directly, use the values stored here. They might be different!
 	 */
 	unsigned int	irq;
+	unsigned int	*irqs;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
 
 	bool match_driver;		/* Skip attaching driver */
@@ -1235,6 +1236,9 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
 int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 		      unsigned int command_bits, u32 flags);
 
+int pci_alloc_irq_vectors(struct pci_dev *dev, int nr_vecs);
+void pci_free_irq_vectors(struct pci_dev *pdev);
+
 /* kmem_cache style wrapper around pci_alloc_consistent() */
 
 #include <linux/pci-dma.h>
@@ -1282,6 +1286,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 		return rc;
 	return 0;
 }
+
 #else
 static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
 static inline void pci_msi_shutdown(struct pci_dev *dev) { }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
                   ` (5 preceding siblings ...)
  2016-04-16  1:35 ` [PATCH 6/8] pci: provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  2016-04-18  8:30   ` Thomas Gleixner
  2016-04-16  1:35 ` [PATCH 8/8] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
  7 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Set the affinity_mask before allocating vectors.  And for now we also
need a little hack after allocation, hopefully someone smarter than me
can move this into the core code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/pci/irq.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index b683465..d26df69 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -55,9 +55,14 @@ int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
 
 	nr_vecs = min(nr_vecs, pci_nr_irq_vectors(pdev));
 
+	ret = irq_create_affinity_mask(&pdev->dev.irq_affinity, nr_vecs);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
 	irqs = kcalloc(nr_vecs, sizeof(u32), GFP_KERNEL);
 	if (!irqs)
-		return -ENOMEM;
+		goto out_free_affinity;
 
 	vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
 	if (vecs <= 0) {
@@ -75,11 +80,20 @@ int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
 			irqs[i] = pdev->irq + i;
 	}
 
+	/* XXX: this should really move into the core IRQ allocation code.. */
+	if (vecs > 1) {
+		for (i = 0; i < vecs; i++)
+			irq_program_affinity(irqs[i]);
+	}
+
 	pdev->irqs = irqs;
 	return vecs;
 
 out_free_irqs:
 	kfree(irqs);
+out_free_affinity:
+	kfree(pdev->dev.irq_affinity);
+	pdev->dev.irq_affinity = NULL;
 	return ret;
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 8/8] nvme: switch to use pci_alloc_irq_vectors
  2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
                   ` (6 preceding siblings ...)
  2016-04-16  1:35 ` [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-04-16  1:35 ` Christoph Hellwig
  7 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2016-04-16  1:35 UTC (permalink / raw)
  To: tglx, linux-block, linux-pci; +Cc: linux-nvme, linux-kernel

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 88 +++++++++++++------------------------------------
 1 file changed, 23 insertions(+), 65 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff3c8d7..82730bf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -89,7 +89,6 @@ struct nvme_dev {
 	unsigned max_qid;
 	int q_depth;
 	u32 db_stride;
-	struct msix_entry *entry;
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct scan_work;
@@ -209,6 +208,11 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
 		nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
+static int nvmeq_irq(struct nvme_queue *nvmeq)
+{
+	return to_pci_dev(nvmeq->dev->dev)->irqs[nvmeq->cq_vector];
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 				unsigned int hctx_idx)
 {
@@ -1016,7 +1020,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 		spin_unlock_irq(&nvmeq->q_lock);
 		return 1;
 	}
-	vector = nvmeq->dev->entry[nvmeq->cq_vector].vector;
+	vector = nvmeq_irq(nvmeq);
 	nvmeq->dev->online_queues--;
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -1024,7 +1028,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_stop_hw_queues(nvmeq->dev->ctrl.admin_q);
 
-	irq_set_affinity_hint(vector, NULL);
 	free_irq(vector, nvmeq);
 
 	return 0;
@@ -1135,11 +1138,11 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 							const char *name)
 {
 	if (use_threaded_interrupts)
-		return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-					nvme_irq_check, nvme_irq, IRQF_SHARED,
-					name, nvmeq);
-	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
-				IRQF_SHARED, name, nvmeq);
+		return request_threaded_irq(nvmeq_irq(nvmeq), nvme_irq_check,
+				nvme_irq, IRQF_SHARED, name, nvmeq);
+	else
+		return request_irq(nvmeq_irq(nvmeq), nvme_irq, IRQF_SHARED,
+				name, nvmeq);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
@@ -1438,7 +1441,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int result, i, vecs, nr_io_queues, size;
+	int result, nr_io_queues, size;
 
 	nr_io_queues = num_possible_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1481,29 +1484,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	}
 
 	/* Deregister the admin queue's interrupt */
-	free_irq(dev->entry[0].vector, adminq);
+	free_irq(pdev->irqs[0], adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
 	 */
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
-
-	for (i = 0; i < nr_io_queues; i++)
-		dev->entry[i].entry = i;
-	vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
-	if (vecs < 0) {
-		vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
-		if (vecs < 0) {
-			vecs = 1;
-		} else {
-			for (i = 0; i < vecs; i++)
-				dev->entry[i].vector = i + pdev->irq;
-		}
-	}
+	pci_free_irq_vectors(pdev);
+	nr_io_queues = pci_alloc_irq_vectors(pdev, nr_io_queues);
+	if (nr_io_queues <= 0)
+		return -EIO;
+	dev->max_qid = nr_io_queues;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -1511,8 +1502,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * path to scale better, even if the receive path is limited by the
 	 * number of interrupts.
 	 */
-	nr_io_queues = vecs;
-	dev->max_qid = nr_io_queues;
 
 	result = queue_request_irq(dev, adminq, adminq->irqname);
 	if (result) {
@@ -1526,22 +1515,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	return result;
 }
 
-static void nvme_set_irq_hints(struct nvme_dev *dev)
-{
-	struct nvme_queue *nvmeq;
-	int i;
-
-	for (i = 0; i < dev->online_queues; i++) {
-		nvmeq = dev->queues[i];
-
-		if (!nvmeq->tags || !(*nvmeq->tags))
-			continue;
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-					blk_mq_tags_cpumask(*nvmeq->tags));
-	}
-}
-
 static void nvme_dev_scan(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, scan_work);
@@ -1549,7 +1522,6 @@ static void nvme_dev_scan(struct work_struct *work)
 	if (!dev->tagset.tags)
 		return;
 	nvme_scan_namespaces(&dev->ctrl);
-	nvme_set_irq_hints(dev);
 }
 
 static void nvme_del_queue_end(struct request *req, int error)
@@ -1654,6 +1626,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.cmd_size = nvme_cmd_size(dev);
 		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 		dev->tagset.driver_data = dev;
+		dev->tagset.affinity_mask = dev->dev->irq_affinity;
 
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
@@ -1694,15 +1667,9 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	 * interrupts. Pre-enable a single MSIX or MSI vec for setup. We'll
 	 * adjust this later.
 	 */
-	if (pci_enable_msix(pdev, dev->entry, 1)) {
-		pci_enable_msi(pdev);
-		dev->entry[0].vector = pdev->irq;
-	}
-
-	if (!dev->entry[0].vector) {
-		result = -ENODEV;
-		goto disable;
-	}
+	result = pci_alloc_irq_vectors(pdev, 1);
+	if (result < 0)
+		return result;
 
 	cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
 
@@ -1744,10 +1711,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
+	pci_free_irq_vectors(pdev);
 
 	if (pci_is_enabled(pdev)) {
 		pci_disable_pcie_error_reporting(pdev);
@@ -1816,7 +1780,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	if (dev->ctrl.admin_q)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 }
 
@@ -1996,10 +1959,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
-							GFP_KERNEL, node);
-	if (!dev->entry)
-		goto free;
 	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
 							GFP_KERNEL, node);
 	if (!dev->queues)
@@ -2042,7 +2001,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_unmap(dev);
  free:
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 	return result;
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-04-16  1:35 ` [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-04-18  8:30   ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2016-04-18  8:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

On Fri, 15 Apr 2016, Christoph Hellwig wrote:
> Set the affinity_mask before allocating vectors.  And for now we also
> need a little hack after allocation, hopefully someone smarter than me
> can move this into the core code.
> 
>  
> +	/* XXX: this should really move into the core IRQ allocation code.. */
> +	if (vecs > 1) {
> +		for (i = 0; i < vecs; i++)
> +			irq_program_affinity(irqs[i]);

No. We don't want to do that at allocation time. The problem here is that we
set the IRQF_NOBALANCING flag for the allocated interrupts and therefore the
affinity is not set from request_irq(). We'll fix it there.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-04-16  1:35 ` [PATCH 6/8] pci: provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-04-29 21:16   ` Bjorn Helgaas
  2016-05-01 18:01     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2016-04-29 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, linux-block, linux-pci, linux-nvme, linux-kernel,
	Alexander Gordeev

[+cc Alexander]

Sorry to be a pedant, but can you please edit the subject to be:

  PCI: Provide sensible IRQ vector alloc/free routines

so it matches the drivers/pci convention?

I like this idea a lot.  The MSI-X/MSI interfaces are much better than
they used to be, and I think this would be another significant
improvement.  What do you think, Alexander?  Here's the whole series
in case you don't have it handy:
http://lkml.kernel.org/r/1460770552-31260-1-git-send-email-hch@lst.de

On Fri, Apr 15, 2016 at 06:35:50PM -0700, Christoph Hellwig wrote:
> Hide all the MSI-X vs MSI vs legacy bullshit, and provide an array of
> interrupt vectors in the pci_dev structure, and ensure we get proper
> interrupt affinity by default.

This patch doesn't do anything for affinity by itself.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/pci/irq.c   | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi.c   |  2 +-
>  drivers/pci/pci.h   |  5 +++
>  include/linux/pci.h |  5 +++
>  4 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
> index 6684f15..b683465 100644
> --- a/drivers/pci/irq.c
> +++ b/drivers/pci/irq.c
> @@ -1,7 +1,8 @@
>  /*
> - * PCI IRQ failure handing code
> + * PCI IRQ handing code

s/handing/handling/ :)

>   *
>   * Copyright (c) 2008 James Bottomley <James.Bottomley@HansenPartnership.com>
> + * Copyright (c) 2016 Christoph Hellwig.
>   */
>  
>  #include <linux/acpi.h>
> @@ -9,6 +10,92 @@
>  #include <linux/kernel.h>
>  #include <linux/export.h>
>  #include <linux/pci.h>
> +#include <linux/interrupt.h>
> +#include "pci.h"
> +
> +static int pci_nr_irq_vectors(struct pci_dev *pdev)
> +{
> +	int nr_entries;
> +
> +	nr_entries = pci_msix_vec_count(pdev);
> +	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
> +		nr_entries = pci_msi_vec_count(pdev);
> +	if (nr_entries <= 0)
> +		nr_entries = 1;
> +	return nr_entries;
> +}
> +
> +static int pci_enable_msix_range_wrapper(struct pci_dev *pdev, u32 *irqs,
> +		int nr_vecs)
> +{
> +	struct msix_entry *msix_entries;
> +	int vecs, i;
> +
> +	msix_entries = kcalloc(nr_vecs, sizeof(struct msix_entry), GFP_KERNEL);
> +	if (!msix_entries)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_vecs; i++)
> +		msix_entries[i].entry = i;
> +
> +	vecs = pci_enable_msix_range(pdev, msix_entries, 1, nr_vecs);
> +	if (vecs > 0) {
> +		for (i = 0; i < vecs; i++)
> +			irqs[i] = msix_entries[i].vector;
> +	}
> +
> +	kfree(msix_entries);
> +	return vecs;
> +}
> +
> +int pci_alloc_irq_vectors(struct pci_dev *pdev, int nr_vecs)
> +{
> +	int vecs, ret, i;
> +	u32 *irqs;
> +
> +	nr_vecs = min(nr_vecs, pci_nr_irq_vectors(pdev));
> +
> +	irqs = kcalloc(nr_vecs, sizeof(u32), GFP_KERNEL);
> +	if (!irqs)
> +		return -ENOMEM;
> +
> +	vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
> +	if (vecs <= 0) {
> +		vecs = pci_enable_msi_range(pdev, 1, min(nr_vecs, 32));

I don't see one, but seems like we should have a #define for this
"32".  I guess pci_enable_msi_range() already protects itself, so this
min() is probably not strictly necessary anyway.

> +		if (vecs <= 0) {
> +			ret = -EIO;
> +			if (!pdev->irq)
> +				goto out_free_irqs;
> +
> +			/* use legacy irq */
> +			vecs = 1;
> +		}
> +
> +		for (i = 0; i < vecs; i++)
> +			irqs[i] = pdev->irq + i;
> +	}
> +
> +	pdev->irqs = irqs;
> +	return vecs;
> +
> +out_free_irqs:
> +	kfree(irqs);
> +	return ret;

  return -EIO;

and remove "ret".

> +}
> +EXPORT_SYMBOL(pci_alloc_irq_vectors);
> +
> +void pci_free_irq_vectors(struct pci_dev *pdev)
> +{
> +	if (pdev->msi_enabled)
> +		pci_disable_msi(pdev);
> +	else if (pdev->msix_enabled)
> +		pci_disable_msix(pdev);
> +
> +	kfree(pdev->dev.irq_affinity);
> +	pdev->dev.irq_affinity = NULL;

These two lines belong in a different patch.

> +	kfree(pdev->irqs);
> +}
> +EXPORT_SYMBOL(pci_free_irq_vectors);
>  
>  static void pci_note_irq_problem(struct pci_dev *pdev, const char *reason)
>  {
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index a080f44..544d306 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -815,7 +815,7 @@ out_free:
>   * to determine if MSI/-X are supported for the device. If MSI/-X is
>   * supported return 1, else return 0.
>   **/
> -static int pci_msi_supported(struct pci_dev *dev, int nvec)
> +int pci_msi_supported(struct pci_dev *dev, int nvec)
>  {
>  	struct pci_bus *bus;
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index d0fb934..263422c 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -144,8 +144,13 @@ extern unsigned int pci_pm_d3_delay;
>  
>  #ifdef CONFIG_PCI_MSI
>  void pci_no_msi(void);
> +int pci_msi_supported(struct pci_dev *dev, int nvec);
>  #else
>  static inline void pci_no_msi(void) { }
> +static inline int pci_msi_supported(struct pci_dev *dev, int nvec)
> +{
> +	return 0;
> +}
>  #endif
>  
>  static inline void pci_msi_set_enable(struct pci_dev *dev, int enable)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 004b813..4fbc14f 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -322,6 +322,7 @@ struct pci_dev {
>  	 * directly, use the values stored here. They might be different!
>  	 */
>  	unsigned int	irq;
> +	unsigned int	*irqs;
>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
>  
>  	bool match_driver;		/* Skip attaching driver */
> @@ -1235,6 +1236,9 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
>  int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  		      unsigned int command_bits, u32 flags);
>  
> +int pci_alloc_irq_vectors(struct pci_dev *dev, int nr_vecs);
> +void pci_free_irq_vectors(struct pci_dev *pdev);
> +
>  /* kmem_cache style wrapper around pci_alloc_consistent() */
>  
>  #include <linux/pci-dma.h>
> @@ -1282,6 +1286,7 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  		return rc;
>  	return 0;
>  }
> +
>  #else
>  static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
>  static inline void pci_msi_shutdown(struct pci_dev *dev) { }
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-04-29 21:16   ` Bjorn Helgaas
@ 2016-05-01 18:01     ` Christoph Hellwig
  2016-05-02 13:11       ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-05-01 18:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, tglx, linux-block, linux-pci, linux-nvme,
	linux-kernel, Alexander Gordeev

On Fri, Apr 29, 2016 at 04:16:39PM -0500, Bjorn Helgaas wrote:
> Sorry to be a pedant, but can you please edit the subject to be:
> 
>   PCI: Provide sensible IRQ vector alloc/free routines

sure.

> 
> so it matches the drivers/pci convention?
> 
> I like this idea a lot.  The MSI-X/MSI interfaces are much better than
> they used to be, and I think this would be another significant
> improvement.  What do you think, Alexander?  Here's the whole series
> in case you don't have it handy:
> http://lkml.kernel.org/r/1460770552-31260-1-git-send-email-hch@lst.de

FYI, I spent some time trying to convert more drivers to this, and
I think we'll need an additional flag to skip MSI or MSI-X as there
is plenty of hardware claiming support in the capabilities flag,
but not actually supporting one of them.

> > Hide all the MSI-X vs MSI vs legacy bullshit, and provide an array of
> > interrupt vectors in the pci_dev structure, and ensure we get proper
> > interrupt affinity by default.
> 
> This patch doesn't do anything for affinity by itself.

it used to in an earlier incarnation before I split that out.  But yes,
the changelog should be updated.

> > +	vecs = pci_enable_msix_range_wrapper(pdev, irqs, nr_vecs);
> > +	if (vecs <= 0) {
> > +		vecs = pci_enable_msi_range(pdev, 1, min(nr_vecs, 32));
> 
> I don't see one, but seems like we should have a #define for this
> "32".  I guess pci_enable_msi_range() already protects itself, so this
> min() is probably not strictly necessary anyway.

Ok, I'll take a look and will either remove it entirely or add a
define, depending on the audit.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-05-01 18:01     ` Christoph Hellwig
@ 2016-05-02 13:11       ` Bjorn Helgaas
  2016-05-02 14:42         ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2016-05-02 13:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, linux-block, linux-pci, linux-nvme, linux-kernel,
	Alexander Gordeev

On Sun, May 01, 2016 at 08:01:49PM +0200, Christoph Hellwig wrote:
> FYI, I spent some time trying to convert more drivers to this, and
> I think we'll need an additional flag to skip MSI or MSI-X as there
> is plenty of hardware claiming support in the capabilities flag,
> but not actually supporting one of them.

Or maybe add a "pdev->msix_broken" bit and quirks to set it?  Or if
pci_fixup_final quirks merely cleared pdev->msix_cap, I think the PCI
core would never try to enable MSI-X.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-05-02 13:11       ` Bjorn Helgaas
@ 2016-05-02 14:42         ` Christoph Hellwig
  2016-05-02 15:29           ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-05-02 14:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, tglx, linux-block, linux-pci, linux-nvme,
	linux-kernel, Alexander Gordeev

On Mon, May 02, 2016 at 08:11:24AM -0500, Bjorn Helgaas wrote:
> On Sun, May 01, 2016 at 08:01:49PM +0200, Christoph Hellwig wrote:
> > FYI, I spent some time trying to convert more drivers to this, and
> > I think we'll need an additional flag to skip MSI or MSI-X as there
> > is plenty of hardware claiming support in the capabilities flag,
> > but not actually supporting one of them.
> 
> Or maybe add a "pdev->msix_broken" bit and quirks to set it?  Or if
> pci_fixup_final quirks merely cleared pdev->msix_cap, I think the PCI
> core would never try to enable MSI-X.

Can't say I'm excited about quirks - now we'd have to patch core
code for something that previously was entirely in the driver.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-05-02 14:42         ` Christoph Hellwig
@ 2016-05-02 15:29           ` Bjorn Helgaas
  2016-05-03 21:19             ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2016-05-02 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, linux-block, linux-pci, linux-nvme, linux-kernel,
	Alexander Gordeev

On Mon, May 02, 2016 at 04:42:03PM +0200, Christoph Hellwig wrote:
> On Mon, May 02, 2016 at 08:11:24AM -0500, Bjorn Helgaas wrote:
> > On Sun, May 01, 2016 at 08:01:49PM +0200, Christoph Hellwig wrote:
> > > FYI, I spent some time trying to convert more drivers to this, and
> > > I think we'll need an additional flag to skip MSI or MSI-X as there
> > > is plenty of hardware claiming support in the capabilities flag,
> > > but not actually supporting one of them.
> > 
> > Or maybe add a "pdev->msix_broken" bit and quirks to set it?  Or if
> > pci_fixup_final quirks merely cleared pdev->msix_cap, I think the PCI
> > core would never try to enable MSI-X.
> 
> Can't say I'm excited about quirks - now we'd have to patch core
> code for something that previously was entirely in the driver.

Yeah, you're right.  I was imagining a quirk in the driver itself, but
now that I look at it, I don't see any infrastructure for that.  I
think there are a lot of existing quirks that could be moved from the
core to a driver if we had support for quirks in drivers.

It just seems a shame to complicate the pci_alloc_irq_vectors()
interface with flags about broken devices.

I guess if we added a "pdev->msix_broken" bit, it would be visible to
drivers, and they could easily set it themselves in their .probe()
methods even without any actual quirk mechanism.  But a flag to
pci_alloc_irq_vectors() would certainly be more direct.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-05-02 15:29           ` Bjorn Helgaas
@ 2016-05-03 21:19             ` Christoph Hellwig
  2016-05-03 21:37               ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2016-05-03 21:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, tglx, linux-block, linux-pci, linux-nvme,
	linux-kernel, Alexander Gordeev

Hi Bjorn,

I've implemented your suggestion and I'm getting ready to send out
a new version.  One thing that came to mind is:  do you prefer this
code in irq.c or would you rather have it in msi.c?  While it
also has a legacy irq fallback, most of it is tied pretty closely to
the msi.c code, so I wonder if we should group them together.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
  2016-05-03 21:19             ` Christoph Hellwig
@ 2016-05-03 21:37               ` Bjorn Helgaas
  0 siblings, 0 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2016-05-03 21:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, linux-block, linux-pci, linux-nvme, linux-kernel,
	Alexander Gordeev

On Tue, May 03, 2016 at 11:19:46PM +0200, Christoph Hellwig wrote:
> Hi Bjorn,
> 
> I've implemented your suggestion and I'm getting ready to send out
> a new version.  One thing that came to mind is:  do you prefer this
> code in irq.c or would you rather have it in msi.c?  While it
> also has a legacy irq fallback, most of it is tied pretty closely to
> the msi.c code, so I wonder if we should group them together.

Good question.  There isn't much in irq.c, and the interesting bits
are the MSI-related things, so maybe msi.c would make more sense.

Bjorn

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-05-03 21:38 UTC | newest]

Thread overview: 17+ messages
2016-04-16  1:35 RFC: automatic interrupt affinity for MSI/MSI-X capable devices Christoph Hellwig
2016-04-16  1:35 ` [PATCH 1/8] device: Add irq affinity hint cpumask pointer Christoph Hellwig
2016-04-16  1:35 ` [PATCH 2/8] genirq: Make use of dev->irq_affinity Christoph Hellwig
2016-04-16  1:35 ` [PATCH 3/8] genirq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
2016-04-16  1:35 ` [PATCH 4/8] genirq: add a helper to program the pre-set affinity mask into the controller Christoph Hellwig
2016-04-16  1:35 ` [PATCH 5/8] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
2016-04-16  1:35 ` [PATCH 6/8] pci: provide sensible irq vector alloc/free routines Christoph Hellwig
2016-04-29 21:16   ` Bjorn Helgaas
2016-05-01 18:01     ` Christoph Hellwig
2016-05-02 13:11       ` Bjorn Helgaas
2016-05-02 14:42         ` Christoph Hellwig
2016-05-02 15:29           ` Bjorn Helgaas
2016-05-03 21:19             ` Christoph Hellwig
2016-05-03 21:37               ` Bjorn Helgaas
2016-04-16  1:35 ` [PATCH 7/8] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
2016-04-18  8:30   ` Thomas Gleixner
2016-04-16  1:35 ` [PATCH 8/8] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
