linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* automatic interrupt affinity for MSI/MSI-X capable devices V2
@ 2016-06-14 19:58 Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
                   ` (14 more replies)
  0 siblings, 15 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

This series enhances the irq and PCI code to allow spreading around MSI and
MSI-X vectors so that they have per-cpu affinity if possible, or at least
per-node.  For that it takes the algorithm from blk-mq, moves it to
a common place, and makes it available through a vastly simplified PCI
interrupt allocation API.  It then switches blk-mq to be able to pick up
the queue mapping from the device if available, and demonstrates all this
using the NVMe driver.

There also is a git tree available at:

   git://git.infradead.org/users/hch/block.git

Gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/msix-spreading.4

Changes since V1:
 - irq core improvements to properly assign the affinity before
   request_irq (tglx)
 - better handling of the MSI vs MSI-X differences in the low level
   MSI allocator (hch and tglx)
 - various improvements to pci_alloc_irq_vectors (hch)
 - remove blk-mq hardware queue reassigned on hotplug cpu events (hch)
 - forward ported to Jens' current for-next tree (hch)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-16  9:05   ` Bart Van Assche
  2016-06-14 19:58 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

No user and we definitely don't want to grow one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h | 6 ++----
 kernel/irq/msi.c    | 8 ++------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8b425c6..c33abfa 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -264,12 +264,10 @@ enum {
 	 * callbacks.
 	 */
 	MSI_FLAG_USE_DEF_CHIP_OPS	= (1 << 1),
-	/* Build identity map between hwirq and irq */
-	MSI_FLAG_IDENTITY_MAP		= (1 << 2),
 	/* Support multiple PCI MSI interrupts */
-	MSI_FLAG_MULTI_PCI_MSI		= (1 << 3),
+	MSI_FLAG_MULTI_PCI_MSI		= (1 << 2),
 	/* Support PCI MSIX interrupts */
-	MSI_FLAG_PCI_MSIX		= (1 << 4),
+	MSI_FLAG_PCI_MSIX		= (1 << 3),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 38e89ce..eb5bf2b 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -324,7 +324,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	struct msi_domain_ops *ops = info->ops;
 	msi_alloc_info_t arg;
 	struct msi_desc *desc;
-	int i, ret, virq = -1;
+	int i, ret, virq;
 
 	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
 	if (ret)
@@ -332,12 +332,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	for_each_msi_entry(desc, dev) {
 		ops->set_desc(&arg, desc);
-		if (info->flags & MSI_FLAG_IDENTITY_MAP)
-			virq = (int)ops->get_hwirq(info, &arg);
-		else
-			virq = -1;
 
-		virq = __irq_domain_alloc_irqs(domain, virq, desc->nvec_used,
+		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false);
 		if (virq < 0) {
 			ret = -ENOSPC;
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-15  8:44   ` Bart Van Assche
  2016-06-16  9:08   ` Bart Van Assche
  2016-06-14 19:58 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Interupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq.h    |  7 +++++++
 kernel/irq/internals.h |  2 ++
 kernel/irq/manage.c    | 21 ++++++++++++++++++---
 kernel/irq/proc.c      |  2 +-
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 4d758a7..49d66d1 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -197,6 +197,7 @@ struct irq_data {
  * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
  * IRQD_WAKEUP_ARMED		- Wakeup mode armed
  * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
+ * IRQD_AFFINITY_MANAGED	- Affinity is managed automatically
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -212,6 +213,7 @@ enum {
 	IRQD_IRQ_INPROGRESS		= (1 << 18),
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
+	IRQD_AFFINITY_MANAGED		= (1 << 21),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -305,6 +307,11 @@ static inline void irqd_clr_forwarded_to_vcpu(struct irq_data *d)
 	__irqd_to_state(d) &= ~IRQD_FORWARDED_TO_VCPU;
 }
 
+static inline bool irqd_affinity_is_managed(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 09be2c9..b15aa3b 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -105,6 +105,8 @@ static inline void unregister_handler_proc(unsigned int irq,
 					   struct irqaction *action) { }
 #endif
 
+extern bool irq_can_set_affinity_usr(unsigned int irq);
+
 extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask);
 
 extern void irq_set_thread_affinity(struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ef0bc02..30658e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -115,12 +115,12 @@ EXPORT_SYMBOL(synchronize_irq);
 #ifdef CONFIG_SMP
 cpumask_var_t irq_default_affinity;
 
-static int __irq_can_set_affinity(struct irq_desc *desc)
+static bool __irq_can_set_affinity(struct irq_desc *desc)
 {
 	if (!desc || !irqd_can_balance(&desc->irq_data) ||
 	    !desc->irq_data.chip || !desc->irq_data.chip->irq_set_affinity)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
 
 /**
@@ -134,6 +134,21 @@ int irq_can_set_affinity(unsigned int irq)
 }
 
 /**
+ * irq_can_set_affinity_usr - Check if affinity of a irq can be set from user space
+ * @irq:	Interrupt to check
+ *
+ * Like irq_can_set_affinity() above, but additionally checks for the
+ * AFFINITY_MANAGED flag.
+ */
+bool irq_can_set_affinity_usr(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	return __irq_can_set_affinity(desc) &&
+		!irqd_affinity_is_managed(&desc->irq_data);
+}
+
+/**
  *	irq_set_thread_affinity - Notify irq threads to adjust affinity
  *	@desc:		irq descriptor which has affitnity changed
  *
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4e1b947..40bdcdc 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -96,7 +96,7 @@ static ssize_t write_irq_affinity(int type, struct file *file,
 	cpumask_var_t new_value;
 	int err;
 
-	if (!irq_can_set_affinity(irq) || no_irq_affinity)
+	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
 		return -EIO;
 
 	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 03/13] irq: Add affinity hint to irq allocation
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Add an extra argument to the irq(domain) allocation functions, so we can hand
down affinity hints to the allocator. Thats necessary to implement proper
support for multiqueue devices.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/sparc/kernel/irq_64.c     |  2 +-
 arch/x86/kernel/apic/io_apic.c |  5 +++--
 include/linux/irq.h            |  4 ++--
 include/linux/irqdomain.h      |  9 ++++++---
 kernel/irq/ipi.c               |  2 +-
 kernel/irq/irqdesc.c           | 12 ++++++++----
 kernel/irq/irqdomain.c         | 22 ++++++++++++++--------
 kernel/irq/manage.c            |  7 ++++---
 kernel/irq/msi.c               |  3 ++-
 9 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index e22416c..34a7930 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -242,7 +242,7 @@ unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
 {
 	int irq;
 
-	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
+	irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL, NULL);
 	if (irq <= 0)
 		goto out;
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 84e33ff..bca0c81 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -981,7 +981,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,
 
 	return __irq_domain_alloc_irqs(domain, irq, 1,
 				       ioapic_alloc_attr_node(info),
-				       info, legacy);
+				       info, legacy, NULL);
 }
 
 /*
@@ -1014,7 +1014,8 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain,
 					  info->ioapic_pin))
 			return -ENOMEM;
 	} else {
-		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true);
+		irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+					      NULL);
 		if (irq >= 0) {
 			irq_data = irq_domain_get_irq_data(domain, irq);
 			data = irq_data->chip_data;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 49d66d1..63803a4 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -708,11 +708,11 @@ static inline struct cpumask *irq_data_get_affinity_mask(struct irq_data *d)
 unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		struct module *owner);
+		      struct module *owner, const struct cpumask *affinity);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
-	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index f1f36e0..1aee0fb 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -39,6 +39,7 @@ struct irq_domain;
 struct of_device_id;
 struct irq_chip;
 struct irq_data;
+struct cpumask;
 
 /* Number of irqs reserved for a legacy isa controller */
 #define NUM_ISA_INTERRUPTS	16
@@ -217,7 +218,8 @@ extern struct irq_domain *irq_find_matching_fwspec(struct irq_fwspec *fwspec,
 						   enum irq_domain_bus_token bus_token);
 extern void irq_set_default_host(struct irq_domain *host);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
-				  irq_hw_number_t hwirq, int node);
+				  irq_hw_number_t hwirq, int node,
+				  const struct cpumask *affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
@@ -389,7 +391,7 @@ static inline struct irq_domain *irq_domain_add_hierarchy(struct irq_domain *par
 
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 				   unsigned int nr_irqs, int node, void *arg,
-				   bool realloc);
+				   bool realloc, const struct cpumask *affinity);
 extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs);
 extern void irq_domain_activate_irq(struct irq_data *irq_data);
 extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
@@ -397,7 +399,8 @@ extern void irq_domain_deactivate_irq(struct irq_data *irq_data);
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
 			unsigned int nr_irqs, int node, void *arg)
 {
-	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false);
+	return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false,
+				       NULL);
 }
 
 extern int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 89b49f6..4fd2351 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -76,7 +76,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 8731e1c..b8df4fc 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -223,7 +223,7 @@ static void free_desc(unsigned int irq)
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
-		       struct module *owner)
+		       const struct cpumask *affinity, struct module *owner)
 {
 	struct irq_desc *desc;
 	int i;
@@ -333,6 +333,7 @@ static void free_desc(unsigned int irq)
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
+			      const struct cpumask *affinity,
 			      struct module *owner)
 {
 	u32 i;
@@ -453,12 +454,15 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @cnt:	Number of consecutive irqs to allocate.
  * @node:	Preferred node on which the irq descriptor should be allocated
  * @owner:	Owning module (can be NULL)
+ * @affinity:	Optional pointer to an affinity mask which hints where the
+ *		irq descriptors should be allocated and which default
+ *		affinities to use
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		  struct module *owner)
+		  struct module *owner, const struct cpumask *affinity)
 {
 	int start, ret;
 
@@ -494,7 +498,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
 	bitmap_set(allocated_irqs, start, cnt);
 	mutex_unlock(&sparse_irq_lock);
-	return alloc_descs(start, cnt, node, owner);
+	return alloc_descs(start, cnt, node, affinity, owner);
 
 err:
 	mutex_unlock(&sparse_irq_lock);
@@ -512,7 +516,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_alloc_hwirqs(int cnt, int node)
 {
-	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL);
+	int i, irq = __irq_alloc_descs(-1, 0, cnt, node, NULL, NULL);
 
 	if (irq < 0)
 		return 0;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 8798b6c..79459b7 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,7 +481,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node));
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -835,19 +835,23 @@ const struct irq_domain_ops irq_domain_simple_ops = {
 EXPORT_SYMBOL_GPL(irq_domain_simple_ops);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node)
+			   int node, const struct cpumask *affinity)
 {
 	unsigned int hint;
 
 	if (virq >= 0) {
-		virq = irq_alloc_descs(virq, virq, cnt, node);
+		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
+					 affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
-		virq = irq_alloc_descs_from(hint, cnt, node);
-		if (virq <= 0 && hint > 1)
-			virq = irq_alloc_descs_from(1, cnt, node);
+		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
+					 affinity);
+		if (virq <= 0 && hint > 1) {
+			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
+						 affinity);
+		}
 	}
 
 	return virq;
@@ -1160,6 +1164,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  * @node:	NUMA node id for memory allocation
  * @arg:	domain specific argument
  * @realloc:	IRQ descriptors have already been allocated if true
+ * @affinity:	Optional irq affinity mask for multiqueue devices
  *
  * Allocate IRQ numbers and initialized all data structures to support
  * hierarchy IRQ domains.
@@ -1175,7 +1180,7 @@ int irq_domain_alloc_irqs_recursive(struct irq_domain *domain,
  */
 int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 			    unsigned int nr_irqs, int node, void *arg,
-			    bool realloc)
+			    bool realloc, const struct cpumask *affinity)
 {
 	int i, ret, virq;
 
@@ -1193,7 +1198,8 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 	if (realloc && irq_base >= 0) {
 		virq = irq_base;
 	} else {
-		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node);
+		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
+					      affinity);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 30658e9..ad0aac6 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -353,10 +353,11 @@ static int setup_affinity(struct irq_desc *desc, struct cpumask *mask)
 		return 0;
 
 	/*
-	 * Preserve an userspace affinity setup, but make sure that
-	 * one of the targets is online.
+	 * Preserve the managed affinity setting and an userspace affinity
+	 * setup, but make sure that one of the targets is online.
 	 */
-	if (irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
+	if (irqd_affinity_is_managed(&desc->irq_data) ||
+	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_common_data.affinity,
 				       cpu_online_mask))
 			set = desc->irq_common_data.affinity;
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index eb5bf2b..58dbbac 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -334,7 +334,8 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
-					       dev_to_node(dev), &arg, false);
+					       dev_to_node(dev), &arg, false,
+					       NULL);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 04/13] irq: Use affinity hint in irqdesc allocation
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-06-14 19:58 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator iterates
over the cpumask and for each set cpu it allocates on their node and sets the
initial affinity to that cpu.

If a single interrupt is allocated (MSI-X) then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/irqdesc.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b8df4fc..a623b44 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -68,9 +68,13 @@ static int alloc_masks(struct irq_desc *desc, gfp_t gfp, int node)
 	return 0;
 }
 
-static void desc_smp_init(struct irq_desc *desc, int node)
+static void desc_smp_init(struct irq_desc *desc, int node,
+			  const struct cpumask *affinity)
 {
-	cpumask_copy(desc->irq_common_data.affinity, irq_default_affinity);
+	if (!affinity)
+		affinity = irq_default_affinity;
+	cpumask_copy(desc->irq_common_data.affinity, affinity);
+
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_clear(desc->pending_mask);
 #endif
@@ -82,11 +86,12 @@ static void desc_smp_init(struct irq_desc *desc, int node)
 #else
 static inline int
 alloc_masks(struct irq_desc *desc, gfp_t gfp, int node) { return 0; }
-static inline void desc_smp_init(struct irq_desc *desc, int node) { }
+static inline void
+desc_smp_init(struct irq_desc *desc, int node, const struct cpumask *affinity) { }
 #endif
 
 static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
-		struct module *owner)
+			      const struct cpumask *affinity, struct module *owner)
 {
 	int cpu;
 
@@ -107,7 +112,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
-	desc_smp_init(desc, node);
+	desc_smp_init(desc, node, affinity);
 }
 
 int nr_irqs = NR_IRQS;
@@ -158,7 +163,9 @@ void irq_unlock_sparse(void)
 	mutex_unlock(&sparse_irq_lock);
 }
 
-static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
+static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
+				   const struct cpumask *affinity,
+				   struct module *owner)
 {
 	struct irq_desc *desc;
 	gfp_t gfp = GFP_KERNEL;
@@ -178,7 +185,8 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
 	lockdep_set_class(&desc->lock, &irq_desc_lock_class);
 	init_rcu_head(&desc->rcu);
 
-	desc_set_defaults(irq, desc, node, owner);
+	desc_set_defaults(irq, desc, node, affinity, owner);
+	irqd_set(&desc->irq_data, flags);
 
 	return desc;
 
@@ -225,11 +233,30 @@ static void free_desc(unsigned int irq)
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		       const struct cpumask *affinity, struct module *owner)
 {
+	const struct cpumask *mask = NULL;
 	struct irq_desc *desc;
-	int i;
+	unsigned int flags;
+	int i, cpu = -1;
+
+	if (affinity && cpumask_empty(affinity))
+		return -EINVAL;
+
+	flags = affinity ? IRQD_AFFINITY_MANAGED : 0;
 
 	for (i = 0; i < cnt; i++) {
-		desc = alloc_desc(start + i, node, owner);
+		if (affinity) {
+			cpu = cpumask_next(cpu, affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(affinity);
+			node = cpu_to_node(cpu);
+
+			/*
+			 * For single allocations we use the caller provided
+			 * mask otherwise we use the mask of the target cpu
+			 */
+			mask = cnt == 1 ? affinity : cpumask_of(cpu);
+		}
+		desc = alloc_desc(start + i, node, flags, mask, owner);
 		if (!desc)
 			goto err;
 		mutex_lock(&sparse_irq_lock);
@@ -277,7 +304,7 @@ int __init early_irq_init(void)
 		nr_irqs = initcnt;
 
 	for (i = 0; i < initcnt; i++) {
-		desc = alloc_desc(i, node, NULL);
+		desc = alloc_desc(i, node, 0, NULL, NULL);
 		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
@@ -311,7 +338,7 @@ int __init early_irq_init(void)
 		alloc_masks(&desc[i], GFP_KERNEL, node);
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
-		desc_set_defaults(i, &desc[i], node, NULL);
+		desc_set_defaults(i, &desc[i], node, NULL, NULL);
 	}
 	return arch_early_irq_init();
 }
@@ -328,7 +355,7 @@ static void free_desc(unsigned int irq)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL);
+	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 05/13] irq/msi: Make use of affinity aware allocations
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2016-06-14 19:58 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-14 19:58 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

From: Thomas Gleixner <tglx@linutronix.de>

Allow the MSI code to provide affinity hints per MSI descriptor.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h | 2 ++
 kernel/irq/msi.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index c33abfa..4f0bfe5 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -47,6 +47,7 @@ struct fsl_mc_msi_desc {
  * @nvec_used:	The number of vectors used
  * @dev:	Pointer to the device which uses this descriptor
  * @msg:	The last set MSI message cached for reuse
+ * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
  *
  * @masked:	[PCI MSI/X] Mask bits
  * @is_msix:	[PCI MSI/X] True if MSI-X
@@ -67,6 +68,7 @@ struct msi_desc {
 	unsigned int			nvec_used;
 	struct device			*dev;
 	struct msi_msg			msg;
+	const struct cpumask		*affinity;
 
 	union {
 		/* PCI MSI/X specific data */
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 58dbbac..0e2a736 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -335,7 +335,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false,
-					       NULL);
+					       desc->affinity);
 		if (virq < 0) {
 			ret = -ENOSPC;
 			if (ops->handle_error)
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2016-06-14 19:58 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
@ 2016-06-14 19:58 ` Christoph Hellwig
  2016-06-14 21:54   ` Guilherme G. Piccoli
  2016-06-25 20:05   ` Alexander Gordeev
  2016-06-14 19:59 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:58 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

This is lifted from the blk-mq code and adopted to use the affinity mask
concept just intruced in the irq handling code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/interrupt.h | 11 +++++++++
 kernel/irq/Makefile       |  1 +
 kernel/irq/affinity.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 72 insertions(+)
 create mode 100644 kernel/irq/affinity.c

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..12003c0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -278,6 +278,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int *nr_vecs);
+
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -308,6 +311,14 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	return 0;
 }
+
+static inline int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int *nr_vecs)
+{
+	*affinity_mask = NULL;
+	*nr_vecs = 1;
+	return 0;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 2ee42e9..1d3ee31 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
 obj-$(CONFIG_PM_SLEEP) += pm.o
 obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
 obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
+obj-$(CONFIG_SMP) += affinity.o
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
new file mode 100644
index 0000000..1daf8fb
--- /dev/null
+++ b/kernel/irq/affinity.c
@@ -0,0 +1,60 @@
+
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+
+static int get_first_sibling(unsigned int cpu)
+{
+	unsigned int ret;
+
+	ret = cpumask_first(topology_sibling_cpumask(cpu));
+	if (ret < nr_cpu_ids)
+		return ret;
+	return cpu;
+}
+
+/*
+ * Take a map of online CPUs and the number of available interrupt vectors
+ * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
+ * so that they are distributed as good as possible around the CPUs.  If
+ * more vectors than CPUs are available we'll map one to each CPU,
+ * otherwise we map one to the first sibling of each socket.
+ *
+ * If there are more vectors than CPUs we will still only have one bit
+ * set per CPU, but interrupt code will keep on assining the vectors from
+ * the start of the bitmap until we run out of vectors.
+ */
+int irq_create_affinity_mask(struct cpumask **affinity_mask,
+		unsigned int *nr_vecs)
+{
+	unsigned int vecs = 0;
+
+	if (*nr_vecs == 1) {
+		*affinity_mask = NULL;
+		return 0;
+	}
+
+	*affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!*affinity_mask)
+		return -ENOMEM;
+
+	if (*nr_vecs >= num_online_cpus()) {
+		cpumask_copy(*affinity_mask, cpu_online_mask);
+	} else {
+		unsigned int cpu;
+
+		for_each_online_cpu(cpu) {
+			if (cpu == get_first_sibling(cpu)) {
+				cpumask_set_cpu(cpu, *affinity_mask);
+				vecs++;
+			}
+
+			if (--(*nr_vecs) == 0)
+				break;
+		}
+	}
+
+	*nr_vecs = vecs;
+	return 0;
+}
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2016-06-14 19:58 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-23 11:16   ` Alexander Gordeev
  2016-06-14 19:59 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Add a helper to allocate a range of interrupt vectors, which will
transparently use MSI-X and MSI if available or fallback to legacy
vectors.  The interrupts are available in a core managed array
in the pci_dev structure, and can also be released using a similar
helper.

The next patch will also add automatic spreading of MSI / MSI-X
vectors to this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/pci/msi.c   | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h |  18 +++++++++
 2 files changed, 128 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a080f44..a33adec 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -4,6 +4,7 @@
  *
  * Copyright (C) 2003-2004 Intel
  * Copyright (C) Tom Long Nguyen (tom.l.nguyen@intel.com)
+ * Copyright (c) 2016 Christoph Hellwig.
  */
 
 #include <linux/err.h>
@@ -1120,6 +1121,115 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
+static unsigned int pci_nr_irq_vectors(struct pci_dev *pdev)
+{
+	int nr_entries;
+
+	nr_entries = pci_msix_vec_count(pdev);
+	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
+		nr_entries = pci_msi_vec_count(pdev);
+	if (nr_entries <= 0)
+		nr_entries = 1;
+	return nr_entries;
+}
+
+static int pci_enable_msix_range_wrapper(struct pci_dev *pdev, u32 *irqs,
+		unsigned int min_vecs, unsigned int max_vecs)
+{
+	struct msix_entry *msix_entries;
+	int vecs, i;
+
+	msix_entries = kcalloc(max_vecs, sizeof(struct msix_entry), GFP_KERNEL);
+	if (!msix_entries)
+		return -ENOMEM;
+
+	for (i = 0; i < max_vecs; i++)
+		msix_entries[i].entry = i;
+
+	vecs = pci_enable_msix_range(pdev, msix_entries, min_vecs, max_vecs);
+	if (vecs > 0) {
+		for (i = 0; i < vecs; i++)
+			irqs[i] = msix_entries[i].vector;
+	}
+
+	kfree(msix_entries);
+	return vecs;
+}
+
+/**
+ * pci_alloc_irq_vectors - allocate multiple IRQs for a device
+ * @dev:		PCI device to operate on
+ * @min_vecs:		minimum number of vectors required (must be >= 1)
+ * @max_vecs:		maximum (desired) number of vectors
+ * @flags:		flags or quirks for the allocation
+ *
+ * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI
+ * vectors if available, and fall back to a single legacy vector
+ * if neither is available.  Return the number of vectors allocated,
+ * (which might be smaller than @max_vecs) if successful, or a negative
+ * error code on error.  The Linux irq numbers for the allocated
+ * vectors are stored in pdev->irqs.  If less than @min_vecs interrupt
+ * vectors are available for @dev the function will fail with -ENOSPC.
+ */
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs, unsigned int flags)
+{
+	unsigned int vecs, i;
+	u32 *irqs;
+
+	max_vecs = min(max_vecs, pci_nr_irq_vectors(dev));
+
+	irqs = kcalloc(max_vecs, sizeof(u32), GFP_KERNEL);
+	if (!irqs)
+		return -ENOMEM;
+
+	if (!(flags & PCI_IRQ_NOMSIX)) {
+		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
+				max_vecs);
+		if (vecs > 0)
+			goto done;
+	}
+
+	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
+	if (vecs > 0) {
+		for (i = 0; i < vecs; i++)
+			irqs[i] = dev->irq + i;
+		goto done;
+	}
+
+	if (min_vecs > 1)
+		return -ENOSPC;
+
+	/* use legacy irq */
+	kfree(irqs);
+	dev->irqs = &dev->irq;
+	return 1;
+
+done:
+	dev->irqs = irqs;
+	return vecs;
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors);
+
+/**
+ * pci_free_irq_vectors - free previously allocated IRQs for a device
+ * @dev:		PCI device to operate on
+ *
+ * Undoes the allocations and enabling in pci_alloc_irq_vectors().
+ */
+void pci_free_irq_vectors(struct pci_dev *dev)
+{
+	if (dev->msix_enabled)
+		pci_disable_msix(dev);
+	else if (dev->msi_enabled)
+		pci_disable_msi(dev);
+
+	if (dev->irqs != &dev->irq)
+		kfree(dev->irqs);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors);
+
+
 struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
 {
 	return to_pci_dev(desc->dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b67e4df..84a20fc 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -320,6 +320,7 @@ struct pci_dev {
 	 * directly, use the values stored here. They might be different!
 	 */
 	unsigned int	irq;
+	unsigned int	*irqs;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
 
 	bool match_driver;		/* Skip attaching driver */
@@ -1237,6 +1238,8 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
 int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 		      unsigned int command_bits, u32 flags);
 
+#define PCI_IRQ_NOMSIX		(1 << 0) /* don't try to use MSI-X interrupts */
+
 /* kmem_cache style wrapper around pci_alloc_consistent() */
 
 #include <linux/pci-dma.h>
@@ -1284,6 +1287,9 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 		return rc;
 	return 0;
 }
+int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
+		unsigned int max_vecs, unsigned int flags);
+void pci_free_irq_vectors(struct pci_dev *dev);
 #else
 static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
 static inline void pci_msi_shutdown(struct pci_dev *dev) { }
@@ -1307,6 +1313,18 @@ static inline int pci_enable_msix_range(struct pci_dev *dev,
 static inline int pci_enable_msix_exact(struct pci_dev *dev,
 		      struct msix_entry *entries, int nvec)
 { return -ENOSYS; }
+static inline int pci_alloc_irq_vectors(struct pci_dev *dev,
+		unsigned int min_vecs, unsigned int max_vecs,
+		unsigned int flags)
+{
+	if (min_vecs > 1)
+		return -ENOSPC;
+	dev->irqs = &dev->irq;
+	return 1;
+}
+static inline void pci_free_irq_vectors(struct pci_dev *dev)
+{
+}
 #endif
 
 #ifdef CONFIG_PCIEPORTBUS
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-25 20:22   ` Alexander Gordeev
  2016-06-14 19:59 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Set the affinity_mask before allocating vectors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/pci/msi.c   | 26 ++++++++++++++++++++++++--
 include/linux/pci.h |  1 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a33adec..50d694c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -568,6 +568,7 @@ static struct msi_desc *msi_setup_entry(struct pci_dev *dev, int nvec)
 	entry->msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
 	entry->msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
 	entry->nvec_used		= nvec;
+	entry->affinity			= dev->irq_affinity;
 
 	if (control & PCI_MSI_FLAGS_64BIT)
 		entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
@@ -679,10 +680,18 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			      struct msix_entry *entries, int nvec)
 {
+	const struct cpumask *mask = NULL;
 	struct msi_desc *entry;
-	int i;
+	int cpu = -1, i;
 
 	for (i = 0; i < nvec; i++) {
+		if (dev->irq_affinity) {
+			cpu = cpumask_next(cpu, dev->irq_affinity);
+			if (cpu >= nr_cpu_ids)
+				cpu = cpumask_first(dev->irq_affinity);
+			mask = cpumask_of(cpu);
+		}
+
 		entry = alloc_msi_entry(&dev->dev);
 		if (!entry) {
 			if (!i)
@@ -699,6 +708,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 		entry->msi_attrib.default_irq	= dev->irq;
 		entry->mask_base		= base;
 		entry->nvec_used		= 1;
+		entry->affinity			= mask;
 
 		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
 	}
@@ -1176,12 +1186,20 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 {
 	unsigned int vecs, i;
 	u32 *irqs;
+	int ret;
 
 	max_vecs = min(max_vecs, pci_nr_irq_vectors(dev));
 
+	ret = irq_create_affinity_mask(&dev->irq_affinity, &max_vecs);
+	if (ret)
+		return ret;
+	if (max_vecs < min_vecs)
+		return -ENOSPC;
+
+	ret = -ENOMEM;
 	irqs = kcalloc(max_vecs, sizeof(u32), GFP_KERNEL);
 	if (!irqs)
-		return -ENOMEM;
+		goto out_free_affinity;
 
 	if (!(flags & PCI_IRQ_NOMSIX)) {
 		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
@@ -1208,6 +1226,10 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 done:
 	dev->irqs = irqs;
 	return vecs;
+out_free_affinity:
+	kfree(dev->irq_affinity);
+	dev->irq_affinity = NULL;
+	return ret;
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 84a20fc..f474611 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	 */
 	unsigned int	irq;
 	unsigned int	*irqs;
+	struct cpumask	*irq_affinity;
 	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
 
 	bool match_driver;		/* Skip attaching driver */
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-14 19:59 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Currently blk-mq will totally remap hardware context when a CPU hotplug
even happened, which causes major havoc for drivers, as they are never
told about this remapping.  E.g. any carefully sorted out CPU affinity
will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is.  If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simplify retain the queue mapping for
a soft offlined CPU that will work when it comes back online, and will
map any newly onlined CPU to queue NULL until the driver initiates
a rebuild of the queue map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc7166d..f972d32 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2114,8 +2114,6 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
-
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
 	 * we should change hctx numa_node according to new topology (this
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple.  Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering from the additional indirection.  The same
would apply to the mq_ops pointer as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 22 ++++++++++++++--------
 include/linux/blk-mq.h |  1 +
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f972d32..622cb22 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1930,7 +1930,6 @@ void blk_mq_release(struct request_queue *q)
 		kfree(hctx);
 	}
 
-	kfree(q->mq_map);
 	q->mq_map = NULL;
 
 	kfree(q->queue_hw_ctx);
@@ -2029,9 +2028,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = blk_mq_make_queue_map(set);
-	if (!q->mq_map)
-		goto err_map;
+	q->mq_map = set->mq_map;
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
@@ -2081,8 +2078,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	return q;
 
 err_hctxs:
-	kfree(q->mq_map);
-err_map:
 	kfree(q->queue_hw_ctx);
 err_percpu:
 	free_percpu(q->queue_ctx);
@@ -2304,14 +2299,22 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
+	set->mq_map = blk_mq_make_queue_map(set);
+	if (!set->mq_map)
+		goto out_free_tags;
+
 	if (blk_mq_alloc_rq_maps(set))
-		goto enomem;
+		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
-enomem:
+
+out_free_mq_map:
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
 	return -ENOMEM;
@@ -2327,6 +2330,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 			blk_mq_free_rq_map(set, set->tags[i], i);
 	}
 
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+
 	kfree(set->tags);
 	set->tags = NULL;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2498fdf..0a3b138 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -65,6 +65,7 @@ struct blk_mq_hw_ctx {
 };
 
 struct blk_mq_tag_set {
+	unsigned int		*mq_map;
 	struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-07-04  8:15   ` Alexander Gordeev
  2016-06-14 19:59 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask we will create it using the genirq helper.  As this helper was
modelled after the blk-mq algorithm there should be no change in
behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Makefile         |   2 +-
 block/blk-mq-cpumap.c  | 120 -------------------------------------------------
 block/blk-mq.c         |  72 ++++++++++++++++++++++++++---
 block/blk-mq.h         |   8 ----
 include/linux/blk-mq.h |   1 +
 5 files changed, 69 insertions(+), 134 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o \
-			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..0000000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include <linux/kernel.h>
-#include <linux/threads.h>
-#include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <linux/blk-mq.h>
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
-{
-	return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-			    const struct cpumask *online_mask)
-{
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return 1;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
-
-		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
-		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
-		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-							queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
-	}
-
-	free_cpumask_var(cpus);
-	return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-	int i;
-
-	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
-			return local_memory_node(cpu_to_node(i));
-	}
-
-	return NUMA_NO_NODE;
-}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 622cb22..6027a49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/interrupt.h>
 
 #include <trace/events/block.h>
 
@@ -1954,6 +1955,22 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+/*
+ * We have no quick way of doing reverse lookups. This is only used at
+ * queue init time, so runtime isn't important.
+ */
+static int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (index == mq_map[i])
+			return local_memory_node(cpu_to_node(i));
+	}
+
+	return NUMA_NO_NODE;
+}
+
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
@@ -2253,6 +2270,30 @@ struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
 }
 EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
 
+static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
+		const struct cpumask *affinity_mask)
+{
+	int queue = -1, cpu = 0;
+
+	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
+			GFP_KERNEL, set->numa_node);
+	if (!set->mq_map)
+		return -ENOMEM;
+
+	if (!affinity_mask)
+		return 0;	/* map all cpus to queue 0 */
+
+	/* If cpus are offline, map them to first hctx */
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, affinity_mask))
+			queue++;
+		if (queue > 0)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2261,6 +2302,8 @@ EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
+	int ret;
+
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
 	if (!set->nr_hw_queues)
@@ -2299,11 +2342,30 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
-	set->mq_map = blk_mq_make_queue_map(set);
-	if (!set->mq_map)
-		goto out_free_tags;
+	/*
+	 * Use the passed in affinity mask if the driver provided one.
+	 */
+	if (set->affinity_mask) {
+		ret = blk_mq_create_mq_map(set, set->affinity_mask);
+		if (!set->mq_map)
+			goto out_free_tags;
+	} else {
+		struct cpumask *affinity_mask;
 
-	if (blk_mq_alloc_rq_maps(set))
+		ret = irq_create_affinity_mask(&affinity_mask,
+					       &set->nr_hw_queues);
+		if (ret)
+			goto out_free_tags;
+
+		ret = blk_mq_create_mq_map(set, affinity_mask);
+		kfree(affinity_mask);
+
+		if (!set->mq_map)
+			goto out_free_tags;
+	}
+
+	ret = blk_mq_alloc_rq_maps(set);
+	if (ret)
 		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
@@ -2317,7 +2379,7 @@ out_free_mq_map:
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
-	return -ENOMEM;
+	return ret;
 }
 EXPORT_SYMBOL(blk_mq_alloc_tag_set);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..fe7e21f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -45,14 +45,6 @@ void blk_mq_enable_hotplug(void);
 void blk_mq_disable_hotplug(void);
 
 /*
- * CPU -> queue mappings
- */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-				   const struct cpumask *online_mask);
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-
-/*
  * sysfs helpers
  */
 extern int blk_mq_sysfs_register(struct request_queue *q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0a3b138..404cc86 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -75,6 +75,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-14 19:59 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

This sorts out MSI-X vs MSI vs legacy irq, as well as IRQ affinity
under the hood.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 89 +++++++++++++------------------------------------
 1 file changed, 23 insertions(+), 66 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index dc39924..96db711 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -88,7 +88,6 @@ struct nvme_dev {
 	unsigned max_qid;
 	int q_depth;
 	u32 db_stride;
-	struct msix_entry *entry;
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct remove_work;
@@ -201,6 +200,11 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
 		nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
+static int nvmeq_irq(struct nvme_queue *nvmeq)
+{
+	return to_pci_dev(nvmeq->dev->dev)->irqs[nvmeq->cq_vector];
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 				unsigned int hctx_idx)
 {
@@ -954,7 +958,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 		spin_unlock_irq(&nvmeq->q_lock);
 		return 1;
 	}
-	vector = nvmeq->dev->entry[nvmeq->cq_vector].vector;
+	vector = nvmeq_irq(nvmeq);
 	nvmeq->dev->online_queues--;
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -962,7 +966,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_stop_hw_queues(nvmeq->dev->ctrl.admin_q);
 
-	irq_set_affinity_hint(vector, NULL);
 	free_irq(vector, nvmeq);
 
 	return 0;
@@ -1073,11 +1076,11 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 							const char *name)
 {
 	if (use_threaded_interrupts)
-		return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-					nvme_irq_check, nvme_irq, IRQF_SHARED,
-					name, nvmeq);
-	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
-				IRQF_SHARED, name, nvmeq);
+		return request_threaded_irq(nvmeq_irq(nvmeq), nvme_irq_check,
+				nvme_irq, IRQF_SHARED, name, nvmeq);
+	else
+		return request_irq(nvmeq_irq(nvmeq), nvme_irq, IRQF_SHARED,
+				name, nvmeq);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
@@ -1376,7 +1379,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int result, i, vecs, nr_io_queues, size;
+	int result, nr_io_queues, size;
 
 	nr_io_queues = num_online_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1411,29 +1414,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	}
 
 	/* Deregister the admin queue's interrupt */
-	free_irq(dev->entry[0].vector, adminq);
+	free_irq(pdev->irqs[0], adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
 	 */
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
-
-	for (i = 0; i < nr_io_queues; i++)
-		dev->entry[i].entry = i;
-	vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
-	if (vecs < 0) {
-		vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
-		if (vecs < 0) {
-			vecs = 1;
-		} else {
-			for (i = 0; i < vecs; i++)
-				dev->entry[i].vector = i + pdev->irq;
-		}
-	}
+	pci_free_irq_vectors(pdev);
+	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues, 0);
+	if (nr_io_queues <= 0)
+		return -EIO;
+	dev->max_qid = nr_io_queues;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -1441,8 +1432,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * path to scale better, even if the receive path is limited by the
 	 * number of interrupts.
 	 */
-	nr_io_queues = vecs;
-	dev->max_qid = nr_io_queues;
 
 	result = queue_request_irq(dev, adminq, adminq->irqname);
 	if (result) {
@@ -1456,23 +1445,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	return result;
 }
 
-static void nvme_pci_post_scan(struct nvme_ctrl *ctrl)
-{
-	struct nvme_dev *dev = to_nvme_dev(ctrl);
-	struct nvme_queue *nvmeq;
-	int i;
-
-	for (i = 0; i < dev->online_queues; i++) {
-		nvmeq = dev->queues[i];
-
-		if (!nvmeq->tags || !(*nvmeq->tags))
-			continue;
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-					blk_mq_tags_cpumask(*nvmeq->tags));
-	}
-}
-
 static void nvme_del_queue_end(struct request *req, int error)
 {
 	struct nvme_queue *nvmeq = req->end_io_data;
@@ -1575,6 +1547,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.cmd_size = nvme_cmd_size(dev);
 		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 		dev->tagset.driver_data = dev;
+		dev->tagset.affinity_mask = to_pci_dev(dev->dev)->irq_affinity;
 
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
@@ -1614,15 +1587,9 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	 * interrupts. Pre-enable a single MSIX or MSI vec for setup. We'll
 	 * adjust this later.
 	 */
-	if (pci_enable_msix(pdev, dev->entry, 1)) {
-		pci_enable_msi(pdev);
-		dev->entry[0].vector = pdev->irq;
-	}
-
-	if (!dev->entry[0].vector) {
-		result = -ENODEV;
-		goto disable;
-	}
+	result = pci_alloc_irq_vectors(pdev, 1, 1, 0);
+	if (result < 0)
+		return result;
 
 	cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
 
@@ -1664,10 +1631,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
+	pci_free_irq_vectors(pdev);
 
 	if (pci_is_enabled(pdev)) {
 		pci_disable_pcie_error_reporting(pdev);
@@ -1736,7 +1700,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	if (dev->ctrl.admin_q)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 }
 
@@ -1879,7 +1842,6 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read64		= nvme_pci_reg_read64,
 	.reset_ctrl		= nvme_pci_reset_ctrl,
 	.free_ctrl		= nvme_pci_free_ctrl,
-	.post_scan		= nvme_pci_post_scan,
 	.submit_async_event	= nvme_pci_submit_async_event,
 };
 
@@ -1916,10 +1878,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
-							GFP_KERNEL, node);
-	if (!dev->entry)
-		goto free;
 	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
 							GFP_KERNEL, node);
 	if (!dev->queues)
@@ -1960,7 +1918,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_unmap(dev);
  free:
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 	return result;
 }
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 13/13] nvme: remove the post_scan callout
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
@ 2016-06-14 19:59 ` Christoph Hellwig
  2016-06-16  9:45 ` automatic interrupt affinity for MSI/MSI-X capable devices V2 Bart Van Assche
  2016-06-26 19:40 ` Alexander Gordeev
  14 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-14 19:59 UTC (permalink / raw)
  To: tglx, axboe; +Cc: linux-block, linux-pci, linux-nvme, linux-kernel

No need now that we don't have to reverse engineer the irq affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 3 ---
 drivers/nvme/host/nvme.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9d7cee4..b100cce 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1627,9 +1627,6 @@ static void nvme_scan_work(struct work_struct *work)
 	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 	mutex_unlock(&ctrl->namespaces_mutex);
 	kfree(id);
-
-	if (ctrl->ops->post_scan)
-		ctrl->ops->post_scan(ctrl);
 }
 
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 282421f..bb343b3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -150,7 +150,6 @@ struct nvme_ctrl_ops {
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 	int (*reset_ctrl)(struct nvme_ctrl *ctrl);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
-	void (*post_scan)(struct nvme_ctrl *ctrl);
 	void (*submit_async_event)(struct nvme_ctrl *ctrl, int aer_idx);
 };
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-14 19:58 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
@ 2016-06-14 21:54   ` Guilherme G. Piccoli
  2016-06-15  8:35     ` Bart Van Assche
  2016-06-15 10:10     ` Christoph Hellwig
  2016-06-25 20:05   ` Alexander Gordeev
  1 sibling, 2 replies; 54+ messages in thread
From: Guilherme G. Piccoli @ 2016-06-14 21:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-kernel, linux-nvme, gpiccoli

On 06/14/2016 04:58 PM, Christoph Hellwig wrote:
> This is lifted from the blk-mq code and adopted to use the affinity mask
> concept just intruced in the irq handling code.

Very nice patch Christoph, thanks. There's a little typo above, on 
"intruced".

>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   include/linux/interrupt.h | 11 +++++++++
>   kernel/irq/Makefile       |  1 +
>   kernel/irq/affinity.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 72 insertions(+)
>   create mode 100644 kernel/irq/affinity.c
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 9fcabeb..12003c0 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -278,6 +278,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
>   extern int
>   irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
>
> +int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs);
> +
>   #else /* CONFIG_SMP */
>
>   static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
> @@ -308,6 +311,14 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
>   {
>   	return 0;
>   }
> +
> +static inline int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs)
> +{
> +	*affinity_mask = NULL;
> +	*nr_vecs = 1;
> +	return 0;
> +}
>   #endif /* CONFIG_SMP */
>
>   /*
> diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
> index 2ee42e9..1d3ee31 100644
> --- a/kernel/irq/Makefile
> +++ b/kernel/irq/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
>   obj-$(CONFIG_PM_SLEEP) += pm.o
>   obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
>   obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
> +obj-$(CONFIG_SMP) += affinity.o
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> new file mode 100644
> index 0000000..1daf8fb
> --- /dev/null
> +++ b/kernel/irq/affinity.c
> @@ -0,0 +1,60 @@
> +
> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/cpu.h>
> +
> +static int get_first_sibling(unsigned int cpu)
> +{
> +	unsigned int ret;
> +
> +	ret = cpumask_first(topology_sibling_cpumask(cpu));
> +	if (ret < nr_cpu_ids)
> +		return ret;
> +	return cpu;
> +}
> +
> +/*
> + * Take a map of online CPUs and the number of available interrupt vectors
> + * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
> + * so that they are distributed as good as possible around the CPUs.  If
> + * more vectors than CPUs are available we'll map one to each CPU,
> + * otherwise we map one to the first sibling of each socket.
> + *
> + * If there are more vectors than CPUs we will still only have one bit
> + * set per CPU, but interrupt code will keep on assining the vectors from
> + * the start of the bitmap until we run out of vectors.
> + */

Another little typo above in "assining".

I take this opportunity to ask you something, since I'm working in a 
related code in a specific driver - sorry in advance if my question is 
silly or if I misunderstood your code.

The function irq_create_affinity_mask() below deals with the case in 
which we have nr_vecs < num_online_cpus(); in this case, wouldn't be a 
good idea to trying distribute the vecs among cores?

Example: if we have 128 online cpus, 8 per core (meaning 16 cores) and 
64 vecs, I guess would be ideal to distribute 4 vecs _per core_, leaving 
4 CPUs in each core without vecs.

Makes sense for you?
Thanks,


Guilherme


> +int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs)
> +{
> +	unsigned int vecs = 0;
> +
> +	if (*nr_vecs == 1) {
> +		*affinity_mask = NULL;
> +		return 0;
> +	}
> +
> +	*affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +	if (!*affinity_mask)
> +		return -ENOMEM;
> +
> +	if (*nr_vecs >= num_online_cpus()) {
> +		cpumask_copy(*affinity_mask, cpu_online_mask);
> +	} else {
> +		unsigned int cpu;
> +
> +		for_each_online_cpu(cpu) {
> +			if (cpu == get_first_sibling(cpu)) {
> +				cpumask_set_cpu(cpu, *affinity_mask);
> +				vecs++;
> +			}
> +
> +			if (--(*nr_vecs) == 0)
> +				break;
> +		}
> +	}
> +
> +	*nr_vecs = vecs;
> +	return 0;
> +}
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-14 21:54   ` Guilherme G. Piccoli
@ 2016-06-15  8:35     ` Bart Van Assche
  2016-06-15 10:10     ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15  8:35 UTC (permalink / raw)
  To: Guilherme G. Piccoli, Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-kernel, linux-nvme

On 06/14/2016 11:54 PM, Guilherme G. Piccoli wrote:
> On 06/14/2016 04:58 PM, Christoph Hellwig wrote:
> I take this opportunity to ask you something, since I'm working in a
> related code in a specific driver - sorry in advance if my question is
> silly or if I misunderstood your code.
>
> The function irq_create_affinity_mask() below deals with the case in
> which we have nr_vecs < num_online_cpus(); in this case, wouldn't be a
> good idea to trying distribute the vecs among cores?
>
> Example: if we have 128 online cpus, 8 per core (meaning 16 cores) and
> 64 vecs, I guess would be ideal to distribute 4 vecs _per core_, leaving
> 4 CPUs in each core without vecs.

Hello Christoph and Guilherme,

I would also like to see irq_create_affinity_mask() modified such that 
it implements Guilherme's algorithm. I think blk-mq requests should be 
processed by a CPU core from the NUMA node from which the request has 
been submitted. With the proposed algorithm, if the number of MSI-X 
vectors is less than or equal to the number of CPU cores of a single 
node, all interrupt vectors will be assigned to the first NUMA node.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-14 19:58 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
@ 2016-06-15  8:44   ` Bart Van Assche
  2016-06-15 10:23     ` Christoph Hellwig
  2016-06-16  9:08   ` Bart Van Assche
  1 sibling, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15  8:44 UTC (permalink / raw)
  To: Christoph Hellwig, tglx, axboe
  Cc: linux-block, linux-pci, linux-nvme, linux-kernel

On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Interupts marked with this flag are excluded from user space interrupt
> affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
> affinity mechanism is not blocked.
>
> This flag will be used for multi-queue device interrupts.

It's great to see that the goal of this patch series is to configure 
interrupt affinity automatically for adapters that support multiple 
MSI-X vectors. However, is excluding these interrupts from irqbalanced 
really the way to go? Suppose e.g. that a system is equipped with two 
RDMA adapters, that these adapters are used by a blk-mq enabled block 
initiator driver and that each adapter supports eight MSI-X vectors. 
Should the interrupts of the two RDMA adapters be assigned to different 
CPU cores? If so, which software layer should realize this? The kernel 
or user space?

Sorry that I missed the first version of this patch series.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-14 21:54   ` Guilherme G. Piccoli
  2016-06-15  8:35     ` Bart Van Assche
@ 2016-06-15 10:10     ` Christoph Hellwig
  2016-06-15 13:09       ` Guilherme G. Piccoli
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-15 10:10 UTC (permalink / raw)
  To: Guilherme G. Piccoli
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-kernel, linux-nvme

On Tue, Jun 14, 2016 at 06:54:22PM -0300, Guilherme G. Piccoli wrote:
> On 06/14/2016 04:58 PM, Christoph Hellwig wrote:
>> This is lifted from the blk-mq code and adopted to use the affinity mask
>> concept just intruced in the irq handling code.
>
> Very nice patch Christoph, thanks. There's a little typo above, on 
> "intruced".

fixed.

> Another little typo above in "assining".

fixed a swell.

> I take this opportunity to ask you something, since I'm working in a 
> related code in a specific driver

Which driver?  One of the points here is to get this sort of code out
of drivers and into common code..

> - sorry in advance if my question is 
> silly or if I misunderstood your code.
>
> The function irq_create_affinity_mask() below deals with the case in which 
> we have nr_vecs < num_online_cpus(); in this case, wouldn't be a good idea 
> to trying distribute the vecs among cores?
>
> Example: if we have 128 online cpus, 8 per core (meaning 16 cores) and 64 
> vecs, I guess would be ideal to distribute 4 vecs _per core_, leaving 4 
> CPUs in each core without vecs.

There have been some reports about the blk-mq IRQ distribution being
suboptimal, but no one sent patches so far.  This patch just moves the
existing algorithm into the core code to be better bisectable.

I think an algorithm that takes cores into account instead of just SMT
sibling would be very useful.  So if you have a case where this helps
for you an incremental patch (or even one against the current blk-mq
code for now) would be appreciated.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15  8:44   ` Bart Van Assche
@ 2016-06-15 10:23     ` Christoph Hellwig
  2016-06-15 10:42       ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-15 10:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

Hi Bart,

On Wed, Jun 15, 2016 at 10:44:37AM +0200, Bart Van Assche wrote:
> However, is excluding these interrupts from irqbalanced really the 
> way to go?

What positive effect will irqbalanced have on explcititly spread
interrupts?

> Suppose e.g. that a system is equipped with two RDMA adapters, 
> that these adapters are used by a blk-mq enabled block initiator driver and 
> that each adapter supports eight MSI-X vectors. Should the interrupts of 
> the two RDMA adapters be assigned to different CPU cores? If so, which 
> software layer should realize this? The kernel or user space?

RDMA should eventually use the interrupt spreading implemented in this
series, as should networking (RDMA actually is on my near term todo list).

RDMA block protocols will then pick up the queue information from the
HCA driver.  I've not actually implemented this yet, but my current idea
is:

 - the HCA drivers are switch to use pci_alloc_irq_vectors to spread
   their interrupt vectors around the system
 - the HCA drivers will expose the irq_affinity affinity array
   in struct ib_device (we'll need to consider what do about the
   odd completion vectors instead of irq terminology in the RDMA stack,
   but that's not a show stopper)
 - multiqueue aware block drivers will then feed the irq_affinity
   cpumask from the hca driver to blk-mq.  We'll also need to ensure
   the number of protocol queues aligns nicely to the number of hardware
   queues.  My current thinking is that they should be the same or
   a fraction of the hardware completion queues, but this might need
   some careful benchmarking.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 10:23     ` Christoph Hellwig
@ 2016-06-15 10:42       ` Bart Van Assche
  2016-06-15 15:14         ` Keith Busch
  0 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15 10:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On 06/15/2016 12:23 PM, Christoph Hellwig wrote:
> Hi Bart,
>
> On Wed, Jun 15, 2016 at 10:44:37AM +0200, Bart Van Assche wrote:
>> However, is excluding these interrupts from irqbalanced really the
>> way to go?
>
> What positive effect will irqbalanced have on explcititly spread
> interrupts?
>
>> Suppose e.g. that a system is equipped with two RDMA adapters,
>> that these adapters are used by a blk-mq enabled block initiator driver and
>> that each adapter supports eight MSI-X vectors. Should the interrupts of
>> the two RDMA adapters be assigned to different CPU cores? If so, which
>> software layer should realize this? The kernel or user space?
>
> RDMA should eventually use the interrupt spreading implemented in this
> series, as should networking (RDMA actually is on my near term todo list).
>
> RDMA block protocols will then pick up the queue information from the
> HCA driver.  I've not actually implemented this yet, but my current idea
> is:
>
>  - the HCA drivers are switch to use pci_alloc_irq_vectors to spread
>    their interrupt vectors around the system
>  - the HCA drivers will expose the irq_affinity affinity array
>    in struct ib_device (we'll need to consider what do about the
>    odd completion vectors instead of irq terminology in the RDMA stack,
>    but that's not a show stopper)
>  - multiqueue aware block drivers will then feed the irq_affinity
>    cpumask from the hca driver to blk-mq.  We'll also need to ensure
>    the number of protocol queues aligns nicely to the number of hardware
>    queues.  My current thinking is that they should be the same or
>    a fraction of the hardware completion queues, but this might need
>    some careful benchmarking.

Hello Christoph,

Today irqbalanced is responsible for deciding how to assign interrupts 
from different adapters to CPU cores. Does the above mean that for 
adapters that support multiple MSI-X interrupts the kernel will have 
full responsibility for assigning interrupt vectors to CPU cores?

If two identical adapters are present in a system, will these generate 
the same irq_affinity mask? Do you agree that interrupt vectors from 
different adapters should be assigned to different CPU cores if enough 
CPU cores are available? If so, which software layer will assign 
interrupt vectors from different adapters to different CPU cores?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-15 10:10     ` Christoph Hellwig
@ 2016-06-15 13:09       ` Guilherme G. Piccoli
  2016-06-16 15:16         ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Guilherme G. Piccoli @ 2016-06-15 13:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, linux-pci, linux-kernel, linux-nvme, axboe, tglx,
	bart.vanassche

Thanks for the responses Bart and Christoph.


On 06/15/2016 07:10 AM, Christoph Hellwig wrote:
> On Tue, Jun 14, 2016 at 06:54:22PM -0300, Guilherme G. Piccoli wrote:
>> On 06/14/2016 04:58 PM, Christoph Hellwig wrote:
>>> This is lifted from the blk-mq code and adopted to use the affinity mask
>>> concept just intruced in the irq handling code.
>>
>> Very nice patch Christoph, thanks. There's a little typo above, on
>> "intruced".
>
> fixed.
>
>> Another little typo above in "assining".
>
> fixed a swell.
>
>> I take this opportunity to ask you something, since I'm working in a
>> related code in a specific driver
>
> Which driver?  One of the points here is to get this sort of code out
> of drivers and into common code..

A network driver, i40e. I'd be glad to implement/see some common code to 
raise the topology information I need, but I was implementing on i40e 
more as a test case/toy example heheh...


>> - sorry in advance if my question is
>> silly or if I misunderstood your code.
>>
>> The function irq_create_affinity_mask() below deals with the case in which
>> we have nr_vecs < num_online_cpus(); in this case, wouldn't be a good idea
>> to trying distribute the vecs among cores?
>>
>> Example: if we have 128 online cpus, 8 per core (meaning 16 cores) and 64
>> vecs, I guess would be ideal to distribute 4 vecs _per core_, leaving 4
>> CPUs in each core without vecs.
>
> There have been some reports about the blk-mq IRQ distribution being
> suboptimal, but no one sent patches so far.  This patch just moves the
> existing algorithm into the core code to be better bisectable.
>
> I think an algorithm that takes cores into account instead of just SMT
> sibling would be very useful.  So if you have a case where this helps
> for you an incremental patch (or even one against the current blk-mq
> code for now) would be appreciated.

...but now I'll focus on the common/general case! Thanks for the 
suggestion Christoph. I guess would be even better to have a generic 
function that retrieves an optimal mask, something like 
topology_get_optimal_mask(n, *cpumask), in which we get the best 
distribution of n CPUs among all cores and return such a mask - 
interesting case is when n < num_online_cpus. So, this function could be 
used inside your irq_create_affinity_mask() and maybe in other places it 
is needed.

I was planning to use topology_core_id() to retrieve the core of a CPU, 
if anybody has a better idea, I'd be glad to hear it.

Cheers,


Guilherme


>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 10:42       ` Bart Van Assche
@ 2016-06-15 15:14         ` Keith Busch
  2016-06-15 15:28           ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Keith Busch @ 2016-06-15 15:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 12:42:53PM +0200, Bart Van Assche wrote:
> Today irqbalanced is responsible for deciding how to assign interrupts from
> different adapters to CPU cores. Does the above mean that for adapters that
> support multiple MSI-X interrupts the kernel will have full responsibility
> for assigning interrupt vectors to CPU cores?

Hi Bart,

Right, the kernel would be responsible for assigning interrupt vectors to
cores. The kernel is already responsible for setting the affinity hint,
but we want direct control because we can do a better than irqbalance,
which has been a problem point for users.

Many adapters gain significant performance when irqbalance is using
"exact" hint policy. But that's not irqbalance's default setting, and
we don't necessarily want to enforce "exact" on the entire system when
only a subset of devices benefit from such a setup.
 
> If two identical adapters are present in a system, will these generate the
> same irq_affinity mask? Do you agree that interrupt vectors from different
> adapters should be assigned to different CPU cores if enough CPU cores are
> available? If so, which software layer will assign interrupt vectors from
> different adapters to different CPU cores?

I think the idea is have the irq_affinity mask match the CPU mapping on
the submission side context associated with that particular vector. If
two identical adapters generate the same submission CPU mapping, I don't
think we can do better than matching irq_affinity masks.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 15:14         ` Keith Busch
@ 2016-06-15 15:28           ` Bart Van Assche
  2016-06-15 16:03             ` Keith Busch
  0 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15 15:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On 06/15/2016 05:14 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 12:42:53PM +0200, Bart Van Assche wrote:
>> If two identical adapters are present in a system, will these generate the
>> same irq_affinity mask? Do you agree that interrupt vectors from different
>> adapters should be assigned to different CPU cores if enough CPU cores are
>> available? If so, which software layer will assign interrupt vectors from
>> different adapters to different CPU cores?
>
> I think the idea is have the irq_affinity mask match the CPU mapping on
> the submission side context associated with that particular vector. If
> two identical adapters generate the same submission CPU mapping, I don't
> think we can do better than matching irq_affinity masks.

Has this been verified by measurements? Sorry but I'm not convinced that 
using the same mapping for multiple identical adapters instead of 
spreading interrupts will result in better performance.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 15:28           ` Bart Van Assche
@ 2016-06-15 16:03             ` Keith Busch
  2016-06-15 19:36               ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Keith Busch @ 2016-06-15 16:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 05:28:54PM +0200, Bart Van Assche wrote:
> On 06/15/2016 05:14 PM, Keith Busch wrote:
> >I think the idea is have the irq_affinity mask match the CPU mapping on
> >the submission side context associated with that particular vector. If
> >two identical adapters generate the same submission CPU mapping, I don't
> >think we can do better than matching irq_affinity masks.
> 
> Has this been verified by measurements? Sorry but I'm not convinced that
> using the same mapping for multiple identical adapters instead of spreading
> interrupts will result in better performance.

The interrupts automatically spread based on which CPU submitted the
work. If you want to spread interrupts across more CPUs, then you can
spread submissions to the CPUs you want to service the interrupts.

Completing work on the same CPU that submitted it is quickest with
its cache hot access. I have equipment available to demo this. What
affinty_mask policy would you like to see compared with the proposal?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 16:03             ` Keith Busch
@ 2016-06-15 19:36               ` Bart Van Assche
  2016-06-15 20:06                 ` Keith Busch
  2016-06-16 15:20                 ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15 19:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On 06/15/2016 06:03 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 05:28:54PM +0200, Bart Van Assche wrote:
>> On 06/15/2016 05:14 PM, Keith Busch wrote:
>>> I think the idea is have the irq_affinity mask match the CPU mapping on
>>> the submission side context associated with that particular vector. If
>>> two identical adapters generate the same submission CPU mapping, I don't
>>> think we can do better than matching irq_affinity masks.
>>
>> Has this been verified by measurements? Sorry but I'm not convinced that
>> using the same mapping for multiple identical adapters instead of spreading
>> interrupts will result in better performance.
>
> The interrupts automatically spread based on which CPU submitted the
> work. If you want to spread interrupts across more CPUs, then you can
> spread submissions to the CPUs you want to service the interrupts.
>
> Completing work on the same CPU that submitted it is quickest with
> its cache hot access. I have equipment available to demo this. What
> affinty_mask policy would you like to see compared with the proposal?

Hello Keith,

Sorry that I had not yet this made this clear but my concern is about a 
system equipped with two or more adapters and with more CPU cores than 
the number of MSI-X interrupts per adapter. Consider e.g. a system with 
two adapters (A and B), 8 interrupts per adapter (A0..A7 and B0..B7), 32 
CPU cores and two NUMA nodes. Assuming that hyperthreading is disabled, 
will the patches from this patch series generate the following interrupt 
assignment?

0: A0 B0
1: A1 B1
2: A2 B2
3: A3 B3
4: A4 B4
5: A5 B5
6: A6 B6
7: A7 B7
8: (none)
...
31: (none)

The mapping I would like to see is as follows (assuming CPU cores 0..15 
correspond to NUMA node 0 and CPU cores 16..31 correspond to NUMA node 1):

0: A0
1: B0
2: (none)
3: (none)
4: A1
5: B1
6: (none)
7: (none)
8: A2
9: B2
10: (none)
11: (none)
12: A3
13: B3
14: (none)
15: (none)
...
31: (none)

Do you agree that - ignoring other interrupt assignments - that the 
latter interrupt assignment scheme would result in higher throughput and 
lower interrupt processing latency?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 19:36               ` Bart Van Assche
@ 2016-06-15 20:06                 ` Keith Busch
  2016-06-15 20:12                   ` Keith Busch
  2016-06-16 15:20                 ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Keith Busch @ 2016-06-15 20:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
> Sorry that I had not yet this made this clear but my concern is about a
> system equipped with two or more adapters and with more CPU cores than the
> number of MSI-X interrupts per adapter. Consider e.g. a system with two
> adapters (A and B), 8 interrupts per adapter (A0..A7 and B0..B7), 32 CPU
> cores and two NUMA nodes. Assuming that hyperthreading is disabled, will the
> patches from this patch series generate the following interrupt assignment?
> 
> 0: A0 B0
> 1: A1 B1
> 2: A2 B2
> 3: A3 B3
> 4: A4 B4
> 5: A5 B5
> 6: A6 B6
> 7: A7 B7
> 8: (none)
> ...
> 31: (none)

I'll need to look at the follow on patches do to confirm, but that's
not what this should do. All CPU's should have a vector assigned because
every CPU needs to be assigned a submission context using a vector. In
your example, every vector's affinity mask should be assigned to 4 CPUs:
vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.

If it's done such that all CPUs are assigned and no sharing occurs across
NUMA nodes, does that change your concern?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 20:06                 ` Keith Busch
@ 2016-06-15 20:12                   ` Keith Busch
  2016-06-15 20:50                     ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Keith Busch @ 2016-06-15 20:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 04:06:55PM -0400, Keith Busch wrote:
> > 
> > 0: A0 B0
> > 1: A1 B1
> > 2: A2 B2
> > 3: A3 B3
> > 4: A4 B4
> > 5: A5 B5
> > 6: A6 B6
> > 7: A7 B7
> > 8: (none)
> > ...
> > 31: (none)
> 
> I'll need to look at the follow on patches do to confirm, but that's
> not what this should do. All CPU's should have a vector assigned because
> every CPU needs to be assigned a submission context using a vector. In
> your example, every vector's affinity mask should be assigned to 4 CPUs:
> vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.

  ^^^^^^

Sorry, I meant "CPU '8'", not "vector '8'".

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 20:12                   ` Keith Busch
@ 2016-06-15 20:50                     ` Bart Van Assche
  2016-06-16 15:19                       ` Keith Busch
  0 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-15 20:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On 06/15/2016 10:12 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 04:06:55PM -0400, Keith Busch wrote:
>>>
>>> 0: A0 B0
>>> 1: A1 B1
>>> 2: A2 B2
>>> 3: A3 B3
>>> 4: A4 B4
>>> 5: A5 B5
>>> 6: A6 B6
>>> 7: A7 B7
>>> 8: (none)
>>> ...
>>> 31: (none)
>>
>> I'll need to look at the follow on patches do to confirm, but that's
>> not what this should do. All CPU's should have a vector assigned because
>> every CPU needs to be assigned a submission context using a vector. In
>> your example, every vector's affinity mask should be assigned to 4 CPUs:
>> vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.
>
>   ^^^^^^
>
> Sorry, I meant "CPU '8'", not "vector '8'".

Hello Keith,

Does it matter on x86 systems whether or not these interrupt vectors are 
also associated with a CPU with a higher CPU number? Although multiple 
bits can be set in /proc/irq/<n>/smp_affinity only the first bit counts 
on x86 platforms. In default_cpu_mask_to_apicid_and() it is easy to see 
that only the first bit that has been set in that mask counts on x86 
systems.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
  2016-06-14 19:58 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
@ 2016-06-16  9:05   ` Bart Van Assche
  0 siblings, 0 replies; 54+ messages in thread
From: Bart Van Assche @ 2016-06-16  9:05 UTC (permalink / raw)
  To: Christoph Hellwig, tglx, axboe
  Cc: linux-block, linux-pci, linux-nvme, linux-kernel

On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> No user and we definitely don't want to grow one.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-14 19:58 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
  2016-06-15  8:44   ` Bart Van Assche
@ 2016-06-16  9:08   ` Bart Van Assche
  1 sibling, 0 replies; 54+ messages in thread
From: Bart Van Assche @ 2016-06-16  9:08 UTC (permalink / raw)
  To: Thomas Gleixner, Christoph Hellwig, axboe
  Cc: linux-block, linux-pci, linux-nvme, linux-kernel

On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index 4d758a7..49d66d1 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -197,6 +197,7 @@ struct irq_data {
>   * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
>   * IRQD_WAKEUP_ARMED		- Wakeup mode armed
>   * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
> + * IRQD_AFFINITY_MANAGED	- Affinity is managed automatically
>   */

Does "managed automatically" mean managed by software? If so, I think it 
would help to mention which software manages IRQ affinity if the 
IRQD_AFFINITY_MANAGED flag has been set.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: automatic interrupt affinity for MSI/MSI-X capable devices V2
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2016-06-14 19:59 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
@ 2016-06-16  9:45 ` Bart Van Assche
  2016-06-16 15:22   ` Christoph Hellwig
  2016-06-26 19:40 ` Alexander Gordeev
  14 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-16  9:45 UTC (permalink / raw)
  To: Christoph Hellwig, tglx, axboe
  Cc: linux-block, linux-pci, linux-nvme, linux-kernel

On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> This series enhances the irq and PCI code to allow spreading around MSI and
> MSI-X vectors so that they have per-cpu affinity if possible, or at least
> per-node.  For that it takes the algorithm from blk-mq, moves it to
> a common place, and makes it available through a vastly simplified PCI
> interrupt allocation API.  It then switches blk-mq to be able to pick up
> the queue mapping from the device if available, and demonstrates all this
> using the NVMe driver.

Hello Christoph,

Is my interpretation correct that for an adapter that supports two 
interrupts and on a system with eight CPU cores and no hyperthreading 
this patch series will assign interrupt vector 0 to CPU 0 and interrupt 
vector 1 to CPU 1? Are you aware that drivers like ib_srp assume that 
interrupts have been spread evenly, that means assigning vector 0 to CPU 
0 and vector 1 to CPU 4?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-15 13:09       ` Guilherme G. Piccoli
@ 2016-06-16 15:16         ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-16 15:16 UTC (permalink / raw)
  To: Guilherme G. Piccoli
  Cc: Christoph Hellwig, linux-block, linux-pci, linux-kernel,
	linux-nvme, axboe, tglx, bart.vanassche

>
> ...but now I'll focus on the common/general case! Thanks for the suggestion 
> Christoph. I guess would be even better to have a generic function that 
> retrieves an optimal mask, something like topology_get_optimal_mask(n, 
> *cpumask), in which we get the best distribution of n CPUs among all cores 
> and return such a mask - interesting case is when n < num_online_cpus. So, 
> this function could be used inside your irq_create_affinity_mask() and 
> maybe in other places it is needed.

Yes, we should probably just plug this in where we're using the current
routines.  Block very much optimizes for the cases of either 1 queue
or enough queues for all cpus at the moments.  It would be good to check
what the network drivers currently do.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 20:50                     ` Bart Van Assche
@ 2016-06-16 15:19                       ` Keith Busch
  2016-06-22 11:56                         ` Alexander Gordeev
  0 siblings, 1 reply; 54+ messages in thread
From: Keith Busch @ 2016-06-16 15:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 10:50:53PM +0200, Bart Van Assche wrote:
> Does it matter on x86 systems whether or not these interrupt vectors are
> also associated with a CPU with a higher CPU number? Although multiple bits
> can be set in /proc/irq/<n>/smp_affinity only the first bit counts on x86
> platforms. In default_cpu_mask_to_apicid_and() it is easy to see that only
> the first bit that has been set in that mask counts on x86 systems.

Wow, thanks for the information. I didn't know the apic wasn't using
the full cpu mask, so this changes how I need to look at this, and will
experiment with such a configuration.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-15 19:36               ` Bart Van Assche
  2016-06-15 20:06                 ` Keith Busch
@ 2016-06-16 15:20                 ` Christoph Hellwig
  2016-06-16 15:39                   ` Bart Van Assche
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-16 15:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Christoph Hellwig, tglx, axboe, linux-block,
	linux-pci, linux-nvme, linux-kernel

On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
> Do you agree that - ignoring other interrupt assignments - that the latter 
> interrupt assignment scheme would result in higher throughput and lower 
> interrupt processing latency?

Probably.  Once we've got it in the core IRQ code we can tweak the
algorithm to be optimal.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: automatic interrupt affinity for MSI/MSI-X capable devices V2
  2016-06-16  9:45 ` automatic interrupt affinity for MSI/MSI-X capable devices V2 Bart Van Assche
@ 2016-06-16 15:22   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-16 15:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Thu, Jun 16, 2016 at 11:45:55AM +0200, Bart Van Assche wrote:
> Is my interpretation correct that for an adapter that supports two 
> interrupts and on a system with eight CPU cores and no hyperthreading this 
> patch series will assign interrupt vector 0 to CPU 0 and interrupt vector 1 
> to CPU 1?

Yes - same as the existing blk-mq queue distribution.

> Are you aware that drivers like ib_srp assume that interrupts 
> have been spread evenly, that means assigning vector 0 to CPU 0 and vector 
> 1 to CPU 4?

which will make them run into a conflict with the current blk-mq
assignment.  That's exactly the point why we're trying to move things
to a core location so that everyone can use the same mapping.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-16 15:20                 ` Christoph Hellwig
@ 2016-06-16 15:39                   ` Bart Van Assche
  2016-06-20 12:22                     ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-16 15:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, tglx, axboe, linux-block, linux-pci, linux-nvme,
	linux-kernel

On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>> Do you agree that - ignoring other interrupt assignments - that the latter
>> interrupt assignment scheme would result in higher throughput and lower
>> interrupt processing latency?
>
> Probably.  Once we've got it in the core IRQ code we can tweak the
> algorithm to be optimal.

Sorry but I'm afraid that we are embedding policy in the kernel, 
something we should not do. I know that there are workloads for which 
dedicating some CPU cores to interrupt processing and other CPU cores to 
running kernel threads improves throughput, probably because this 
results in less cache eviction on the CPU cores that run kernel threads 
and some degree of interrupt coalescing on the CPU cores that process 
interrupts. My concern is that I doubt that there is an interrupt 
assignment scheme that works optimally for all workloads. Hence my 
request to preserve the ability to modify interrupt affinity from user 
space.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-16 15:39                   ` Bart Van Assche
@ 2016-06-20 12:22                     ` Christoph Hellwig
  2016-06-20 13:21                       ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-20 12:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, tglx, axboe, linux-block,
	linux-pci, linux-nvme, linux-kernel

On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>>> Do you agree that - ignoring other interrupt assignments - that the latter
>>> interrupt assignment scheme would result in higher throughput and lower
>>> interrupt processing latency?
>>
>> Probably.  Once we've got it in the core IRQ code we can tweak the
>> algorithm to be optimal.
>
> Sorry but I'm afraid that we are embedding policy in the kernel, something 
> we should not do. I know that there are workloads for which dedicating some 
> CPU cores to interrupt processing and other CPU cores to running kernel 
> threads improves throughput, probably because this results in less cache 
> eviction on the CPU cores that run kernel threads and some degree of 
> interrupt coalescing on the CPU cores that process interrupts.

And you can still easily set this use case up by chosing less queues
(aka interrupts) than CPUs and assining your workload to the other
cores.

> My concern 
> is that I doubt that there is an interrupt assignment scheme that works 
> optimally for all workloads. Hence my request to preserve the ability to 
> modify interrupt affinity from user space.

I'd say let's do such an interface incrementall based on the use
case - especially after we get networking over to use common code
to distribute the interrupts.  If you were doing something like this
with the current blk-mq code it wouldn't work very well due to the
fact that you'd have a mismatch between the assigned interrupt and
the blk-mq queue mapping anyway.

It might be a good idea to start brainstorming how we'd want to handle
this change - we'd basically need a per-device notification that the
interrupt mapping changes so that we can rebuild the queue mapping,
which is somewhat similar to the lib/cpu_rmap.c code used by a few
networking drivers.  This would also help with dealing with cpu
hotplug events that change the cpu mapping.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-20 12:22                     ` Christoph Hellwig
@ 2016-06-20 13:21                       ` Bart Van Assche
  2016-06-21 14:31                         ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Bart Van Assche @ 2016-06-20 13:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, tglx, axboe, linux-block, linux-pci, linux-nvme,
	linux-kernel

On 06/20/2016 02:22 PM, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
>> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>> My concern
>> is that I doubt that there is an interrupt assignment scheme that works
>> optimally for all workloads. Hence my request to preserve the ability to
>> modify interrupt affinity from user space.
>
> I'd say let's do such an interface incrementally based on the use
> case - especially after we get networking over to use common code
> to distribute the interrupts.  If you were doing something like this
> with the current blk-mq code it wouldn't work very well due to the
> fact that you'd have a mismatch between the assigned interrupt and
> the blk-mq queue mapping anyway.
>
> It might be a good idea to start brainstorming how we'd want to handle
> this change - we'd basically need a per-device notification that the
> interrupt mapping changes so that we can rebuild the queue mapping,
> which is somewhat similar to the lib/cpu_rmap.c code used by a few
> networking drivers.  This would also help with dealing with cpu
> hotplug events that change the cpu mapping.

A notification mechanism that reports interrupt mapping changes will 
definitely help. What would also help is an API that allows drivers to 
query the MSI-X IRQ of an adapter that is nearest given a cpumask, e.g. 
hctx->cpumask. Another function can then map that IRQ into an index in 
the range 0..n-1 where n is the number of MSI-X interrupts for that 
adapter. Every blk-mq/scsi-mq driver will need this functionality to 
decide which IRQ to associate with a block layer hctx.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-20 13:21                       ` Bart Van Assche
@ 2016-06-21 14:31                         ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-21 14:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, tglx, axboe, linux-block,
	linux-pci, linux-nvme, linux-kernel

On Mon, Jun 20, 2016 at 03:21:47PM +0200, Bart Van Assche wrote:
> A notification mechanism that reports interrupt mapping changes will 
> definitely help. What would also help is an API that allows drivers to 
> query the MSI-X IRQ of an adapter that is nearest given a cpumask, e.g. 
> hctx->cpumask.

This is still the wrong way around - we need to build the blk-mq queue
mappings based on the interrupts, not the other way around.

> Another function can then map that IRQ into an index in the 
> range 0..n-1 where n is the number of MSI-X interrupts for that adapter. 
> Every blk-mq/scsi-mq driver will need this functionality to decide which 
> IRQ to associate with a block layer hctx.

This is something that should be done in commmon code and is done in
common code in this series - the driver passes a cpumask to blk-mq,
and blk-mq creates a queue for every cpu that is set in the cpumask.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
  2016-06-16 15:19                       ` Keith Busch
@ 2016-06-22 11:56                         ` Alexander Gordeev
  0 siblings, 0 replies; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-22 11:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Christoph Hellwig, tglx, axboe, linux-block,
	linux-pci, linux-nvme, linux-kernel

On Thu, Jun 16, 2016 at 11:19:51AM -0400, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 10:50:53PM +0200, Bart Van Assche wrote:
> > Does it matter on x86 systems whether or not these interrupt vectors are
> > also associated with a CPU with a higher CPU number? Although multiple bits
> > can be set in /proc/irq/<n>/smp_affinity only the first bit counts on x86
> > platforms. In default_cpu_mask_to_apicid_and() it is easy to see that only
> > the first bit that has been set in that mask counts on x86 systems.
> 
> Wow, thanks for the information. I didn't know the apic wasn't using
> the full cpu mask, so this changes how I need to look at this, and will
> experiment with such a configuration.

I have vague memories of this, but you probably need to check PPC as well.
Its interrupt distribution is not straightforward as well, AFAIR.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-06-14 19:59 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
@ 2016-06-23 11:16   ` Alexander Gordeev
  2016-06-30 16:54     ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-23 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:59:00PM +0200, Christoph Hellwig wrote:
> Add a helper to allocate a range of interrupt vectors, which will
> transparently use MSI-X and MSI if available or fallback to legacy
> vectors.  The interrupts are available in a core managed array
> in the pci_dev structure, and can also be released using a similar
> helper.
> 
> The next patch will also add automatic spreading of MSI / MSI-X
> vectors to this function.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/pci/msi.c   | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h |  18 +++++++++

New APIs should be documented in Documentation/PCI/MSI-HOWTO.txt, I guess.

>  2 files changed, 128 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index a080f44..a33adec 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -4,6 +4,7 @@
>   *
>   * Copyright (C) 2003-2004 Intel
>   * Copyright (C) Tom Long Nguyen (tom.l.nguyen@intel.com)
> + * Copyright (c) 2016 Christoph Hellwig.
>   */
>  
>  #include <linux/err.h>
> @@ -1120,6 +1121,115 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
>  }
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
> +static unsigned int pci_nr_irq_vectors(struct pci_dev *pdev)
> +{
> +	int nr_entries;
> +
> +	nr_entries = pci_msix_vec_count(pdev);
> +	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
> +		nr_entries = pci_msi_vec_count(pdev);
> +	if (nr_entries <= 0)
> +		nr_entries = 1;
> +	return nr_entries;
> +}

This function is strange, because it:
  (a) does not consider PCI_IRQ_NOMSIX flag;
  (b) only calls pci_msi_supported() for MSI case;
  (c) calls pci_msi_supported() with just one vector;
  (d) might return suboptimal number of vectors (number of MSI-X used 
      later for MSI or vice versa)

Overall, I would suggest simply return maximum between MSI-X and MSI
numbers and let the rest of the code (i.e the two range functions)
handle a-d.

> +static int pci_enable_msix_range_wrapper(struct pci_dev *pdev, u32 *irqs,
> +		unsigned int min_vecs, unsigned int max_vecs)
> +{
> +	struct msix_entry *msix_entries;
> +	int vecs, i;
> +
> +	msix_entries = kcalloc(max_vecs, sizeof(struct msix_entry), GFP_KERNEL);
> +	if (!msix_entries)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < max_vecs; i++)
> +		msix_entries[i].entry = i;
> +
> +	vecs = pci_enable_msix_range(pdev, msix_entries, min_vecs, max_vecs);
> +	if (vecs > 0) {

This condition check is unneeded.

> +		for (i = 0; i < vecs; i++)
> +			irqs[i] = msix_entries[i].vector;
> +	}
> +
> +	kfree(msix_entries);
> +	return vecs;
> +}
> +
> +/**
> + * pci_alloc_irq_vectors - allocate multiple IRQs for a device
> + * @dev:		PCI device to operate on
> + * @min_vecs:		minimum number of vectors required (must be >= 1)
> + * @max_vecs:		maximum (desired) number of vectors
> + * @flags:		flags or quirks for the allocation
> + *
> + * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X or MSI
> + * vectors if available, and fall back to a single legacy vector
> + * if neither is available.  Return the number of vectors allocated,
> + * (which might be smaller than @max_vecs) if successful, or a negative
> + * error code on error.  The Linux irq numbers for the allocated
> + * vectors are stored in pdev->irqs.  If less than @min_vecs interrupt
> + * vectors are available for @dev the function will fail with -ENOSPC.
> + */
> +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> +		unsigned int max_vecs, unsigned int flags)
> +{
> +	unsigned int vecs, i;
> +	u32 *irqs;
> +
> +	max_vecs = min(max_vecs, pci_nr_irq_vectors(dev));

Optionally, you could move this assignment to  pci_nr_irq_vectors() and
simply let it handle number of vectors to request.

> +	irqs = kcalloc(max_vecs, sizeof(u32), GFP_KERNEL);
> +	if (!irqs)
> +		return -ENOMEM;
> +
> +	if (!(flags & PCI_IRQ_NOMSIX)) {
> +		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
> +				max_vecs);
> +		if (vecs > 0)
> +			goto done;
> +	}
> +
> +	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> +	if (vecs > 0) {
> +		for (i = 0; i < vecs; i++)
> +			irqs[i] = dev->irq + i;
> +		goto done;
> +	}
> +
> +	if (min_vecs > 1)
> +		return -ENOSPC;

irqs is leaked if (min_vecs > 1)

You can get rid of this check at all if you reorganize your code i.e.
like this:

	...

	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
	if (vecs < 0)
		goto legacy;

	for (i = 0; i < vecs; i++)
		irqs[i] = dev->irq + i;

done:
	...


legacy:
	...

> +
> +	/* use legacy irq */
> +	kfree(irqs);
> +	dev->irqs = &dev->irq;
> +	return 1;
> +
> +done:
> +	dev->irqs = irqs;
> +	return vecs;
> +}
> +EXPORT_SYMBOL(pci_alloc_irq_vectors);
> +
> +/**
> + * pci_free_irq_vectors - free previously allocated IRQs for a device
> + * @dev:		PCI device to operate on
> + *
> + * Undoes the allocations and enabling in pci_alloc_irq_vectors().
> + */
> +void pci_free_irq_vectors(struct pci_dev *dev)
> +{
> +	if (dev->msix_enabled)
> +		pci_disable_msix(dev);
> +	else if (dev->msi_enabled)
> +		pci_disable_msi(dev);

The checks are probably redundant or incomplete. Redundant - because
pci_disable_msi()/pci_disable_msix() do it anyways:

	if (!pci_msi_enable || !dev || !dev->msi_enabled)
		return;

Incomplete - because the two other conditions are not checked.

> +	if (dev->irqs != &dev->irq)
> +		kfree(dev->irqs);

Unset dev->irqs?

BTW, since (dev->irqs == &dev->irq) effectively checks if MSI/MSI-X
was enabled this function could bail out in case they did not.

> +}
> +EXPORT_SYMBOL(pci_free_irq_vectors);
> +
> +
>  struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
>  {
>  	return to_pci_dev(desc->dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b67e4df..84a20fc 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -320,6 +320,7 @@ struct pci_dev {
>  	 * directly, use the values stored here. They might be different!
>  	 */
>  	unsigned int	irq;
> +	unsigned int	*irqs;
>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
>  
>  	bool match_driver;		/* Skip attaching driver */
> @@ -1237,6 +1238,8 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev, int resno);
>  int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  		      unsigned int command_bits, u32 flags);
>  
> +#define PCI_IRQ_NOMSIX		(1 << 0) /* don't try to use MSI-X interrupts */

BTW, why PCI_IRQ_NOMSIX only and no PCI_IRQ_NOMSI?

>  /* kmem_cache style wrapper around pci_alloc_consistent() */
>  
>  #include <linux/pci-dma.h>
> @@ -1284,6 +1287,9 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  		return rc;
>  	return 0;
>  }
> +int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> +		unsigned int max_vecs, unsigned int flags);
> +void pci_free_irq_vectors(struct pci_dev *dev);
>  #else
>  static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
>  static inline void pci_msi_shutdown(struct pci_dev *dev) { }
> @@ -1307,6 +1313,18 @@ static inline int pci_enable_msix_range(struct pci_dev *dev,
>  static inline int pci_enable_msix_exact(struct pci_dev *dev,
>  		      struct msix_entry *entries, int nvec)
>  { return -ENOSYS; }
> +static inline int pci_alloc_irq_vectors(struct pci_dev *dev,
> +		unsigned int min_vecs, unsigned int max_vecs,
> +		unsigned int flags)
> +{
> +	if (min_vecs > 1)
> +		return -ENOSPC;
> +	dev->irqs = &dev->irq;
> +	return 1;
> +}
> +static inline void pci_free_irq_vectors(struct pci_dev *dev)
> +{

Unset dev->irqs?

> +}
>  #endif
>  
>  #ifdef CONFIG_PCIEPORTBUS
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-14 19:58 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
  2016-06-14 21:54   ` Guilherme G. Piccoli
@ 2016-06-25 20:05   ` Alexander Gordeev
  2016-06-30 17:48     ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-25 20:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:58:59PM +0200, Christoph Hellwig wrote:
> This is lifted from the blk-mq code and adopted to use the affinity mask
> concept just intruced in the irq handling code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/interrupt.h | 11 +++++++++
>  kernel/irq/Makefile       |  1 +
>  kernel/irq/affinity.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 72 insertions(+)
>  create mode 100644 kernel/irq/affinity.c
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 9fcabeb..12003c0 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -278,6 +278,9 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
>  extern int
>  irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
>  
> +int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs);
> +
>  #else /* CONFIG_SMP */
>  
>  static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
> @@ -308,6 +311,14 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
>  {
>  	return 0;
>  }
> +
> +static inline int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs)
> +{
> +	*affinity_mask = NULL;
> +	*nr_vecs = 1;
> +	return 0;
> +}
>  #endif /* CONFIG_SMP */
>  
>  /*
> diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
> index 2ee42e9..1d3ee31 100644
> --- a/kernel/irq/Makefile
> +++ b/kernel/irq/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
>  obj-$(CONFIG_PM_SLEEP) += pm.o
>  obj-$(CONFIG_GENERIC_MSI_IRQ) += msi.o
>  obj-$(CONFIG_GENERIC_IRQ_IPI) += ipi.o
> +obj-$(CONFIG_SMP) += affinity.o
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> new file mode 100644
> index 0000000..1daf8fb
> --- /dev/null
> +++ b/kernel/irq/affinity.c
> @@ -0,0 +1,60 @@
> +
> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/cpu.h>
> +
> +static int get_first_sibling(unsigned int cpu)
> +{
> +	unsigned int ret;
> +
> +	ret = cpumask_first(topology_sibling_cpumask(cpu));
> +	if (ret < nr_cpu_ids)
> +		return ret;
> +	return cpu;
> +}
> +
> +/*
> + * Take a map of online CPUs and the number of available interrupt vectors
> + * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
> + * so that they are distributed as good as possible around the CPUs.  If
> + * more vectors than CPUs are available we'll map one to each CPU,

Unless I do not misinterpret a loop from msix_setup_entries() (patch 08/13),
the above is incorrect:

	for (i = 0; i < nvec; i++) {
		if (dev->irq_affinity) {
			cpu = cpumask_next(cpu, dev->irq_affinity);
			if (cpu >= nr_cpu_ids)
				cpu = cpumask_first(dev->irq_affinity);
			mask = cpumask_of(cpu);
		}

		...

		entry->affinity			= mask;
	}

> + * otherwise we map one to the first sibling of each socket.

(*) I guess, in some topology configurations a total number of all
first siblings may be less than the number of vectors.

> + * If there are more vectors than CPUs we will still only have one bit
> + * set per CPU, but interrupt code will keep on assining the vectors from
> + * the start of the bitmap until we run out of vectors.
> + */
> +int irq_create_affinity_mask(struct cpumask **affinity_mask,
> +		unsigned int *nr_vecs)

Both the callers of this function and the function itself IMHO would
read better if it simply returned the affinity mask. Or passed the 
affinity mask pointer.

> +{
> +	unsigned int vecs = 0;

In case (*nr_vecs >= num_online_cpus()) the contents of *nr_vecs
will be overwritten with 0.

> +	if (*nr_vecs == 1) {
> +		*affinity_mask = NULL;
> +		return 0;
> +	}
> +
> +	*affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +	if (!*affinity_mask)
> +		return -ENOMEM;
> +
> +	if (*nr_vecs >= num_online_cpus()) {
> +		cpumask_copy(*affinity_mask, cpu_online_mask);
> +	} else {
> +		unsigned int cpu;
> +
> +		for_each_online_cpu(cpu) {
> +			if (cpu == get_first_sibling(cpu)) {
> +				cpumask_set_cpu(cpu, *affinity_mask);
> +				vecs++;
> +			}
> +
> +			if (--(*nr_vecs) == 0)
> +				break;
> +		}
> +	}
> +
> +	*nr_vecs = vecs;

So considering (*) comment above the number of available vectors
might be unnecessarily shrunken here.

I think nr_vecs need not be an out-parameter since we always can
assign multiple vectors to a CPU. It is better than limiting number
of available vectors AFAIKT. Or you could pass one-per-cpu flag
explicitly.

> +	return 0;
> +}
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors
  2016-06-14 19:59 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
@ 2016-06-25 20:22   ` Alexander Gordeev
  0 siblings, 0 replies; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-25 20:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:59:01PM +0200, Christoph Hellwig wrote:
> Set the affinity_mask before allocating vectors.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/pci/msi.c   | 26 ++++++++++++++++++++++++--
>  include/linux/pci.h |  1 +
>  2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index a33adec..50d694c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -568,6 +568,7 @@ static struct msi_desc *msi_setup_entry(struct pci_dev *dev, int nvec)
>  	entry->msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
>  	entry->msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
>  	entry->nvec_used		= nvec;
> +	entry->affinity			= dev->irq_affinity;
>  
>  	if (control & PCI_MSI_FLAGS_64BIT)
>  		entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
> @@ -679,10 +680,18 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
>  static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			      struct msix_entry *entries, int nvec)
>  {
> +	const struct cpumask *mask = NULL;
>  	struct msi_desc *entry;
> -	int i;
> +	int cpu = -1, i;
>  
>  	for (i = 0; i < nvec; i++) {
> +		if (dev->irq_affinity) {
> +			cpu = cpumask_next(cpu, dev->irq_affinity);
> +			if (cpu >= nr_cpu_ids)
> +				cpu = cpumask_first(dev->irq_affinity);
> +			mask = cpumask_of(cpu);
> +		}
> +
>  		entry = alloc_msi_entry(&dev->dev);
>  		if (!entry) {
>  			if (!i)
> @@ -699,6 +708,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  		entry->msi_attrib.default_irq	= dev->irq;
>  		entry->mask_base		= base;
>  		entry->nvec_used		= 1;
> +		entry->affinity			= mask;
>  
>  		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
>  	}
> @@ -1176,12 +1186,20 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>  {
>  	unsigned int vecs, i;
>  	u32 *irqs;
> +	int ret;
>  
>  	max_vecs = min(max_vecs, pci_nr_irq_vectors(dev));
>  
> +	ret = irq_create_affinity_mask(&dev->irq_affinity, &max_vecs);

	dev->irq_affinity = irq_create_affinity_mask(&max_vecs); ?


> +	if (ret)
> +		return ret;
> +	if (max_vecs < min_vecs)
> +		return -ENOSPC;

irq_create_affinity_mask() should be called after MSI-X/MSI is enabled,
because we do not know number of vectors before the range functions
returned that number.

Since affinity masks is a function of number of vectors and CPU topology
the resulting masks might turn out suboptimal in general case (and
this code supposed to be general, right?).

I.e irq_create_affinity_mask() could decide "per-first-sibling" spreading
given number of available vectors, but only a subset of MSI vectors
were actually allocated. For that subset "per-core" affinity mask could
have been initialized, but we will still go with "per-first-sibling".

> +	ret = -ENOMEM;
>  	irqs = kcalloc(max_vecs, sizeof(u32), GFP_KERNEL);
>  	if (!irqs)
> -		return -ENOMEM;
> +		goto out_free_affinity;
>  
>  	if (!(flags & PCI_IRQ_NOMSIX)) {
>  		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
> @@ -1208,6 +1226,10 @@ int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>  done:
>  	dev->irqs = irqs;
>  	return vecs;
> +out_free_affinity:
> +	kfree(dev->irq_affinity);
> +	dev->irq_affinity = NULL;
> +	return ret;
>  }
>  EXPORT_SYMBOL(pci_alloc_irq_vectors);
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 84a20fc..f474611 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>  	 */
>  	unsigned int	irq;
>  	unsigned int	*irqs;
> +	struct cpumask	*irq_affinity;
>  	struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */
>  
>  	bool match_driver;		/* Skip attaching driver */
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: automatic interrupt affinity for MSI/MSI-X capable devices V2
  2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2016-06-16  9:45 ` automatic interrupt affinity for MSI/MSI-X capable devices V2 Bart Van Assche
@ 2016-06-26 19:40 ` Alexander Gordeev
  14 siblings, 0 replies; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-26 19:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:58:53PM +0200, Christoph Hellwig wrote:
> This series enhances the irq and PCI code to allow spreading around MSI and
> MSI-X vectors so that they have per-cpu affinity if possible, or at least
> per-node.  For that it takes the algorithm from blk-mq, moves it to
> a common place, and makes it available through a vastly simplified PCI
> interrupt allocation API.  It then switches blk-mq to be able to pick up
> the queue mapping from the device if available, and demonstrates all this
> using the NVMe driver.

Hi Christoph,

One general comment. As result of this series there will be
three locations to store/point to affinities: IRQ descriptor,
MSI descriptor and PCI device descriptor.

IRQ and MSI descriptors merely refer to duplicate masks while
the PCI device mask is the sum of all its MSI interrupts' masks.

Besides, MSI descriptors and PCI device affinity masks are only
used just once - at MSI initialization.

Overall, it looks like some cleanup is possible here.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-06-23 11:16   ` Alexander Gordeev
@ 2016-06-30 16:54     ` Christoph Hellwig
  2016-06-30 17:28       ` Alexander Gordeev
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-30 16:54 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Thu, Jun 23, 2016 at 01:16:10PM +0200, Alexander Gordeev wrote:
> New APIs should be documented in Documentation/PCI/MSI-HOWTO.txt, I guess.

Ok, done.

> > +static unsigned int pci_nr_irq_vectors(struct pci_dev *pdev)
> > +{
> > +	int nr_entries;
> > +
> > +	nr_entries = pci_msix_vec_count(pdev);
> > +	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
> > +		nr_entries = pci_msi_vec_count(pdev);
> > +	if (nr_entries <= 0)
> > +		nr_entries = 1;
> > +	return nr_entries;
> > +}
> 
> This function is strange, because it:
>   (a) does not consider PCI_IRQ_NOMSIX flag;
>   (b) only calls pci_msi_supported() for MSI case;
>   (c) calls pci_msi_supported() with just one vector;
>   (d) might return suboptimal number of vectors (number of MSI-X used 
>       later for MSI or vice versa)
> 
> Overall, I would suggest simply return maximum between MSI-X and MSI
> numbers and let the rest of the code (i.e the two range functions)
> handle a-d.

Ok, fixed except for (c) - the only thing pci_msi_supported does with
nvec is to check for it being less than 1, which we don't care about,
and which really shouldn't be in this function to start with.

> > +	struct msix_entry *msix_entries;
> > +	int vecs, i;
> > +
> > +	msix_entries = kcalloc(max_vecs, sizeof(struct msix_entry), GFP_KERNEL);
> > +	if (!msix_entries)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < max_vecs; i++)
> > +		msix_entries[i].entry = i;
> > +
> > +	vecs = pci_enable_msix_range(pdev, msix_entries, min_vecs, max_vecs);
> > +	if (vecs > 0) {
> 
> This condition check is unneeded.

Why?  We could get -ENOSPC back.  Oh, because our for loop will
terminate immediately.  I can update it, but I think removing it
is less readable than keeping it around.

> > +	if (!(flags & PCI_IRQ_NOMSIX)) {
> > +		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
> > +				max_vecs);
> > +		if (vecs > 0)
> > +			goto done;
> > +	}
> > +
> > +	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> > +	if (vecs > 0) {
> > +		for (i = 0; i < vecs; i++)
> > +			irqs[i] = dev->irq + i;
> > +		goto done;
> > +	}
> > +
> > +	if (min_vecs > 1)
> > +		return -ENOSPC;
> 
> irqs is leaked if (min_vecs > 1)
> 
> You can get rid of this check at all if you reorganize your code i.e.
> like this:
> 
> 	...
> 
> 	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> 	if (vecs < 0)
> 		goto legacy;
> 
> 	for (i = 0; i < vecs; i++)
> 		irqs[i] = dev->irq + i;
> 
> done:
> 	...
> 
> 
> legacy:
> 	...

I've just moved the if below the kfree.

> > +void pci_free_irq_vectors(struct pci_dev *dev)
> > +{
> > +	if (dev->msix_enabled)
> > +		pci_disable_msix(dev);
> > +	else if (dev->msi_enabled)
> > +		pci_disable_msi(dev);
> 
> The checks are probably redundant or incomplete. Redundant - because
> pci_disable_msi()/pci_disable_msix() do it anyways:
> 
> 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
> 		return;
> 
> Incomplete - because the two other conditions are not checked.

Ok, I've dropped the check.

> 
> > +	if (dev->irqs != &dev->irq)
> > +		kfree(dev->irqs);
> 
> Unset dev->irqs?

Fine with me, added.

> > +#define PCI_IRQ_NOMSIX		(1 << 0) /* don't try to use MSI-X interrupts */
> 
> BTW, why PCI_IRQ_NOMSIX only and no PCI_IRQ_NOMSI?

Because there is no need to call this API if your device only supports
a single legacy vector anyway.

> > +	if (min_vecs > 1)
> > +		return -ENOSPC;
> > +	dev->irqs = &dev->irq;
> > +	return 1;
> > +}
> > +static inline void pci_free_irq_vectors(struct pci_dev *dev)
> > +{
> 
> Unset dev->irqs?

Ok.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-06-30 16:54     ` Christoph Hellwig
@ 2016-06-30 17:28       ` Alexander Gordeev
  2016-06-30 17:35         ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Alexander Gordeev @ 2016-06-30 17:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Thu, Jun 30, 2016 at 06:54:17PM +0200, Christoph Hellwig wrote:
> > > +static unsigned int pci_nr_irq_vectors(struct pci_dev *pdev)
> > > +{
> > > +	int nr_entries;
> > > +
> > > +	nr_entries = pci_msix_vec_count(pdev);
> > > +	if (nr_entries <= 0 && pci_msi_supported(pdev, 1))
> > > +		nr_entries = pci_msi_vec_count(pdev);
> > > +	if (nr_entries <= 0)
> > > +		nr_entries = 1;
> > > +	return nr_entries;
> > > +}
> > 
> > This function is strange, because it:
> >   (a) does not consider PCI_IRQ_NOMSIX flag;
> >   (b) only calls pci_msi_supported() for MSI case;
> >   (c) calls pci_msi_supported() with just one vector;
> >   (d) might return suboptimal number of vectors (number of MSI-X used 
> >       later for MSI or vice versa)
> > 
> > Overall, I would suggest simply return maximum between MSI-X and MSI
> > numbers and let the rest of the code (i.e the two range functions)
> > handle a-d.
> 
> Ok, fixed except for (c) - the only thing pci_msi_supported does with
> nvec is to check for it being less than 1, which we don't care about,
> and which really shouldn't be in this function to start with.

Yes, but we should not rely on our knowledge of pci_msi_supported()
internals, aren't we? We need to follow the API which asks nvecs for
whatever reason. Anyway, if you return maximum of the two it does not
matter.

> > > +	struct msix_entry *msix_entries;
> > > +	int vecs, i;
> > > +
> > > +	msix_entries = kcalloc(max_vecs, sizeof(struct msix_entry), GFP_KERNEL);
> > > +	if (!msix_entries)
> > > +		return -ENOMEM;
> > > +
> > > +	for (i = 0; i < max_vecs; i++)
> > > +		msix_entries[i].entry = i;
> > > +
> > > +	vecs = pci_enable_msix_range(pdev, msix_entries, min_vecs, max_vecs);
> > > +	if (vecs > 0) {
> > 
> > This condition check is unneeded.
> 
> Why?  We could get -ENOSPC back.  Oh, because our for loop will
> terminate immediately.  I can update it, but I think removing it
> is less readable than keeping it around.

Yes, I think you are right.

> > > +	if (!(flags & PCI_IRQ_NOMSIX)) {
> > > +		vecs = pci_enable_msix_range_wrapper(dev, irqs, min_vecs,
> > > +				max_vecs);
> > > +		if (vecs > 0)
> > > +			goto done;
> > > +	}
> > > +
> > > +	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> > > +	if (vecs > 0) {
> > > +		for (i = 0; i < vecs; i++)
> > > +			irqs[i] = dev->irq + i;
> > > +		goto done;
> > > +	}
> > > +
> > > +	if (min_vecs > 1)
> > > +		return -ENOSPC;
> > 
> > irqs is leaked if (min_vecs > 1)
> > 
> > You can get rid of this check at all if you reorganize your code i.e.
> > like this:
> > 
> > 	...
> > 
> > 	vecs = pci_enable_msi_range(dev, min_vecs, max_vecs);
> > 	if (vecs < 0)
> > 		goto legacy;
> > 
> > 	for (i = 0; i < vecs; i++)
> > 		irqs[i] = dev->irq + i;
> > 
> > done:
> > 	...
> > 
> > 
> > legacy:
> > 	...
> 
> I've just moved the if below the kfree.

I think I need to look at the updated version :)

> > > +#define PCI_IRQ_NOMSIX		(1 << 0) /* don't try to use MSI-X interrupts */
> > 
> > BTW, why PCI_IRQ_NOMSIX only and no PCI_IRQ_NOMSI?
> 
> Because there is no need to call this API if your device only supports
> a single legacy vector anyway.

What if a device reports (up to 32) MSIs and MSI-X allocation failed? The
driver might prefer the legacy single (i.e. due to errata in MSI), but
there is no flag to ask for it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines
  2016-06-30 17:28       ` Alexander Gordeev
@ 2016-06-30 17:35         ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-30 17:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Thu, Jun 30, 2016 at 07:28:09PM +0200, Alexander Gordeev wrote:
> I think I need to look at the updated version :)

I've astarted processing the comments for the whole series and plan
to post an update tomorrow.

> > > > +#define PCI_IRQ_NOMSIX		(1 << 0) /* don't try to use MSI-X interrupts */
> > > 
> > > BTW, why PCI_IRQ_NOMSIX only and no PCI_IRQ_NOMSI?
> > 
> > Because there is no need to call this API if your device only supports
> > a single legacy vector anyway.
> 
> What if a device reports (up to 32) MSIs and MSI-X allocation failed? The
> driver might prefer the legacy single (i.e. due to errata in MSI), but
> there is no flag to ask for it.

Ok, I'll add it for now.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-25 20:05   ` Alexander Gordeev
@ 2016-06-30 17:48     ` Christoph Hellwig
  2016-07-01  7:25       ` Alexander Gordeev
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-06-30 17:48 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Sat, Jun 25, 2016 at 10:05:19PM +0200, Alexander Gordeev wrote:
> > + * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
> > + * so that they are distributed as good as possible around the CPUs.  If
> > + * more vectors than CPUs are available we'll map one to each CPU,
> 
> Unless I do not misinterpret a loop from msix_setup_entries() (patch 08/13),
> the above is incorrect:

What part do you think is incorrect?

> > + * otherwise we map one to the first sibling of each socket.
> 
> (*) I guess, in some topology configurations a total number of all
> first siblings may be less than the number of vectors.

Yes, in that case we'll assign imcompetely.  I've already heard people
complaining about that at LSF/MM, but no one volunteered patches.
I only have devices with 1 or enough vectores to test, so I don't
really dare to touch the algorithm.  Either way the algorithm
change should probably be a different patch than refactoring it and
moving it around.

> > + * If there are more vectors than CPUs we will still only have one bit
> > + * set per CPU, but interrupt code will keep on assining the vectors from
> > + * the start of the bitmap until we run out of vectors.
> > + */
> > +int irq_create_affinity_mask(struct cpumask **affinity_mask,
> > +		unsigned int *nr_vecs)
> 
> Both the callers of this function and the function itself IMHO would
> read better if it simply returned the affinity mask. Or passed the 
> affinity mask pointer.

We can't just return the pointer as NULL is a valid and common return
value.  If we pass the pointer we'd then also need to allocate one for
the (common) nvec = 1 case.

> 
> > +{
> > +	unsigned int vecs = 0;
> 
> In case (*nr_vecs >= num_online_cpus()) the contents of *nr_vecs
> will be overwritten with 0.

Thanks, fixed.

> So considering (*) comment above the number of available vectors
> might be unnecessarily shrunken here.
> 
> I think nr_vecs need not be an out-parameter since we always can
> assign multiple vectors to a CPU. It is better than limiting number
> of available vectors AFAIKT. Or you could pass one-per-cpu flag
> explicitly.

The function is intended to replicate the blk-mq algorithm.  I don't
think it's optimal, but I really want to avoid dragging the discussion
about the optimal algorithm into this patchset.  We should at least
move to a vector per node/socket model instead of just the siblings,
and be able to use all vectors (at least optionally).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors
  2016-06-30 17:48     ` Christoph Hellwig
@ 2016-07-01  7:25       ` Alexander Gordeev
  0 siblings, 0 replies; 54+ messages in thread
From: Alexander Gordeev @ 2016-07-01  7:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Thu, Jun 30, 2016 at 07:48:54PM +0200, Christoph Hellwig wrote:
> On Sat, Jun 25, 2016 at 10:05:19PM +0200, Alexander Gordeev wrote:
> > > + * and generate an output cpumask suitable for spreading MSI/MSI-X vectors
> > > + * so that they are distributed as good as possible around the CPUs.  If
> > > + * more vectors than CPUs are available we'll map one to each CPU,
> > 
> > Unless I do not misinterpret a loop from msix_setup_entries() (patch 08/13),
> > the above is incorrect:
> 
> What part do you think is incorrect?

With your explanations below and no immediate intention to fix the
algorithm it is correct.

> > (*) I guess, in some topology configurations a total number of all
> > first siblings may be less than the number of vectors.
> 
> Yes, in that case we'll assign imcompetely.  I've already heard people
> complaining about that at LSF/MM, but no one volunteered patches.
> I only have devices with 1 or enough vectores to test, so I don't
> really dare to touch the algorithm.  Either way the algorithm
> change should probably be a different patch than refactoring it and
> moving it around.

I see your approach now. Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
@ 2016-07-04  8:15   ` Alexander Gordeev
  2016-07-04  8:38     ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Alexander Gordeev @ 2016-07-04  8:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> +		const struct cpumask *affinity_mask)
> +{
> +	int queue = -1, cpu = 0;
> +
> +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> +			GFP_KERNEL, set->numa_node);
> +	if (!set->mq_map)
> +		return -ENOMEM;
> +
> +	if (!affinity_mask)
> +		return 0;	/* map all cpus to queue 0 */
> +
> +	/* If cpus are offline, map them to first hctx */
> +	for_each_online_cpu(cpu) {
> +		if (cpumask_test_cpu(cpu, affinity_mask))
> +			queue++;

CPUs missing in an affinity mask are mapped to hctxs. Is that intended?

> +		if (queue > 0)

Why this check?

> +			set->mq_map[cpu] = queue;
> +	}
> +
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  8:15   ` Alexander Gordeev
@ 2016-07-04  8:38     ` Christoph Hellwig
  2016-07-04  9:35       ` Alexander Gordeev
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-07-04  8:38 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 10:15:41AM +0200, Alexander Gordeev wrote:
> On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> > +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> > +		const struct cpumask *affinity_mask)
> > +{
> > +	int queue = -1, cpu = 0;
> > +
> > +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> > +			GFP_KERNEL, set->numa_node);
> > +	if (!set->mq_map)
> > +		return -ENOMEM;
> > +
> > +	if (!affinity_mask)
> > +		return 0;	/* map all cpus to queue 0 */
> > +
> > +	/* If cpus are offline, map them to first hctx */
> > +	for_each_online_cpu(cpu) {
> > +		if (cpumask_test_cpu(cpu, affinity_mask))
> > +			queue++;
> 
> CPUs missing in an affinity mask are mapped to hctxs. Is that intended?

Yes - each CPU needs to be mapped to some hctx, otherwise we can't
submit I/O from that CPU.

> > +		if (queue > 0)
> 
> Why this check?
> 
> > +			set->mq_map[cpu] = queue;

mq_map is initialized to zero already, so we don't really need the
assignment for queue 0.  The reason why this check exists is because
we start with queue = -1 and we never want to assignment -1 to mq_map.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  8:38     ` Christoph Hellwig
@ 2016-07-04  9:35       ` Alexander Gordeev
  2016-07-10  3:41         ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Alexander Gordeev @ 2016-07-04  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 10:38:49AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 04, 2016 at 10:15:41AM +0200, Alexander Gordeev wrote:
> > On Tue, Jun 14, 2016 at 09:59:04PM +0200, Christoph Hellwig wrote:
> > > +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> > > +		const struct cpumask *affinity_mask)
> > > +{
> > > +	int queue = -1, cpu = 0;
> > > +
> > > +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> > > +			GFP_KERNEL, set->numa_node);
> > > +	if (!set->mq_map)
> > > +		return -ENOMEM;
> > > +
> > > +	if (!affinity_mask)
> > > +		return 0;	/* map all cpus to queue 0 */
> > > +
> > > +	/* If cpus are offline, map them to first hctx */
> > > +	for_each_online_cpu(cpu) {
> > > +		if (cpumask_test_cpu(cpu, affinity_mask))
> > > +			queue++;
> > 
> > CPUs missing in an affinity mask are mapped to hctxs. Is that intended?
> 
> Yes - each CPU needs to be mapped to some hctx, otherwise we can't
> submit I/O from that CPU.
> 
> > > +		if (queue > 0)
> > 
> > Why this check?
> > 
> > > +			set->mq_map[cpu] = queue;
> 
> mq_map is initialized to zero already, so we don't really need the
> assignment for queue 0.  The reason why this check exists is because
> we start with queue = -1 and we never want to assignment -1 to mq_map.

Would this read better then?

	int queue = 0;

	...

	/* If cpus are offline, map them to first hctx */
	for_each_online_cpu(cpu) {
		set->mq_map[cpu] = queue;
		if (cpumask_test_cpu(cpu, affinity_mask))
			queue++;
	}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-04  9:35       ` Alexander Gordeev
@ 2016-07-10  3:41         ` Christoph Hellwig
  2016-07-12  6:42           ` Alexander Gordeev
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2016-07-10  3:41 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Christoph Hellwig, tglx, axboe, linux-block, linux-pci,
	linux-nvme, linux-kernel

On Mon, Jul 04, 2016 at 11:35:28AM +0200, Alexander Gordeev wrote:
> > mq_map is initialized to zero already, so we don't really need the
> > assignment for queue 0.  The reason why this check exists is because
> > we start with queue = -1 and we never want to assignment -1 to mq_map.
> 
> Would this read better then?
> 
> 	int queue = 0;
> 
> 	...
> 
> 	/* If cpus are offline, map them to first hctx */
> 	for_each_online_cpu(cpu) {
> 		set->mq_map[cpu] = queue;
> 		if (cpumask_test_cpu(cpu, affinity_mask))
> 			queue++;

It would read better, but I don't think it's actually correct.
We'd still assign the 'old' queue to the cpu that is set in the affinity
mask.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask
  2016-07-10  3:41         ` Christoph Hellwig
@ 2016-07-12  6:42           ` Alexander Gordeev
  0 siblings, 0 replies; 54+ messages in thread
From: Alexander Gordeev @ 2016-07-12  6:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tglx, axboe, linux-block, linux-pci, linux-nvme, linux-kernel

On Sun, Jul 10, 2016 at 05:41:44AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 04, 2016 at 11:35:28AM +0200, Alexander Gordeev wrote:
> > > mq_map is initialized to zero already, so we don't really need the
> > > assignment for queue 0.  The reason why this check exists is because
> > > we start with queue = -1 and we never want to assignment -1 to mq_map.
> > 
> > Would this read better then?
> > 
> > 	int queue = 0;
> > 
> > 	...
> > 
> > 	/* If cpus are offline, map them to first hctx */
> > 	for_each_online_cpu(cpu) {
> > 		set->mq_map[cpu] = queue;
> > 		if (cpumask_test_cpu(cpu, affinity_mask))
> > 			queue++;
> 
> It would read better, but I don't think it's actually correct.
> We'd still assign the 'old' queue to the cpu that is set in the affinity
> mask.

To be honest, I fail to see a functional difference, but it is just
a nit anyway.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2016-07-12  6:37 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-14 19:58 automatic interrupt affinity for MSI/MSI-X capable devices V2 Christoph Hellwig
2016-06-14 19:58 ` [PATCH 01/13] irq/msi: Remove unused MSI_FLAG_IDENTITY_MAP Christoph Hellwig
2016-06-16  9:05   ` Bart Van Assche
2016-06-14 19:58 ` [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag Christoph Hellwig
2016-06-15  8:44   ` Bart Van Assche
2016-06-15 10:23     ` Christoph Hellwig
2016-06-15 10:42       ` Bart Van Assche
2016-06-15 15:14         ` Keith Busch
2016-06-15 15:28           ` Bart Van Assche
2016-06-15 16:03             ` Keith Busch
2016-06-15 19:36               ` Bart Van Assche
2016-06-15 20:06                 ` Keith Busch
2016-06-15 20:12                   ` Keith Busch
2016-06-15 20:50                     ` Bart Van Assche
2016-06-16 15:19                       ` Keith Busch
2016-06-22 11:56                         ` Alexander Gordeev
2016-06-16 15:20                 ` Christoph Hellwig
2016-06-16 15:39                   ` Bart Van Assche
2016-06-20 12:22                     ` Christoph Hellwig
2016-06-20 13:21                       ` Bart Van Assche
2016-06-21 14:31                         ` Christoph Hellwig
2016-06-16  9:08   ` Bart Van Assche
2016-06-14 19:58 ` [PATCH 03/13] irq: Add affinity hint to irq allocation Christoph Hellwig
2016-06-14 19:58 ` [PATCH 04/13] irq: Use affinity hint in irqdesc allocation Christoph Hellwig
2016-06-14 19:58 ` [PATCH 05/13] irq/msi: Make use of affinity aware allocations Christoph Hellwig
2016-06-14 19:58 ` [PATCH 06/13] irq: add a helper spread an affinity mask for MSI/MSI-X vectors Christoph Hellwig
2016-06-14 21:54   ` Guilherme G. Piccoli
2016-06-15  8:35     ` Bart Van Assche
2016-06-15 10:10     ` Christoph Hellwig
2016-06-15 13:09       ` Guilherme G. Piccoli
2016-06-16 15:16         ` Christoph Hellwig
2016-06-25 20:05   ` Alexander Gordeev
2016-06-30 17:48     ` Christoph Hellwig
2016-07-01  7:25       ` Alexander Gordeev
2016-06-14 19:59 ` [PATCH 07/13] pci: Provide sensible irq vector alloc/free routines Christoph Hellwig
2016-06-23 11:16   ` Alexander Gordeev
2016-06-30 16:54     ` Christoph Hellwig
2016-06-30 17:28       ` Alexander Gordeev
2016-06-30 17:35         ` Christoph Hellwig
2016-06-14 19:59 ` [PATCH 08/13] pci: spread interrupt vectors in pci_alloc_irq_vectors Christoph Hellwig
2016-06-25 20:22   ` Alexander Gordeev
2016-06-14 19:59 ` [PATCH 09/13] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
2016-06-14 19:59 ` [PATCH 10/13] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
2016-06-14 19:59 ` [PATCH 11/13] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
2016-07-04  8:15   ` Alexander Gordeev
2016-07-04  8:38     ` Christoph Hellwig
2016-07-04  9:35       ` Alexander Gordeev
2016-07-10  3:41         ` Christoph Hellwig
2016-07-12  6:42           ` Alexander Gordeev
2016-06-14 19:59 ` [PATCH 12/13] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
2016-06-14 19:59 ` [PATCH 13/13] nvme: remove the post_scan callout Christoph Hellwig
2016-06-16  9:45 ` automatic interrupt affinity for MSI/MSI-X capable devices V2 Bart Van Assche
2016-06-16 15:22   ` Christoph Hellwig
2016-06-26 19:40 ` Alexander Gordeev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).