linux-kernel.vger.kernel.org archive mirror
* spread MSI(-X) vectors to all possible CPUs V2
@ 2017-06-03 14:03 Christoph Hellwig
  2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:03 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Hi all,

this series changes our automatic MSI-X vector assignment so that it
takes all present CPUs into account instead of all online ones.  This
allows us to better deal with CPU hotplug events, which can happen
frequently, for example due to power management.
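
For most drivers the interface does not change: a managed spread is
still requested through the usual PCI helpers.  A minimal sketch of
that call pattern (pdev and max_vecs stand in for the driver's own
device and vector count):

	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. an admin/config interrupt */
	};
	int nr_vecs;

	/*
	 * The PCI_IRQ_AFFINITY vectors get a managed spread from the
	 * irq core -- after this series over all present CPUs instead
	 * of only the online ones.
	 */
	nr_vecs = pci_alloc_irq_vectors_affinity(pdev, 1, max_vecs,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
	if (nr_vecs < 0)
		return nr_vecs;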

Changes since V1:
 - rebase to current Linus' tree
 - add irq_lock_sparse calls
 - move memory allocations outside of (raw) spinlocks
 - make the possible cpus per node mask safe vs physical CPU hotplug
 - remove the irq_force_complete_move call
 - factor some common code into helpers
 - indentation fixups

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
@ 2017-06-03 14:03 ` Christoph Hellwig
  2017-06-04 15:14   ` Sagi Grimberg
  2017-06-17 23:21   ` Thomas Gleixner
  2017-06-03 14:03 ` [PATCH 2/8] genirq: move pending helpers to internal.h Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:03 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

This will allow us to spread MSI/MSI-X affinity over all present CPUs and
thus better deal with systems where CPUs are taken on- and offline all the
time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 kernel/irq/manage.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 070be980c37a..5c25d4a5dc46 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -361,17 +361,17 @@ static int setup_affinity(struct irq_desc *desc, struct cpumask *mask)
 	if (irqd_affinity_is_managed(&desc->irq_data) ||
 	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_common_data.affinity,
-				       cpu_online_mask))
+				       cpu_present_mask))
 			set = desc->irq_common_data.affinity;
 		else
 			irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
 	}
 
-	cpumask_and(mask, cpu_online_mask, set);
+	cpumask_and(mask, cpu_present_mask, set);
 	if (node != NUMA_NO_NODE) {
 		const struct cpumask *nodemask = cpumask_of_node(node);
 
-		/* make sure at least one of the cpus in nodemask is online */
+		/* make sure at least one of the cpus in nodemask is present */
 		if (cpumask_intersects(mask, nodemask))
 			cpumask_and(mask, mask, nodemask);
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/8] genirq: move pending helpers to internal.h
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
  2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
@ 2017-06-03 14:03 ` Christoph Hellwig
  2017-06-04 15:15   ` Sagi Grimberg
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:03 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

So that the affinity code can reuse them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 kernel/irq/internals.h | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/irq/manage.c    | 28 ----------------------------
 2 files changed, 38 insertions(+), 28 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index bc226e783bd2..b81f6ce73a68 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -226,3 +226,41 @@ irq_pm_install_action(struct irq_desc *desc, struct irqaction *action) { }
 static inline void
 irq_pm_remove_action(struct irq_desc *desc, struct irqaction *action) { }
 #endif
+
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+static inline bool irq_can_move_pcntxt(struct irq_data *data)
+{
+	return irqd_can_move_in_process_context(data);
+}
+static inline bool irq_move_pending(struct irq_data *data)
+{
+	return irqd_is_setaffinity_pending(data);
+}
+static inline void
+irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
+{
+	cpumask_copy(desc->pending_mask, mask);
+}
+static inline void
+irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
+{
+	cpumask_copy(mask, desc->pending_mask);
+}
+#else /* CONFIG_GENERIC_PENDING_IRQ */
+static inline bool irq_can_move_pcntxt(struct irq_data *data)
+{
+	return true;
+}
+static inline bool irq_move_pending(struct irq_data *data)
+{
+	return false;
+}
+static inline void
+irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
+{
+}
+static inline void
+irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
+{
+}
+#endif /* CONFIG_GENERIC_PENDING_IRQ */
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5c25d4a5dc46..5fa334e5c046 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -168,34 +168,6 @@ void irq_set_thread_affinity(struct irq_desc *desc)
 			set_bit(IRQTF_AFFINITY, &action->thread_flags);
 }
 
-#ifdef CONFIG_GENERIC_PENDING_IRQ
-static inline bool irq_can_move_pcntxt(struct irq_data *data)
-{
-	return irqd_can_move_in_process_context(data);
-}
-static inline bool irq_move_pending(struct irq_data *data)
-{
-	return irqd_is_setaffinity_pending(data);
-}
-static inline void
-irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
-{
-	cpumask_copy(desc->pending_mask, mask);
-}
-static inline void
-irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
-{
-	cpumask_copy(mask, desc->pending_mask);
-}
-#else
-static inline bool irq_can_move_pcntxt(struct irq_data *data) { return true; }
-static inline bool irq_move_pending(struct irq_data *data) { return false; }
-static inline void
-irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask) { }
-static inline void
-irq_get_pending(struct cpumask *mask, struct irq_desc *desc) { }
-#endif
-
 int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 			bool force)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
  2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
  2017-06-03 14:03 ` [PATCH 2/8] genirq: move pending helpers to internal.h Christoph Hellwig
@ 2017-06-03 14:03 ` Christoph Hellwig
  2017-06-04 15:15   ` Sagi Grimberg
                     ` (3 more replies)
  2017-06-03 14:03 ` [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 4 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:03 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Factor the code that programs the affinity of a vector on a CPU hotplug /
hot-unplug event out of the x86 CPU hotplug code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/kernel/irq.c     | 23 ++---------------------
 include/linux/interrupt.h |  1 +
 kernel/irq/affinity.c     | 28 ++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f34fe7444836..a54eac5d81b3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -437,7 +437,6 @@ void fixup_irqs(void)
 	struct irq_desc *desc;
 	struct irq_data *data;
 	struct irq_chip *chip;
-	int ret;
 
 	for_each_irq_desc(irq, desc) {
 		int break_affinity = 0;
@@ -482,26 +481,8 @@ void fixup_irqs(void)
 			continue;
 		}
 
-		if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
-			chip->irq_mask(data);
-
-		if (chip->irq_set_affinity) {
-			ret = chip->irq_set_affinity(data, affinity, true);
-			if (ret == -ENOSPC)
-				pr_crit("IRQ %d set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
-		} else {
-			if (!(warned++))
-				set_affinity = 0;
-		}
-
-		/*
-		 * We unmask if the irq was not marked masked by the
-		 * core code. That respects the lazy irq disable
-		 * behaviour.
-		 */
-		if (!irqd_can_move_in_process_context(data) &&
-		    !irqd_irq_masked(data) && chip->irq_unmask)
-			chip->irq_unmask(data);
+		if (!irq_affinity_set(irq, desc, affinity) && !warned++)
+			set_affinity = 0;
 
 		raw_spin_unlock(&desc->lock);
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a6fba4804672..afd3aa33e9b0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -292,6 +292,7 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
 struct cpumask *irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
 int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd);
+bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask);
 
 #else /* CONFIG_SMP */
 
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e2d356dd7581..3cec0042fad2 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,36 @@
 
 #include <linux/interrupt.h>
+#include <linux/irq.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include "internals.h"
+
+bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
+{
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	bool ret = false;
+
+	if (!irq_can_move_pcntxt(data) && chip->irq_mask)
+		chip->irq_mask(data);
+
+	if (chip->irq_set_affinity) {
+		if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
+			pr_crit("IRQ %d set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
+		ret = true;
+	}
+
+	/*
+	 * We unmask if the irq was not marked masked by the core code.
+	 * That respects the lazy irq disable behaviour.
+	 */
+	if (!irq_can_move_pcntxt(data) &&
+	    !irqd_irq_masked(data) && chip->irq_unmask)
+		chip->irq_unmask(data);
+
+	return ret;
+}
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				int cpus_per_vec)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
@ 2017-06-03 14:03 ` Christoph Hellwig
  2017-06-04 15:17   ` Sagi Grimberg
  2017-06-22 17:10   ` [tip:irq/core] genirq/affinity: Assign " tip-bot for Christoph Hellwig
  2017-06-03 14:04 ` [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:03 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Currently we only assign spread vectors to online CPUs, which ties the
IRQ mapping to the set of CPUs online at that time and doesn't deal
nicely with the fact that CPUs could come and go rapidly due to e.g.
power management.

Instead assign vectors to all present CPUs to avoid this churn.

For this we have to build a map of all possible CPUs for a given node, as
the architectures only provide a map of all online CPUs.  We do this
dynamically on each call to the vector assignment code, which is a bit
suboptimal and could be optimized in the future by providing the mapping
from the arch code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 kernel/irq/affinity.c | 71 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 3cec0042fad2..337e6ffba93f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -63,13 +63,54 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 	}
 }
 
-static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t *nodemsk)
+static cpumask_var_t *alloc_node_to_present_cpumask(void)
+{
+	int node;
+	cpumask_var_t *masks;
+
+	masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+	if (!masks)
+		return NULL;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
+			goto out_unwind;
+	}
+
+	return masks;
+
+out_unwind:
+	while (--node >= 0)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+	return NULL;
+}
+
+static void free_node_to_present_cpumask(cpumask_var_t *masks)
+{
+	int node;
+
+	for (node = 0; node < nr_node_ids; node++)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+}
+
+static void build_node_to_present_cpumask(cpumask_var_t *masks)
+{
+	int cpu;
+
+	for_each_present_cpu(cpu)
+		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
+}
+
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+				const struct cpumask *mask, nodemask_t *nodemsk)
 {
 	int n, nodes = 0;
 
 	/* Calculate the number of nodes in the supplied affinity mask */
-	for_each_online_node(n) {
-		if (cpumask_intersects(mask, cpumask_of_node(n))) {
+	for_each_node(n) {
+		if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
 			node_set(n, *nodemsk);
 			nodes++;
 		}
@@ -92,7 +133,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int last_affv = affv + affd->pre_vectors;
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct cpumask *masks;
-	cpumask_var_t nmsk;
+	cpumask_var_t nmsk, *node_to_present_cpumask;
 
 	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
 		return NULL;
@@ -101,13 +142,19 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (!masks)
 		goto out;
 
+	node_to_present_cpumask = alloc_node_to_present_cpumask();
+	if (!node_to_present_cpumask)
+		goto out;
+
 	/* Fill out vectors at the beginning that don't need affinity */
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(masks + curvec, irq_default_affinity);
 
 	/* Stabilize the cpumasks */
 	get_online_cpus();
-	nodes = get_nodes_in_cpumask(cpu_online_mask, &nodemsk);
+	build_node_to_present_cpumask(node_to_present_cpumask);
+	nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
+			&nodemsk);
 
 	/*
 	 * If the number of nodes in the mask is greater than or equal the
@@ -115,7 +162,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	 */
 	if (affv <= nodes) {
 		for_each_node_mask(n, nodemsk) {
-			cpumask_copy(masks + curvec, cpumask_of_node(n));
+			cpumask_copy(masks + curvec,
+				     node_to_present_cpumask[n]);
 			if (++curvec == last_affv)
 				break;
 		}
@@ -129,7 +177,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
 		/* Get the cpus on this node which are in the mask */
-		cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
+		cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
 
 		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
@@ -161,6 +209,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	/* Fill out vectors at the end that don't need affinity */
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(masks + curvec, irq_default_affinity);
+	free_node_to_present_cpumask(node_to_present_cpumask);
 out:
 	free_cpumask_var(nmsk);
 	return masks;
@@ -175,12 +224,6 @@ int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd)
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
-	int cpus;
-
-	/* Stabilize the cpumasks */
-	get_online_cpus();
-	cpus = cpumask_weight(cpu_online_mask);
-	put_online_cpus();
 
-	return min(cpus, vecs) + resv;
+	return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2017-06-03 14:03 ` [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs Christoph Hellwig
@ 2017-06-03 14:04 ` Christoph Hellwig
  2017-06-16 10:26   ` Thomas Gleixner
  2017-06-16 10:29   ` Thomas Gleixner
  2017-06-03 14:04 ` [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:04 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Remove a CPU from the affinity mask when it goes offline and add it
back when it returns.  In case the vector was assigned only to the CPU
going offline, it will be shut down and restarted when the CPU
reappears.
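
The online/offline callbacks below are wired up as a static entry in
cpuhp_ap_states[]; for readers less familiar with the hotplug state
machine, the equivalent dynamic registration would look roughly like
this sketch (not part of the patch):

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq/affinity:online",
				irq_affinity_online_cpu,
				irq_affinity_offline_cpu);

The startup callback runs when each CPU comes online and the teardown
callback when it goes offline, which is exactly when the managed
affinity masks need to be adjusted.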

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/kernel/irq.c      |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h        |   9 ++++
 kernel/cpu.c               |   6 +++
 kernel/irq/affinity.c      | 127 ++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index a54eac5d81b3..72c35ed534f1 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -453,7 +453,8 @@ void fixup_irqs(void)
 
 		data = irq_desc_get_irq_data(desc);
 		affinity = irq_data_get_affinity_mask(data);
-		if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+		if (irqd_affinity_is_managed(data) ||
+		    !irq_has_action(irq) || irqd_is_per_cpu(data) ||
 		    cpumask_subset(affinity, cpu_online_mask)) {
 			raw_spin_unlock(&desc->lock);
 			continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0f2a80377520..c15f22c54535 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,6 +124,7 @@ enum cpuhp_state {
 	CPUHP_AP_ONLINE_IDLE,
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
+	CPUHP_AP_IRQ_AFFINITY_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
 	CPUHP_AP_PERF_X86_ONLINE,
 	CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f887351aa80e..ae15b8582685 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -216,6 +216,7 @@ enum {
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
 	IRQD_AFFINITY_MANAGED		= (1 << 21),
+	IRQD_AFFINITY_SUSPENDED		= (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d)
 	__irqd_to_state(d) &= ~IRQD_ACTIVATED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..ef0c5b63ca0d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -27,6 +27,7 @@
 #include <linux/smpboot.h>
 #include <linux/relay.h>
 #include <linux/slab.h>
+#include <linux/irq.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
 		.startup.single		= smpboot_unpark_threads,
 		.teardown.single	= NULL,
 	},
+	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
+		.name			= "irq/affinity:online",
+		.startup.single		= irq_affinity_online_cpu,
+		.teardown.single	= irq_affinity_offline_cpu,
+	},
 	[CPUHP_AP_PERF_ONLINE] = {
 		.name			= "perf:online",
 		.startup.single		= perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 337e6ffba93f..e27ecfb4866f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,4 +1,7 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include <linux/interrupt.h>
 #include <linux/irq.h>
 #include <linux/kernel.h>
@@ -227,3 +230,125 @@ int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd)
 
 	return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
+
+static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
+				    unsigned int cpu)
+{
+	const struct cpumask *affinity;
+	struct irq_data *data;
+	struct irq_chip *chip;
+	unsigned long flags;
+	cpumask_var_t mask;
+
+	if (!desc)
+		return;
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	data = irq_desc_get_irq_data(desc);
+	affinity = irq_data_get_affinity_mask(data);
+	if (!irqd_affinity_is_managed(data) ||
+	    !irq_has_action(irq) ||
+	    !cpumask_test_cpu(cpu, affinity))
+		goto out_free_cpumask;
+
+	/*
+	 * The interrupt descriptor might have been cleaned up
+	 * already, but it is not yet removed from the radix tree
+	 */
+	chip = irq_data_get_irq_chip(data);
+	if (!chip)
+		goto out_free_cpumask;
+
+	if (WARN_ON_ONCE(!chip->irq_set_affinity))
+		goto out_free_cpumask;
+
+	cpumask_and(mask, affinity, cpu_online_mask);
+	cpumask_set_cpu(cpu, mask);
+	if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
+		irq_startup(desc, false);
+		irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
+	} else {
+		irq_affinity_set(irq, desc, mask);
+	}
+
+out_free_cpumask:
+	free_cpumask_var(mask);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int irq_affinity_online_cpu(unsigned int cpu)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+
+	irq_lock_sparse();
+	for_each_irq_desc(irq, desc)
+		irq_affinity_online_irq(irq, desc, cpu);
+	irq_unlock_sparse();
+	return 0;
+}
+
+static void irq_affinity_offline_irq(unsigned int irq, struct irq_desc *desc,
+				     unsigned int cpu)
+{
+	const struct cpumask *affinity;
+	struct irq_data *data;
+	struct irq_chip *chip;
+	unsigned long flags;
+	cpumask_var_t mask;
+
+	if (!desc)
+		return;
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	data = irq_desc_get_irq_data(desc);
+	affinity = irq_data_get_affinity_mask(data);
+	if (!irqd_affinity_is_managed(data) ||
+	    !irq_has_action(irq) ||
+	    irqd_has_set(data, IRQD_AFFINITY_SUSPENDED) ||
+	    !cpumask_test_cpu(cpu, affinity))
+		goto out_free_cpumask;
+
+	/*
+	 * The interrupt descriptor might have been cleaned up
+	 * already, but it is not yet removed from the radix tree
+	 */
+	chip = irq_data_get_irq_chip(data);
+	if (!chip)
+		goto out_free_cpumask;
+
+	if (WARN_ON_ONCE(!chip->irq_set_affinity))
+		goto out_free_cpumask;
+
+
+	cpumask_copy(mask, affinity);
+	cpumask_clear_cpu(cpu, mask);
+	if (cpumask_empty(mask)) {
+		irqd_set(data, IRQD_AFFINITY_SUSPENDED);
+		irq_shutdown(desc);
+	} else {
+		irq_affinity_set(irq, desc, mask);
+	}
+
+out_free_cpumask:
+	free_cpumask_var(mask);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int irq_affinity_offline_cpu(unsigned int cpu)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+
+	irq_lock_sparse();
+	for_each_irq_desc(irq, desc)
+		irq_affinity_offline_irq(irq, desc, cpu);
+	irq_unlock_sparse();
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2017-06-03 14:04 ` [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events Christoph Hellwig
@ 2017-06-03 14:04 ` Christoph Hellwig
  2017-06-04 15:11   ` Sagi Grimberg
  2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:04 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

This way we get a nice distribution independent of the current cpu
online / offline state.
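
For context, the map built here is what blk_mq_map_queue() consults on
every request submission; a simplified sketch of that existing lookup
(not part of this patch, based on the blk-mq code of this version):

	static inline struct blk_mq_hw_ctx *blk_mq_map_queue(
			struct request_queue *q, int cpu)
	{
		/* mq_map[] is the per-cpu map built by blk_mq_map_queues() */
		return q->queue_hw_ctx[q->mq_map[cpu]];
	}

With the old online-based map a present but offline CPU was simply
mapped to the first hctx (map[i] = 0 below).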

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-cpumap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8e61e8640e17..5eaecd40f701 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -35,7 +35,6 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	const struct cpumask *online_mask = cpu_online_mask;
 	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
 	cpumask_var_t cpus;
 
@@ -44,7 +43,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 
 	cpumask_clear(cpus);
 	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
+	for_each_present_cpu(i) {
 		nr_cpus++;
 		first_sibling = get_first_sibling(i);
 		if (!cpumask_test_cpu(first_sibling, cpus))
@@ -54,7 +53,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 
 	queue = 0;
 	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
+		if (!cpumask_test_cpu(i, cpu_present_mask)) {
 			map[i] = 0;
 			continue;
 		}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2017-06-03 14:04 ` [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping Christoph Hellwig
@ 2017-06-03 14:04 ` Christoph Hellwig
  2017-06-04 15:11   ` Sagi Grimberg
                     ` (2 more replies)
  2017-06-03 14:04 ` [PATCH 8/8] nvme: allocate queues for all possible CPUs Christoph Hellwig
  2017-06-16  6:48 ` spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
  8 siblings, 3 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:04 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations.  Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c             | 120 +++++----------------------------------------
 block/blk-mq.h             |   5 --
 include/linux/cpuhotplug.h |   1 -
 3 files changed, 11 insertions(+), 115 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1bcccedcc74f..66ca9a090984 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -37,9 +37,6 @@
 #include "blk-wbt.h"
 #include "blk-mq-sched.h"
 
-static DEFINE_MUTEX(all_q_mutex);
-static LIST_HEAD(all_q_list);
-
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);
@@ -1966,8 +1963,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		INIT_LIST_HEAD(&__ctx->rq_list);
 		__ctx->queue = q;
 
-		/* If the cpu isn't online, the cpu is mapped to first hctx */
-		if (!cpu_online(i))
+		/* If the cpu isn't present, the cpu is mapped to first hctx */
+		if (!cpu_present(i))
 			continue;
 
 		hctx = blk_mq_map_queue(q, i);
@@ -2010,8 +2007,7 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 	}
 }
 
-static void blk_mq_map_swqueue(struct request_queue *q,
-			       const struct cpumask *online_mask)
+static void blk_mq_map_swqueue(struct request_queue *q)
 {
 	unsigned int i, hctx_idx;
 	struct blk_mq_hw_ctx *hctx;
@@ -2029,13 +2025,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	}
 
 	/*
-	 * Map software to hardware queues
+	 * Map software to hardware queues.
+	 *
+	 * If the cpu isn't present, the cpu is mapped to first hctx.
 	 */
-	for_each_possible_cpu(i) {
-		/* If the cpu isn't online, the cpu is mapped to first hctx */
-		if (!cpumask_test_cpu(i, online_mask))
-			continue;
-
+	for_each_present_cpu(i) {
 		hctx_idx = q->mq_map[i];
 		/* unmapped hw queue can be remapped after CPU topo changed */
 		if (!set->tags[hctx_idx] &&
@@ -2321,16 +2315,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		blk_queue_softirq_done(q, set->ops->complete);
 
 	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
-
-	get_online_cpus();
-	mutex_lock(&all_q_mutex);
-
-	list_add_tail(&q->all_q_node, &all_q_list);
 	blk_mq_add_queue_tag_set(set, q);
-	blk_mq_map_swqueue(q, cpu_online_mask);
-
-	mutex_unlock(&all_q_mutex);
-	put_online_cpus();
+	blk_mq_map_swqueue(q);
 
 	if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
 		int ret;
@@ -2356,18 +2342,12 @@ void blk_mq_free_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
-	mutex_lock(&all_q_mutex);
-	list_del_init(&q->all_q_node);
-	mutex_unlock(&all_q_mutex);
-
 	blk_mq_del_queue_tag_set(q);
-
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q,
-				const struct cpumask *online_mask)
+static void blk_mq_queue_reinit(struct request_queue *q)
 {
 	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
 
@@ -2380,76 +2360,12 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 	 * involves free and re-allocate memory, worthy doing?)
 	 */
 
-	blk_mq_map_swqueue(q, online_mask);
+	blk_mq_map_swqueue(q);
 
 	blk_mq_sysfs_register(q);
 	blk_mq_debugfs_register_hctxs(q);
 }
 
-/*
- * New online cpumask which is going to be set in this hotplug event.
- * Declare this cpumasks as global as cpu-hotplug operation is invoked
- * one-by-one and dynamically allocating this could result in a failure.
- */
-static struct cpumask cpuhp_online_new;
-
-static void blk_mq_queue_reinit_work(void)
-{
-	struct request_queue *q;
-
-	mutex_lock(&all_q_mutex);
-	/*
-	 * We need to freeze and reinit all existing queues.  Freezing
-	 * involves synchronous wait for an RCU grace period and doing it
-	 * one by one may take a long time.  Start freezing all queues in
-	 * one swoop and then wait for the completions so that freezing can
-	 * take place in parallel.
-	 */
-	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_freeze_queue_start(q);
-	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_freeze_queue_wait(q);
-
-	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_queue_reinit(q, &cpuhp_online_new);
-
-	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_unfreeze_queue(q);
-
-	mutex_unlock(&all_q_mutex);
-}
-
-static int blk_mq_queue_reinit_dead(unsigned int cpu)
-{
-	cpumask_copy(&cpuhp_online_new, cpu_online_mask);
-	blk_mq_queue_reinit_work();
-	return 0;
-}
-
-/*
- * Before hotadded cpu starts handling requests, new mappings must be
- * established.  Otherwise, these requests in hw queue might never be
- * dispatched.
- *
- * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
- * for CPU0, and ctx1 for CPU1).
- *
- * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
- * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
- *
- * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
- * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
- * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
- * ignored.
- */
-static int blk_mq_queue_reinit_prepare(unsigned int cpu)
-{
-	cpumask_copy(&cpuhp_online_new, cpu_online_mask);
-	cpumask_set_cpu(cpu, &cpuhp_online_new);
-	blk_mq_queue_reinit_work();
-	return 0;
-}
-
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 {
 	int i;
@@ -2660,7 +2576,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
-		blk_mq_queue_reinit(q, cpu_online_mask);
+		blk_mq_queue_reinit(q);
 	}
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
@@ -2876,24 +2792,10 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 }
 EXPORT_SYMBOL_GPL(blk_mq_poll);
 
-void blk_mq_disable_hotplug(void)
-{
-	mutex_lock(&all_q_mutex);
-}
-
-void blk_mq_enable_hotplug(void)
-{
-	mutex_unlock(&all_q_mutex);
-}
-
 static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
-
-	cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
-				  blk_mq_queue_reinit_prepare,
-				  blk_mq_queue_reinit_dead);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cc67b48e3551..558df56544d2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -56,11 +56,6 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 				bool at_head);
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 				struct list_head *list);
-/*
- * CPU hotplug helpers
- */
-void blk_mq_enable_hotplug(void);
-void blk_mq_disable_hotplug(void);
 
 /*
  * CPU -> queue mappings
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index c15f22c54535..7f815d915977 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -58,7 +58,6 @@ enum cpuhp_state {
 	CPUHP_XEN_EVTCHN_PREPARE,
 	CPUHP_ARM_SHMOBILE_SCU_PREPARE,
 	CPUHP_SH_SH3X_PREPARE,
-	CPUHP_BLK_MQ_PREPARE,
 	CPUHP_NET_FLOW_PREPARE,
 	CPUHP_TOPOLOGY_PREPARE,
 	CPUHP_NET_IUCV_PREPARE,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/8] nvme: allocate queues for all possible CPUs
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
@ 2017-06-03 14:04 ` Christoph Hellwig
  2017-06-04 15:13   ` Sagi Grimberg
  2017-06-16  6:48 ` spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
  8 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-03 14:04 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Unlike most drivers, which simply pass the maximum possible vector count
to pci_alloc_irq_vectors, NVMe needs to configure the device before
allocating the vectors, so it needs a manual update for the new scheme of
using all present CPUs.
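
For comparison, the pattern most drivers use is roughly the following
sketch (not taken from any particular driver):

	/* typical driver: just ask for as many vectors as it could use */
	nr_vecs = pci_alloc_irq_vectors(pdev, 1, max_vecs,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

NVMe instead has to ask the controller how many I/O queues it supports
(via Set Features) before it knows how many vectors to allocate, which
is why the initial nr_io_queues value below needs the explicit switch
from num_online_cpus() to num_present_cpus().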

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d52701df7245..4152d93fbbef 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1525,7 +1525,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues, size;
 
-	nr_io_queues = num_online_cpus();
+	nr_io_queues = num_present_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
 	if (result < 0)
 		return result;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping
  2017-06-03 14:04 ` [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping Christoph Hellwig
@ 2017-06-04 15:11   ` Sagi Grimberg
  0 siblings, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:11 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
@ 2017-06-04 15:11   ` Sagi Grimberg
  2017-06-07  9:10   ` Ming Lei
  2017-06-07 22:04   ` Omar Sandoval
  2 siblings, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:11 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Nice cleanup!

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] nvme: allocate queues for all possible CPUs
  2017-06-03 14:04 ` [PATCH 8/8] nvme: allocate queues for all possible CPUs Christoph Hellwig
@ 2017-06-04 15:13   ` Sagi Grimberg
  0 siblings, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:13 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-nvme, linux-block, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs
  2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
@ 2017-06-04 15:14   ` Sagi Grimberg
  2017-06-17 23:21   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:14 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Looks good to me,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] genirq: move pending helpers to internal.h
  2017-06-03 14:03 ` [PATCH 2/8] genirq: move pending helpers to internal.h Christoph Hellwig
@ 2017-06-04 15:15   ` Sagi Grimberg
  0 siblings, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:15 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Looks good to me,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
@ 2017-06-04 15:15   ` Sagi Grimberg
  2017-06-16 10:23   ` Thomas Gleixner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:15 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Looks good to me,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs
  2017-06-03 14:03 ` [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs Christoph Hellwig
@ 2017-06-04 15:17   ` Sagi Grimberg
  2017-06-22 17:10   ` [tip:irq/core] genirq/affinity: Assign " tip-bot for Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Sagi Grimberg @ 2017-06-04 15:17 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme



On 03/06/17 17:03, Christoph Hellwig wrote:
> Currently we only assign spread vectors to online CPUs, which ties the
> IRQ mapping to the currently online devices and doesn't deal nicely with
> the fact that CPUs could come and go rapidly due to e.g. power management.
> 
> Instead assign vectors to all present CPUs to avoid this churn.
> 
> For this we have to build a map of all possible CPUs for a give node, as

s/give/given/

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
  2017-06-04 15:11   ` Sagi Grimberg
@ 2017-06-07  9:10   ` Ming Lei
  2017-06-07 19:06     ` Christoph Hellwig
  2017-06-07 22:04   ` Omar Sandoval
  2 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2017-06-07  9:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Jens Axboe, Keith Busch, linux-nvme,
	linux-block, linux-kernel

On Sat, Jun 03, 2017 at 04:04:02PM +0200, Christoph Hellwig wrote:
> Currently we only create hctx for online CPUs, which can lead to a lot
> of churn due to frequent soft offline / online operations.  Instead
> allocate one for each present CPU to avoid this and dramatically simplify
> the code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c             | 120 +++++----------------------------------------
>  block/blk-mq.h             |   5 --
>  include/linux/cpuhotplug.h |   1 -
>  3 files changed, 11 insertions(+), 115 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1bcccedcc74f..66ca9a090984 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -37,9 +37,6 @@
>  #include "blk-wbt.h"
>  #include "blk-mq-sched.h"
>  
> -static DEFINE_MUTEX(all_q_mutex);
> -static LIST_HEAD(all_q_list);
> -
>  static void blk_mq_poll_stats_start(struct request_queue *q);
>  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>  static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);
> @@ -1966,8 +1963,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>  		INIT_LIST_HEAD(&__ctx->rq_list);
>  		__ctx->queue = q;
>  
> -		/* If the cpu isn't online, the cpu is mapped to first hctx */
> -		if (!cpu_online(i))
> +		/* If the cpu isn't present, the cpu is mapped to first hctx */
> +		if (!cpu_present(i))
>  			continue;
>  
>  		hctx = blk_mq_map_queue(q, i);
> @@ -2010,8 +2007,7 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
>  	}
>  }
>  
> -static void blk_mq_map_swqueue(struct request_queue *q,
> -			       const struct cpumask *online_mask)
> +static void blk_mq_map_swqueue(struct request_queue *q)
>  {
>  	unsigned int i, hctx_idx;
>  	struct blk_mq_hw_ctx *hctx;
> @@ -2029,13 +2025,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
>  	}
>  
>  	/*
> -	 * Map software to hardware queues
> +	 * Map software to hardware queues.
> +	 *
> +	 * If the cpu isn't present, the cpu is mapped to first hctx.
>  	 */
> -	for_each_possible_cpu(i) {
> -		/* If the cpu isn't online, the cpu is mapped to first hctx */
> -		if (!cpumask_test_cpu(i, online_mask))
> -			continue;
> -
> +	for_each_present_cpu(i) {
>  		hctx_idx = q->mq_map[i];
>  		/* unmapped hw queue can be remapped after CPU topo changed */
>  		if (!set->tags[hctx_idx] &&
> @@ -2321,16 +2315,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  		blk_queue_softirq_done(q, set->ops->complete);
>  
>  	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
> -
> -	get_online_cpus();
> -	mutex_lock(&all_q_mutex);
> -
> -	list_add_tail(&q->all_q_node, &all_q_list);
>  	blk_mq_add_queue_tag_set(set, q);
> -	blk_mq_map_swqueue(q, cpu_online_mask);
> -
> -	mutex_unlock(&all_q_mutex);
> -	put_online_cpus();
> +	blk_mq_map_swqueue(q);
>  
>  	if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
>  		int ret;
> @@ -2356,18 +2342,12 @@ void blk_mq_free_queue(struct request_queue *q)
>  {
>  	struct blk_mq_tag_set	*set = q->tag_set;
>  
> -	mutex_lock(&all_q_mutex);
> -	list_del_init(&q->all_q_node);
> -	mutex_unlock(&all_q_mutex);
> -
>  	blk_mq_del_queue_tag_set(q);
> -
>  	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
>  }
>  
>  /* Basically redo blk_mq_init_queue with queue frozen */
> -static void blk_mq_queue_reinit(struct request_queue *q,
> -				const struct cpumask *online_mask)
> +static void blk_mq_queue_reinit(struct request_queue *q)
>  {
>  	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
>  
> @@ -2380,76 +2360,12 @@ static void blk_mq_queue_reinit(struct request_queue *q,
>  	 * involves free and re-allocate memory, worthy doing?)
>  	 */
>  
> -	blk_mq_map_swqueue(q, online_mask);
> +	blk_mq_map_swqueue(q);
>  
>  	blk_mq_sysfs_register(q);
>  	blk_mq_debugfs_register_hctxs(q);
>  }
>  
> -/*
> - * New online cpumask which is going to be set in this hotplug event.
> - * Declare this cpumasks as global as cpu-hotplug operation is invoked
> - * one-by-one and dynamically allocating this could result in a failure.
> - */
> -static struct cpumask cpuhp_online_new;
> -
> -static void blk_mq_queue_reinit_work(void)
> -{
> -	struct request_queue *q;
> -
> -	mutex_lock(&all_q_mutex);
> -	/*
> -	 * We need to freeze and reinit all existing queues.  Freezing
> -	 * involves synchronous wait for an RCU grace period and doing it
> -	 * one by one may take a long time.  Start freezing all queues in
> -	 * one swoop and then wait for the completions so that freezing can
> -	 * take place in parallel.
> -	 */
> -	list_for_each_entry(q, &all_q_list, all_q_node)
> -		blk_freeze_queue_start(q);
> -	list_for_each_entry(q, &all_q_list, all_q_node)
> -		blk_mq_freeze_queue_wait(q);
> -
> -	list_for_each_entry(q, &all_q_list, all_q_node)
> -		blk_mq_queue_reinit(q, &cpuhp_online_new);
> -
> -	list_for_each_entry(q, &all_q_list, all_q_node)
> -		blk_mq_unfreeze_queue(q);
> -
> -	mutex_unlock(&all_q_mutex);
> -}
> -
> -static int blk_mq_queue_reinit_dead(unsigned int cpu)
> -{
> -	cpumask_copy(&cpuhp_online_new, cpu_online_mask);
> -	blk_mq_queue_reinit_work();
> -	return 0;
> -}
> -
> -/*
> - * Before hotadded cpu starts handling requests, new mappings must be
> - * established.  Otherwise, these requests in hw queue might never be
> - * dispatched.
> - *
> - * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
> - * for CPU0, and ctx1 for CPU1).
> - *
> - * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
> - * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
> - *
> - * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
> - * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> - * ignored.
> - */
> -static int blk_mq_queue_reinit_prepare(unsigned int cpu)
> -{
> -	cpumask_copy(&cpuhp_online_new, cpu_online_mask);
> -	cpumask_set_cpu(cpu, &cpuhp_online_new);
> -	blk_mq_queue_reinit_work();
> -	return 0;
> -}
> -
>  static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
>  {
>  	int i;
> @@ -2660,7 +2576,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	blk_mq_update_queue_map(set);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		blk_mq_realloc_hw_ctxs(set, q);
> -		blk_mq_queue_reinit(q, cpu_online_mask);
> +		blk_mq_queue_reinit(q);
>  	}
>  
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
> @@ -2876,24 +2792,10 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_poll);
>  
> -void blk_mq_disable_hotplug(void)
> -{
> -	mutex_lock(&all_q_mutex);
> -}
> -
> -void blk_mq_enable_hotplug(void)
> -{
> -	mutex_unlock(&all_q_mutex);
> -}
> -
>  static int __init blk_mq_init(void)
>  {
>  	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
>  				blk_mq_hctx_notify_dead);
> -
> -	cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
> -				  blk_mq_queue_reinit_prepare,
> -				  blk_mq_queue_reinit_dead);

Hi Christoph,

One thing I am not sure about is whether we need to handle new CPU
hotplug after initialization.  Without a CPU hotplug handler, the system
may not scale well when more CPUs are added to the sockets.

Another thing is that I don't see how NVMe handles this situation:
blk_mq_update_nr_hw_queues() is called from nvme_reset_work(), so does
that mean a RESET needs to be triggered after new CPUs are added to the
system?  I have tried to add new CPUs at runtime on QEMU, and I don't
see new hw queues being added, no matter whether this patchset is
applied or not.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-07  9:10   ` Ming Lei
@ 2017-06-07 19:06     ` Christoph Hellwig
  2017-06-08  2:28       ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-07 19:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, Keith Busch,
	linux-nvme, linux-block, linux-kernel

On Wed, Jun 07, 2017 at 05:10:46PM +0800, Ming Lei wrote:
> One thing not sure is that we may need to handle new CPU hotplug
> after initialization. Without the CPU hotplug handler, system may
> not scale well when more CPUs are added to sockets.

Adding physical CPUs to sockets is a very rare activity, and we
should not optimize for it.  Taking CPUs that are physically present
offline and online is the case that is interesting, and that's what
this patchset changes.

> Another thing is that I don't see how NVMe handles this situation,
> blk_mq_update_nr_hw_queues() is called in nvme_reset_work(), so
> that means RESET need to be triggered after new CPUs are added to
> system?

Yes.

> I have tried to add new CPUs runtime on Qemu, and not see
> new hw queues are added no matter this patchset is applied or not.

Do you even see the CPUs in your VM?  For physical hotplug you'll
need to reserve spots in the cpumap beforehand.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
  2017-06-04 15:11   ` Sagi Grimberg
  2017-06-07  9:10   ` Ming Lei
@ 2017-06-07 22:04   ` Omar Sandoval
  2017-06-08  6:58     ` Christoph Hellwig
  2 siblings, 1 reply; 31+ messages in thread
From: Omar Sandoval @ 2017-06-07 22:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Jens Axboe, Keith Busch, linux-nvme,
	linux-block, linux-kernel

On Sat, Jun 03, 2017 at 04:04:02PM +0200, Christoph Hellwig wrote:
> Currently we only create hctx for online CPUs, which can lead to a lot
> of churn due to frequent soft offline / online operations.  Instead
> allocate one for each present CPU to avoid this and dramatically simplify
> the code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Oh man, this cleanup is great. Did you run blktests on this? block/008
does a bunch of hotplugging while I/O is running.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-07 19:06     ` Christoph Hellwig
@ 2017-06-08  2:28       ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2017-06-08  2:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Jens Axboe, Keith Busch, linux-nvme,
	linux-block, linux-kernel

On Wed, Jun 07, 2017 at 09:06:59PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 07, 2017 at 05:10:46PM +0800, Ming Lei wrote:
> > One thing not sure is that we may need to handle new CPU hotplug
> > after initialization. Without the CPU hotplug handler, system may
> > not scale well when more CPUs are added to sockets.
> 
> Adding physical CPUs to sockets is a very rare activity, and we
> should not optimize for it.  Taking CPUs that are physically present
> offline and online is the case that is interesting, and that's what
> this patchset changes.

Yeah, I understand.  It depends on whether there are real use cases in
which the system can't be rebooted but CPUs still need to be added or
removed for some reason.  It looks like CONFIG_HOTPLUG_CPU addresses
this requirement.

From searching on Google, it looks like there are some such cases for
virtualization, and both QEMU and VMware support that.

> 
> > Another thing is that I don't see how NVMe handles this situation,
> > blk_mq_update_nr_hw_queues() is called in nvme_reset_work(), so
> > that means RESET need to be triggered after new CPUs are added to
> > system?
> 
> Yes.

Unfortunately I don't see the reset happen when new CPU cores become
online.

> 
> > I have tried to add new CPUs runtime on Qemu, and not see
> > new hw queues are added no matter this patchset is applied or not.
> 
> Do you even see the CPUs in your VM?  For physical hotplug you'll
> need to reserve spots in the cpumap beforehand.

Yes, I can add the new CPUs from the QEMU console, and they become
present in the VM first and are then switched online via the command
line after the hotplug, but the number of hw queues doesn't change.

Please see the following log:

root@ming:/sys/kernel/debug/block# echo "before adding one CPU"
before adding one CPU
root@ming:/sys/kernel/debug/block#
root@ming:/sys/kernel/debug/block# lscpu | head -n 10
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
root@ming:/sys/kernel/debug/block# ls nvme0n1/
hctx0  hctx1  hctx2  hctx3  poll_stat  requeue_list  state
root@ming:/sys/kernel/debug/block#
root@ming:/sys/kernel/debug/block# echo "one CPU will be added"
one CPU will be added
root@ming:/sys/kernel/debug/block#
device: 'cpu4': device_add
bus: 'cpu': add device cpu4
PM: Adding info for cpu:cpu4
CPU4 has been hot-added
bus: 'cpu': driver_probe_device: matched device cpu4 with driver processor
bus: 'cpu': really_probe: probing driver processor with device cpu4
processor cpu4: no default pinctrl state
devices_kset: Moving cpu4 to end of list
driver: 'processor': driver_bound: bound to device 'cpu4'
bus: 'cpu': really_probe: bound device cpu4 to driver processor

root@ming:/sys/kernel/debug/block# echo 1 > /sys/devices/system/cpu/cpu4/online
smpboot: Booting Node 0 Processor 4 APIC 0x4/sys/devices/system/cpu/cpu4/online
kvm-clock: cpu 4, msr 2:7ff5e101, secondary cpu clock
TSC ADJUST compensate: CPU4 observed 349440008297 warp. Adjust: 2147483647
TSC ADJUST compensate: CPU4 observed 347292524597 warp. Adjust: 2147483647
TSC synchronization [CPU#1 -> CPU#4]:
Measured 347292524561 cycles TSC warp between CPUs, turning off TSC clock.
tsc: Marking TSC unstable due to check_tsc_sync_source failed
KVM setup async PF for cpu 4
kvm-stealtime: cpu 4, msr 27fd0d900
Will online and init hotplugged CPU: 4
device: 'cooling_device4': device_add
PM: Adding info for No Bus:cooling_device4
device: 'cache': device_add
PM: Adding info for No Bus:cache
device: 'index0': device_add
PM: Adding info for No Bus:index0
device: 'index1': device_add
PM: Adding info for No Bus:index1
device: 'index2': device_add
PM: Adding info for No Bus:index2
device: 'index3': device_add
PM: Adding info for No Bus:index3
device: 'machinecheck4': device_add
bus: 'machinecheck': add device machinecheck4
PM: Adding info for machinecheck:machinecheck4
root@ming:/sys/kernel/debug/block# lscpu | head -n 10
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                5
On-line CPU(s) list:   0-4
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             3
NUMA node(s):          1
Vendor ID:             GenuineIntel
root@ming:/sys/kernel/debug/block# ls nvme0n1/
hctx0  hctx1  hctx2  hctx3  poll_stat  requeue_list  state


Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] blk-mq: create hctx for each present CPU
  2017-06-07 22:04   ` Omar Sandoval
@ 2017-06-08  6:58     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-08  6:58 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Christoph Hellwig, Thomas Gleixner, Jens Axboe, Keith Busch,
	linux-nvme, linux-block, linux-kernel

On Wed, Jun 07, 2017 at 03:04:11PM -0700, Omar Sandoval wrote:
> On Sat, Jun 03, 2017 at 04:04:02PM +0200, Christoph Hellwig wrote:
> > Currently we only create hctx for online CPUs, which can lead to a lot
> > of churn due to frequent soft offline / online operations.  Instead
> > allocate one for each present CPU to avoid this and dramatically simplify
> > the code.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Oh man, this cleanup is great. Did you run blktests on this? block/008
> does a bunch of hotplugging while I/O is running.

I haven't run blktests yet; in fact, blktests didn't exist yet when I did
this work.  But thanks for the reminder, I'll run it now.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: spread MSI(-X) vectors to all possible CPUs V2
  2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2017-06-03 14:04 ` [PATCH 8/8] nvme: allocate queues for all possible CPUs Christoph Hellwig
@ 2017-06-16  6:48 ` Christoph Hellwig
  2017-06-16  7:28   ` Thomas Gleixner
  8 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-06-16  6:48 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Keith Busch, linux-block, linux-kernel, linux-nvme

Thomas,

can you take a look at the generic patches as they are the required
base for the block work?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: spread MSI(-X) vectors to all possible CPUs V2
  2017-06-16  6:48 ` spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
@ 2017-06-16  7:28   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16  7:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-block, linux-kernel, linux-nvme

On Fri, 16 Jun 2017, Christoph Hellwig wrote:
> can you take a look at the generic patches as they are the required
> base for the block work?

It's next on my ever growing todo list....

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
  2017-06-04 15:15   ` Sagi Grimberg
@ 2017-06-16 10:23   ` Thomas Gleixner
  2017-06-16 11:08   ` Thomas Gleixner
  2017-06-17 23:14   ` Thomas Gleixner
  3 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16 10:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)

This should be named irq_affinity_force() because it circumvents the 'move
in irq context' mechanism. I'll do that myself. No need to resend.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
  2017-06-03 14:04 ` [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events Christoph Hellwig
@ 2017-06-16 10:26   ` Thomas Gleixner
  2017-06-16 10:29   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16 10:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> +				    unsigned int cpu)
> +{
> +
> +	cpumask_and(mask, affinity, cpu_online_mask);
> +	cpumask_set_cpu(cpu, mask);
> +	if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> +		irq_startup(desc, false);
> +		irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> +	} else {
> +		irq_affinity_set(irq, desc, mask);

I don't think this should be forced. In that case we really can use the
regular mechanism.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events
  2017-06-03 14:04 ` [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events Christoph Hellwig
  2017-06-16 10:26   ` Thomas Gleixner
@ 2017-06-16 10:29   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16 10:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> +				    unsigned int cpu)
> +{
> +	const struct cpumask *affinity;
> +	struct irq_data *data;
> +	struct irq_chip *chip;
> +	unsigned long flags;
> +	cpumask_var_t mask;
> +
> +	if (!desc)
> +		return;
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> +		return;

That's silly. Why would you want to do that for each irq descriptor?
Either allocate it once in irq_affinity_[on|off]line_cpu() or just make it
a static cpumask. Lemme fix that up.
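
(A minimal sketch of how both review points -- the regular, non-forced
affinity call suggested in the previous mail and avoiding the
per-descriptor allocation -- could look.  This is illustrative only:
irq_affinity_online_irq() and IRQD_AFFINITY_SUSPENDED come from this
patch series, the rest is standard genirq API:)

/*
 * Single scratch mask; sketch assumes the CPU hotplug online callbacks
 * are serialized by the hotplug code, so no locking is shown here.
 */
static struct cpumask irq_affinity_online_mask;

static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
				    unsigned int cpu)
{
	struct cpumask *mask = &irq_affinity_online_mask;
	struct irq_data *data;

	if (!desc)
		return;

	data = irq_desc_get_irq_data(desc);
	cpumask_and(mask, irq_data_get_affinity_mask(data), cpu_online_mask);
	cpumask_set_cpu(cpu, mask);

	if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
		irq_startup(desc, false);
		irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
	} else {
		/*
		 * Regular path: the core defers the move to interrupt
		 * context if it cannot be done from process context.
		 */
		irq_set_affinity(irq, mask);
	}
}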

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
  2017-06-04 15:15   ` Sagi Grimberg
  2017-06-16 10:23   ` Thomas Gleixner
@ 2017-06-16 11:08   ` Thomas Gleixner
  2017-06-16 12:00     ` Thomas Gleixner
  2017-06-17 23:14   ` Thomas Gleixner
  3 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16 11:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> +{
> +	struct irq_data *data = irq_desc_get_irq_data(desc);
> +	struct irq_chip *chip = irq_data_get_irq_chip(data);
> +	bool ret = false;
> +
> +	if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> +		chip->irq_mask(data);
> +
> +	if (chip->irq_set_affinity) {
> +		if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> +			pr_crit("IRQ %d set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
> +		ret = true;
> +	}
> +
> +	/*
> +	 * We unmask if the irq was not marked masked by the core code.
> +	 * That respects the lazy irq disable behaviour.
> +	 */
> +	if (!irq_can_move_pcntxt(data) &&
> +	    !irqd_irq_masked(data) && chip->irq_unmask)
> +		chip->irq_unmask(data);

There is another issue with this. Nothing updates the affinity mask in
irq_desc when we just invoke the chip callback directly. Let me have a look.
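
(One possible shape of that bookkeeping, for illustration only: mirror
what irq_do_set_affinity() does and copy the mask into the descriptor
once the chip has accepted it.  This is a sketch of the idea, not the fix
that was eventually applied, and the pr_crit() message is shortened:)

bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
{
	struct irq_data *data = irq_desc_get_irq_data(desc);
	struct irq_chip *chip = irq_data_get_irq_chip(data);
	bool ret = false;

	if (!irq_can_move_pcntxt(data) && chip->irq_mask)
		chip->irq_mask(data);

	if (chip->irq_set_affinity) {
		switch (chip->irq_set_affinity(data, mask, true)) {
		case IRQ_SET_MASK_OK:
		case IRQ_SET_MASK_OK_DONE:
			/* Keep the descriptor's view of the affinity in sync. */
			cpumask_copy(desc->irq_common_data.affinity, mask);
			break;
		case -ENOSPC:
			pr_crit("IRQ %d set affinity failed: no available vectors\n",
				irq);
			break;
		}
		ret = true;
	}

	/*
	 * Unmask only if the irq was not masked by the core code, which
	 * preserves the lazy irq disable behaviour.
	 */
	if (!irq_can_move_pcntxt(data) &&
	    !irqd_irq_masked(data) && chip->irq_unmask)
		chip->irq_unmask(data);

	return ret;
}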

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-16 11:08   ` Thomas Gleixner
@ 2017-06-16 12:00     ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-16 12:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Fri, 16 Jun 2017, Thomas Gleixner wrote:
> On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> > +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> > +{
> > +	struct irq_data *data = irq_desc_get_irq_data(desc);
> > +	struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +	bool ret = false;
> > +
> > +	if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> > +		chip->irq_mask(data);
> > +
> > +	if (chip->irq_set_affinity) {
> > +		if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> > +			pr_crit("IRQ %d set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
> > +		ret = true;
> > +	}
> > +
> > +	/*
> > +	 * We unmask if the irq was not marked masked by the core code.
> > +	 * That respects the lazy irq disable behaviour.
> > +	 */
> > +	if (!irq_can_move_pcntxt(data) &&
> > +	    !irqd_irq_masked(data) && chip->irq_unmask)
> > +		chip->irq_unmask(data);
> 
> There is another issue with this. Nothing updates the affinity mask in
> irq_desc, when we just invoke the chip callback. Let me have a look.

Indeed. So that magic you do in the next patches (the hotplug callbacks)
only works properly for affinity masks with a single CPU set.

The problem is that we don't have a distinction between the requested
affinity mask (e.g. set via /proc/irq/<n>/smp_affinity) and the effective
affinity mask.

Needs more thought.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper
  2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
                     ` (2 preceding siblings ...)
  2017-06-16 11:08   ` Thomas Gleixner
@ 2017-06-17 23:14   ` Thomas Gleixner
  3 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-17 23:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> +{
> +	struct irq_data *data = irq_desc_get_irq_data(desc);
> +	struct irq_chip *chip = irq_data_get_irq_chip(data);
> +	bool ret = false;
> +
> +	if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> +		chip->irq_mask(data);
> +
> +	if (chip->irq_set_affinity) {
> +		if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> +			pr_crit("IRQ %d set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
> +		ret = true;
> +	}
> +
> +	/*
> +	 * We unmask if the irq was not marked masked by the core code.
> +	 * That respects the lazy irq disable behaviour.
> +	 */
> +	if (!irq_can_move_pcntxt(data) &&
> +	    !irqd_irq_masked(data) && chip->irq_unmask)
> +		chip->irq_unmask(data);
> +
> +	return ret;
> +}

That needs even more care, as this does not include the handling for a move
in progress, which will make your affinity setting fail.

Ideally we'd include that managed shutdown magic in fixup_irqs(), but that's
arch-specific. So this needs the long dragged-out update to the generic irq
CPU hotplug code to handle the x86'isms properly.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs
  2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
  2017-06-04 15:14   ` Sagi Grimberg
@ 2017-06-17 23:21   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2017-06-17 23:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block, linux-kernel

On Sat, 3 Jun 2017, Christoph Hellwig wrote:

> This will allow us to spread MSI/MSI-X affinity over all present CPUs and
> thus better deal with systems where CPUs are taken online and offline all
> the time.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  kernel/irq/manage.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 070be980c37a..5c25d4a5dc46 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -361,17 +361,17 @@ static int setup_affinity(struct irq_desc *desc, struct cpumask *mask)
>  	if (irqd_affinity_is_managed(&desc->irq_data) ||
>  	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
>  		if (cpumask_intersects(desc->irq_common_data.affinity,
> -				       cpu_online_mask))
> +				       cpu_present_mask))
>  			set = desc->irq_common_data.affinity;
>  		else
>  			irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
>  	}
>  
> -	cpumask_and(mask, cpu_online_mask, set);
> +	cpumask_and(mask, cpu_present_mask, set);
>  	if (node != NUMA_NO_NODE) {
>  		const struct cpumask *nodemask = cpumask_of_node(node);
>  
> -		/* make sure at least one of the cpus in nodemask is online */
> +		/* make sure at least one of the cpus in nodemask is present */
>  		if (cpumask_intersects(mask, nodemask))
>  			cpumask_and(mask, mask, nodemask);
>  	}

This is a dangerous one. It might break existing setups subtly. If the
AFFINITY_SET flag is set, this tries to preserve the user-supplied affinity
mask, so we might end up with some mask which does not contain any online
CPU. Not what we want.

We really need to separate the handling of managed interrupts from the
regular ones. Otherwise we end up with hard-to-debug issues. Cramming stuff
into the existing code does not solve the problem, it just creates new ones.
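
(For illustration, the separation could be as simple as choosing the base
mask depending on whether the interrupt is managed.  This is a sketch only;
irq_setup_affinity_base() is a made-up helper name, not code from the
series:)

/*
 * Managed interrupts may target present-but-offline CPUs; user-supplied
 * masks for regular interrupts keep being validated against online CPUs.
 */
static const struct cpumask *irq_setup_affinity_base(struct irq_desc *desc)
{
	if (irqd_affinity_is_managed(&desc->irq_data))
		return cpu_present_mask;
	return cpu_online_mask;
}

setup_affinity() would then do cpumask_and(mask, irq_setup_affinity_base(desc), set)
instead of hard-coding either cpu_online_mask or cpu_present_mask.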

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [tip:irq/core] genirq/affinity: Assign vectors to all present CPUs
  2017-06-03 14:03 ` [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs Christoph Hellwig
  2017-06-04 15:17   ` Sagi Grimberg
@ 2017-06-22 17:10   ` tip-bot for Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: tip-bot for Christoph Hellwig @ 2017-06-22 17:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, sagi, axboe, keith.busch, hch, hpa, linux-kernel, peterz,
	mpe, marc.zyngier, mingo

Commit-ID:  9a0ef98e186d86fb3c1ff3ec267a76f067005f74
Gitweb:     http://git.kernel.org/tip/9a0ef98e186d86fb3c1ff3ec267a76f067005f74
Author:     Christoph Hellwig <hch@lst.de>
AuthorDate: Tue, 20 Jun 2017 01:37:55 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 22 Jun 2017 18:21:26 +0200

genirq/affinity: Assign vectors to all present CPUs

Currently the irq vector spread algorithm is restricted to online CPUs,
which ties the IRQ mapping to the set of CPUs online at setup time and
doesn't deal nicely with the fact that CPUs can come and go rapidly due to
e.g. power management.

Instead assign vectors to all present CPUs to avoid this churn.

Build a map of all present CPUs for a given node, as the architectures
only provide a map of all online CPUs. Do this dynamically on each call
for the vector assignments, which is a bit suboptimal and could be
optimized in the future by providing such a mapping from the arch code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-nvme@lists.infradead.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170603140403.27379-5-hch@lst.de

---
 kernel/irq/affinity.c | 76 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 63 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e2d356d..d2747f9 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,4 +1,7 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
@@ -35,13 +38,54 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 	}
 }
 
-static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t *nodemsk)
+static cpumask_var_t *alloc_node_to_present_cpumask(void)
+{
+	cpumask_var_t *masks;
+	int node;
+
+	masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+	if (!masks)
+		return NULL;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
+			goto out_unwind;
+	}
+
+	return masks;
+
+out_unwind:
+	while (--node >= 0)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+	return NULL;
+}
+
+static void free_node_to_present_cpumask(cpumask_var_t *masks)
+{
+	int node;
+
+	for (node = 0; node < nr_node_ids; node++)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+}
+
+static void build_node_to_present_cpumask(cpumask_var_t *masks)
+{
+	int cpu;
+
+	for_each_present_cpu(cpu)
+		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
+}
+
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+				const struct cpumask *mask, nodemask_t *nodemsk)
 {
 	int n, nodes = 0;
 
 	/* Calculate the number of nodes in the supplied affinity mask */
-	for_each_online_node(n) {
-		if (cpumask_intersects(mask, cpumask_of_node(n))) {
+	for_each_node(n) {
+		if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
 			node_set(n, *nodemsk);
 			nodes++;
 		}
@@ -64,7 +108,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int last_affv = affv + affd->pre_vectors;
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct cpumask *masks;
-	cpumask_var_t nmsk;
+	cpumask_var_t nmsk, *node_to_present_cpumask;
 
 	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
 		return NULL;
@@ -73,13 +117,19 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (!masks)
 		goto out;
 
+	node_to_present_cpumask = alloc_node_to_present_cpumask();
+	if (!node_to_present_cpumask)
+		goto out;
+
 	/* Fill out vectors at the beginning that don't need affinity */
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(masks + curvec, irq_default_affinity);
 
 	/* Stabilize the cpumasks */
 	get_online_cpus();
-	nodes = get_nodes_in_cpumask(cpu_online_mask, &nodemsk);
+	build_node_to_present_cpumask(node_to_present_cpumask);
+	nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
+				     &nodemsk);
 
 	/*
 	 * If the number of nodes in the mask is greater than or equal the
@@ -87,7 +137,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	 */
 	if (affv <= nodes) {
 		for_each_node_mask(n, nodemsk) {
-			cpumask_copy(masks + curvec, cpumask_of_node(n));
+			cpumask_copy(masks + curvec,
+				     node_to_present_cpumask[n]);
 			if (++curvec == last_affv)
 				break;
 		}
@@ -101,7 +152,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
 		/* Get the cpus on this node which are in the mask */
-		cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
+		cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
 
 		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
@@ -133,6 +184,7 @@ done:
 	/* Fill out vectors at the end that don't need affinity */
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(masks + curvec, irq_default_affinity);
+	free_node_to_present_cpumask(node_to_present_cpumask);
 out:
 	free_cpumask_var(nmsk);
 	return masks;
@@ -147,12 +199,10 @@ int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd)
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
-	int cpus;
+	int ret;
 
-	/* Stabilize the cpumasks */
 	get_online_cpus();
-	cpus = cpumask_weight(cpu_online_mask);
+	ret = min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 	put_online_cpus();
-
-	return min(cpus, vecs) + resv;
+	return ret;
 }
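
(For context, drivers normally reach irq_create_affinity_masks() through
pci_alloc_irq_vectors_affinity().  A usage sketch follows; the function
name, 'pdev' and the vector counts are placeholders, not code from this
series:)

#include <linux/pci.h>
#include <linux/interrupt.h>

/*
 * Ask for one MSI-X vector per I/O queue plus one unmanaged admin vector;
 * with the patch above the core spreads the managed vectors over all
 * present CPUs instead of only the online ones.
 */
static int example_setup_irqs(struct pci_dev *pdev, int nr_io_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* admin vector keeps default affinity */
	};

	return pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}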

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-06-22 17:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-03 14:03 spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
2017-06-03 14:03 ` [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs Christoph Hellwig
2017-06-04 15:14   ` Sagi Grimberg
2017-06-17 23:21   ` Thomas Gleixner
2017-06-03 14:03 ` [PATCH 2/8] genirq: move pending helpers to internal.h Christoph Hellwig
2017-06-04 15:15   ` Sagi Grimberg
2017-06-03 14:03 ` [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper Christoph Hellwig
2017-06-04 15:15   ` Sagi Grimberg
2017-06-16 10:23   ` Thomas Gleixner
2017-06-16 11:08   ` Thomas Gleixner
2017-06-16 12:00     ` Thomas Gleixner
2017-06-17 23:14   ` Thomas Gleixner
2017-06-03 14:03 ` [PATCH 4/8] genirq/affinity: assign vectors to all present CPUs Christoph Hellwig
2017-06-04 15:17   ` Sagi Grimberg
2017-06-22 17:10   ` [tip:irq/core] genirq/affinity: Assign " tip-bot for Christoph Hellwig
2017-06-03 14:04 ` [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events Christoph Hellwig
2017-06-16 10:26   ` Thomas Gleixner
2017-06-16 10:29   ` Thomas Gleixner
2017-06-03 14:04 ` [PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping Christoph Hellwig
2017-06-04 15:11   ` Sagi Grimberg
2017-06-03 14:04 ` [PATCH 7/8] blk-mq: create hctx for each present CPU Christoph Hellwig
2017-06-04 15:11   ` Sagi Grimberg
2017-06-07  9:10   ` Ming Lei
2017-06-07 19:06     ` Christoph Hellwig
2017-06-08  2:28       ` Ming Lei
2017-06-07 22:04   ` Omar Sandoval
2017-06-08  6:58     ` Christoph Hellwig
2017-06-03 14:04 ` [PATCH 8/8] nvme: allocate queues for all possible CPUs Christoph Hellwig
2017-06-04 15:13   ` Sagi Grimberg
2017-06-16  6:48 ` spread MSI(-X) vectors to all possible CPUs V2 Christoph Hellwig
2017-06-16  7:28   ` Thomas Gleixner
