linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread
@ 2021-07-19  9:57 Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 1/7] genirq/affinity: remove the 'firstvec' parameter from irq_build_affinity_masks Ming Lei
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

Hello,

irq_build_affinity_masks() actually grouping CPUs evenly into each managed
irq vector according to NUMA and CPU locality, and it is reasonable to abstract
one generic API for grouping CPUs evenly, the idea is suggested by Thomas
Gleixner.

group_cpus_evenly() is abstracted and put into lib/, so blk-mq can re-use
it to build default queue mapping.

Please comments!

Thanks,

Ming Lei (7):
  genirq/affinity: remove the 'firstvec' parameter from
    irq_build_affinity_masks
  genirq/affinity: pass affinity managed mask array to
    irq_build_affinity_masks
  genirq/affinity: don't pass irq_affinity_desc array to
    irq_build_affinity_masks
  genirq/affinity: rename irq_build_affinity_masks as group_cpus_evenly
  genirq/affinity: move group_cpus_evenly() into lib/
  lib/group_cpus: allow to group cpus in case of !CONFIG_SMP
  blk-mq: build default queue map via group_cpus_evenly()

 block/blk-mq-cpumap.c      |  64 ++----
 include/linux/group_cpus.h |  28 +++
 kernel/irq/affinity.c      | 404 +-----------------------------------
 lib/Makefile               |   2 +
 lib/group_cpus.c           | 413 +++++++++++++++++++++++++++++++++++++
 5 files changed, 465 insertions(+), 446 deletions(-)
 create mode 100644 include/linux/group_cpus.h
 create mode 100644 lib/group_cpus.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/7] genirq/affinity: remove the 'firstvec' parameter from irq_build_affinity_masks
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 2/7] genirq/affinity: pass affinity managed mask array to irq_build_affinity_masks Ming Lei
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

The 'firstvec' parameter is always same with the parameter of
'startvec', so use 'startvec' directly inside irq_build_affinity_masks().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 735f697d7d15..0c535591f2f2 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -336,10 +336,10 @@ static int __irq_build_affinity_masks(unsigned int startvec,
  *	2) spread other possible CPUs on these vectors
  */
 static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
-				    unsigned int firstvec,
 				    struct irq_affinity_desc *masks)
 {
 	unsigned int curvec = startvec, nr_present = 0, nr_others = 0;
+	unsigned int firstvec = startvec;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -469,8 +469,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 		unsigned int this_vecs = affd->set_size[i];
 		int ret;
 
-		ret = irq_build_affinity_masks(curvec, this_vecs,
-					       curvec, masks);
+		ret = irq_build_affinity_masks(curvec, this_vecs, masks);
 		if (ret) {
 			kfree(masks);
 			return NULL;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/7] genirq/affinity: pass affinity managed mask array to irq_build_affinity_masks
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 1/7] genirq/affinity: remove the 'firstvec' parameter from irq_build_affinity_masks Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 3/7] genirq/affinity: don't pass irq_affinity_desc " Ming Lei
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

Pass affinity managed mask array to irq_build_affinity_masks() so that
index of the first affinity managed vector is always zero, then we can
simplify the implementation a bit.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 0c535591f2f2..c8da750f393b 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -246,14 +246,13 @@ static void alloc_nodes_vectors(unsigned int numvecs,
 
 static int __irq_build_affinity_masks(unsigned int startvec,
 				      unsigned int numvecs,
-				      unsigned int firstvec,
 				      cpumask_var_t *node_to_cpumask,
 				      const struct cpumask *cpu_mask,
 				      struct cpumask *nmsk,
 				      struct irq_affinity_desc *masks)
 {
 	unsigned int i, n, nodes, cpus_per_vec, extra_vecs, done = 0;
-	unsigned int last_affv = firstvec + numvecs;
+	unsigned int last_affv = numvecs;
 	unsigned int curvec = startvec;
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct node_vectors *node_vectors;
@@ -272,7 +271,7 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 			cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
 				   node_to_cpumask[n]);
 			if (++curvec == last_affv)
-				curvec = firstvec;
+				curvec = 0;
 		}
 		return numvecs;
 	}
@@ -320,7 +319,7 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 			 * may start anywhere
 			 */
 			if (curvec >= last_affv)
-				curvec = firstvec;
+				curvec = 0;
 			irq_spread_init_one(&masks[curvec].mask, nmsk,
 						cpus_per_vec);
 		}
@@ -335,11 +334,10 @@ static int __irq_build_affinity_masks(unsigned int startvec,
  *	1) spread present CPU on these vectors
  *	2) spread other possible CPUs on these vectors
  */
-static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
+static int irq_build_affinity_masks(unsigned int numvecs,
 				    struct irq_affinity_desc *masks)
 {
-	unsigned int curvec = startvec, nr_present = 0, nr_others = 0;
-	unsigned int firstvec = startvec;
+	unsigned int curvec = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -359,9 +357,8 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 	build_node_to_cpumask(node_to_cpumask);
 
 	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
-					 nmsk, masks);
+	ret = __irq_build_affinity_masks(curvec, numvecs, node_to_cpumask,
+					 cpu_present_mask, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
 	nr_present = ret;
@@ -373,13 +370,12 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 	 * out vectors.
 	 */
 	if (nr_present >= numvecs)
-		curvec = firstvec;
+		curvec = 0;
 	else
-		curvec = firstvec + nr_present;
+		curvec = nr_present;
 	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, npresmsk, nmsk,
-					 masks);
+	ret = __irq_build_affinity_masks(curvec, numvecs, node_to_cpumask,
+					 npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
@@ -469,7 +465,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 		unsigned int this_vecs = affd->set_size[i];
 		int ret;
 
-		ret = irq_build_affinity_masks(curvec, this_vecs, masks);
+		ret = irq_build_affinity_masks(this_vecs, &masks[curvec]);
 		if (ret) {
 			kfree(masks);
 			return NULL;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/7] genirq/affinity: don't pass irq_affinity_desc array to irq_build_affinity_masks
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 1/7] genirq/affinity: remove the 'firstvec' parameter from irq_build_affinity_masks Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 2/7] genirq/affinity: pass affinity managed mask array to irq_build_affinity_masks Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 4/7] genirq/affinity: rename irq_build_affinity_masks as group_cpus_evenly Ming Lei
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

Prepare for abstracting irq_build_affinity_masks() into one public
helper for grouping all CPUs evenly into several groups. Not passing
irq_affinity_desc array to irq_build_affinity_masks, instead returning
one cpumask array by storing each assigned group into one element of
the array.

This way helps us to provide generic interface for grouping all CPUs
evenly from NUMA and CPU locality viewpoint, and the cost is one extra
allocation in irq_build_affinity_masks(), which should be fine since
it is done via GFP_KERNEL and irq_build_affinity_masks() is called very
less.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index c8da750f393b..98343a6c0aff 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -249,7 +249,7 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 				      cpumask_var_t *node_to_cpumask,
 				      const struct cpumask *cpu_mask,
 				      struct cpumask *nmsk,
-				      struct irq_affinity_desc *masks)
+				      struct cpumask *masks)
 {
 	unsigned int i, n, nodes, cpus_per_vec, extra_vecs, done = 0;
 	unsigned int last_affv = numvecs;
@@ -268,7 +268,7 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 	 */
 	if (numvecs <= nodes) {
 		for_each_node_mask(n, nodemsk) {
-			cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
+			cpumask_or(&masks[curvec], &masks[curvec],
 				   node_to_cpumask[n]);
 			if (++curvec == last_affv)
 				curvec = 0;
@@ -320,7 +320,7 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 			 */
 			if (curvec >= last_affv)
 				curvec = 0;
-			irq_spread_init_one(&masks[curvec].mask, nmsk,
+			irq_spread_init_one(&masks[curvec], nmsk,
 						cpus_per_vec);
 		}
 		done += nv->nvectors;
@@ -334,16 +334,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
  *	1) spread present CPU on these vectors
  *	2) spread other possible CPUs on these vectors
  */
-static int irq_build_affinity_masks(unsigned int numvecs,
-				    struct irq_affinity_desc *masks)
+static struct cpumask *irq_build_affinity_masks(unsigned int numvecs)
 {
 	unsigned int curvec = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
 
 	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-		return ret;
+		return NULL;
 
 	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
 		goto fail_nmsk;
@@ -352,6 +352,10 @@ static int irq_build_affinity_masks(unsigned int numvecs,
 	if (!node_to_cpumask)
 		goto fail_npresmsk;
 
+	masks = kcalloc(numvecs, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
 	/* Stabilize the cpumasks */
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
@@ -385,6 +389,7 @@ static int irq_build_affinity_masks(unsigned int numvecs,
 	if (ret >= 0)
 		WARN_ON(nr_present + nr_others < numvecs);
 
+ fail_node_to_cpumask:
 	free_node_to_cpumask(node_to_cpumask);
 
  fail_npresmsk:
@@ -392,7 +397,11 @@ static int irq_build_affinity_masks(unsigned int numvecs,
 
  fail_nmsk:
 	free_cpumask_var(nmsk);
-	return ret < 0 ? ret : 0;
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	return masks;
 }
 
 static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
@@ -463,13 +472,18 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
-		int ret;
+		int j;
+		struct cpumask *result = irq_build_affinity_masks(this_vecs);
 
-		ret = irq_build_affinity_masks(this_vecs, &masks[curvec]);
-		if (ret) {
+		if (!result) {
 			kfree(masks);
 			return NULL;
 		}
+
+		for (j = 0; j < this_vecs; j++)
+			cpumask_copy(&masks[curvec + j].mask, &result[j]);
+		kfree(result);
+
 		curvec += this_vecs;
 		usedvecs += this_vecs;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/7] genirq/affinity: rename irq_build_affinity_masks as group_cpus_evenly
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
                   ` (2 preceding siblings ...)
  2021-07-19  9:57 ` [RFC PATCH 3/7] genirq/affinity: don't pass irq_affinity_desc " Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 5/7] genirq/affinity: move group_cpus_evenly() into lib/ Ming Lei
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

Map irq vector into group, so we can abstract the algorithm for generic
use case.

Rename irq_build_affinity_masks as group_cpus_evenly, so we can reuse
the API for blk-mq to implement default queue mapping.

No functional change, just rename vector as group.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 241 +++++++++++++++++++++---------------------
 1 file changed, 121 insertions(+), 120 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 98343a6c0aff..073200964110 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -9,13 +9,13 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 
-static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
-				unsigned int cpus_per_vec)
+static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
+				unsigned int cpus_per_grp)
 {
 	const struct cpumask *siblmsk;
 	int cpu, sibl;
 
-	for ( ; cpus_per_vec > 0; ) {
+	for ( ; cpus_per_grp > 0; ) {
 		cpu = cpumask_first(nmsk);
 
 		/* Should not happen, but I'm too lazy to think about it */
@@ -24,18 +24,18 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 
 		cpumask_clear_cpu(cpu, nmsk);
 		cpumask_set_cpu(cpu, irqmsk);
-		cpus_per_vec--;
+		cpus_per_grp--;
 
 		/* If the cpu has siblings, use them first */
 		siblmsk = topology_sibling_cpumask(cpu);
-		for (sibl = -1; cpus_per_vec > 0; ) {
+		for (sibl = -1; cpus_per_grp > 0; ) {
 			sibl = cpumask_next(sibl, siblmsk);
 			if (sibl >= nr_cpu_ids)
 				break;
 			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
 				continue;
 			cpumask_set_cpu(sibl, irqmsk);
-			cpus_per_vec--;
+			cpus_per_grp--;
 		}
 	}
 }
@@ -95,48 +95,48 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
 	return nodes;
 }
 
-struct node_vectors {
+struct node_groups {
 	unsigned id;
 
 	union {
-		unsigned nvectors;
+		unsigned ngroups;
 		unsigned ncpus;
 	};
 };
 
 static int ncpus_cmp_func(const void *l, const void *r)
 {
-	const struct node_vectors *ln = l;
-	const struct node_vectors *rn = r;
+	const struct node_groups *ln = l;
+	const struct node_groups *rn = r;
 
 	return ln->ncpus - rn->ncpus;
 }
 
 /*
- * Allocate vector number for each node, so that for each node:
+ * Allocate group number for each node, so that for each node:
  *
  * 1) the allocated number is >= 1
  *
- * 2) the allocated numbver is <= active CPU number of this node
+ * 2) the allocated number is <= active CPU number of this node
  *
- * The actual allocated total vectors may be less than @numvecs when
- * active total CPU number is less than @numvecs.
+ * The actual allocated total groups may be less than @numgrps when
+ * active total CPU number is less than @numgrps.
  *
  * Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
  * for each node.
  */
-static void alloc_nodes_vectors(unsigned int numvecs,
-				cpumask_var_t *node_to_cpumask,
-				const struct cpumask *cpu_mask,
-				const nodemask_t nodemsk,
-				struct cpumask *nmsk,
-				struct node_vectors *node_vectors)
+static void alloc_nodes_groups(unsigned int numgrps,
+			       cpumask_var_t *node_to_cpumask,
+			       const struct cpumask *cpu_mask,
+			       const nodemask_t nodemsk,
+			       struct cpumask *nmsk,
+			       struct node_groups *node_groups)
 {
 	unsigned n, remaining_ncpus = 0;
 
 	for (n = 0; n < nr_node_ids; n++) {
-		node_vectors[n].id = n;
-		node_vectors[n].ncpus = UINT_MAX;
+		node_groups[n].id = n;
+		node_groups[n].ncpus = UINT_MAX;
 	}
 
 	for_each_node_mask(n, nodemsk) {
@@ -148,61 +148,61 @@ static void alloc_nodes_vectors(unsigned int numvecs,
 		if (!ncpus)
 			continue;
 		remaining_ncpus += ncpus;
-		node_vectors[n].ncpus = ncpus;
+		node_groups[n].ncpus = ncpus;
 	}
 
-	numvecs = min_t(unsigned, remaining_ncpus, numvecs);
+	numgrps = min_t(unsigned, remaining_ncpus, numgrps);
 
-	sort(node_vectors, nr_node_ids, sizeof(node_vectors[0]),
+	sort(node_groups, nr_node_ids, sizeof(node_groups[0]),
 	     ncpus_cmp_func, NULL);
 
 	/*
-	 * Allocate vectors for each node according to the ratio of this
-	 * node's nr_cpus to remaining un-assigned ncpus. 'numvecs' is
+	 * Allocate groups for each node according to the ratio of this
+	 * node's nr_cpus to remaining un-assigned ncpus. 'numgrps' is
 	 * bigger than number of active numa nodes. Always start the
 	 * allocation from the node with minimized nr_cpus.
 	 *
 	 * This way guarantees that each active node gets allocated at
-	 * least one vector, and the theory is simple: over-allocation
-	 * is only done when this node is assigned by one vector, so
-	 * other nodes will be allocated >= 1 vector, since 'numvecs' is
+	 * least one group, and the theory is simple: over-allocation
+	 * is only done when this node is assigned by one group, so
+	 * other nodes will be allocated >= 1 groups, since 'numgrps' is
 	 * bigger than number of numa nodes.
 	 *
-	 * One perfect invariant is that number of allocated vectors for
+	 * One perfect invariant is that number of allocated groups for
 	 * each node is <= CPU count of this node:
 	 *
 	 * 1) suppose there are two nodes: A and B
 	 * 	ncpu(X) is CPU count of node X
-	 * 	vecs(X) is the vector count allocated to node X via this
+	 * 	grps(X) is the group count allocated to node X via this
 	 * 	algorithm
 	 *
 	 * 	ncpu(A) <= ncpu(B)
 	 * 	ncpu(A) + ncpu(B) = N
-	 * 	vecs(A) + vecs(B) = V
+	 * 	grps(A) + grps(B) = G
 	 *
-	 * 	vecs(A) = max(1, round_down(V * ncpu(A) / N))
-	 * 	vecs(B) = V - vecs(A)
+	 * 	grps(A) = max(1, round_down(G * ncpu(A) / N))
+	 * 	grps(B) = G - grps(A)
 	 *
-	 * 	both N and V are integer, and 2 <= V <= N, suppose
-	 * 	V = N - delta, and 0 <= delta <= N - 2
+	 * 	both N and G are integer, and 2 <= G <= N, suppose
+	 * 	G = N - delta, and 0 <= delta <= N - 2
 	 *
-	 * 2) obviously vecs(A) <= ncpu(A) because:
+	 * 2) obviously grps(A) <= ncpu(A) because:
 	 *
-	 * 	if vecs(A) is 1, then vecs(A) <= ncpu(A) given
+	 * 	if grps(A) is 1, then grps(A) <= ncpu(A) given
 	 * 	ncpu(A) >= 1
 	 *
 	 * 	otherwise,
-	 * 		vecs(A) <= V * ncpu(A) / N <= ncpu(A), given V <= N
+	 * 		grps(A) <= G * ncpu(A) / N <= ncpu(A), given G <= N
 	 *
-	 * 3) prove how vecs(B) <= ncpu(B):
+	 * 3) prove how grps(B) <= ncpu(B):
 	 *
-	 * 	if round_down(V * ncpu(A) / N) == 0, vecs(B) won't be
-	 * 	over-allocated, so vecs(B) <= ncpu(B),
+	 * 	if round_down(G * ncpu(A) / N) == 0, vecs(B) won't be
+	 * 	over-allocated, so grps(B) <= ncpu(B),
 	 *
 	 * 	otherwise:
 	 *
-	 * 	vecs(A) =
-	 * 		round_down(V * ncpu(A) / N) =
+	 * 	grps(A) =
+	 * 		round_down(G * ncpu(A) / N) =
 	 * 		round_down((N - delta) * ncpu(A) / N) =
 	 * 		round_down((N * ncpu(A) - delta * ncpu(A)) / N)	 >=
 	 * 		round_down((N * ncpu(A) - delta * N) / N)	 =
@@ -210,52 +210,50 @@ static void alloc_nodes_vectors(unsigned int numvecs,
 	 *
 	 * 	then:
 	 *
-	 * 	vecs(A) - V >= ncpu(A) - delta - V
+	 * 	grps(A) - G >= ncpu(A) - delta - G
 	 * 	=>
-	 * 	V - vecs(A) <= V + delta - ncpu(A)
+	 * 	G - grps(A) <= G + delta - ncpu(A)
 	 * 	=>
-	 * 	vecs(B) <= N - ncpu(A)
+	 * 	grps(B) <= N - ncpu(A)
 	 * 	=>
-	 * 	vecs(B) <= cpu(B)
+	 * 	grps(B) <= cpu(B)
 	 *
 	 * For nodes >= 3, it can be thought as one node and another big
 	 * node given that is exactly what this algorithm is implemented,
-	 * and we always re-calculate 'remaining_ncpus' & 'numvecs', and
-	 * finally for each node X: vecs(X) <= ncpu(X).
+	 * and we always re-calculate 'remaining_ncpus' & 'numgrps', and
+	 * finally for each node X: grps(X) <= ncpu(X).
 	 *
 	 */
 	for (n = 0; n < nr_node_ids; n++) {
-		unsigned nvectors, ncpus;
+		unsigned ngroups, ncpus;
 
-		if (node_vectors[n].ncpus == UINT_MAX)
+		if (node_groups[n].ncpus == UINT_MAX)
 			continue;
 
-		WARN_ON_ONCE(numvecs == 0);
+		WARN_ON_ONCE(numgrps == 0);
 
-		ncpus = node_vectors[n].ncpus;
-		nvectors = max_t(unsigned, 1,
-				 numvecs * ncpus / remaining_ncpus);
-		WARN_ON_ONCE(nvectors > ncpus);
+		ncpus = node_groups[n].ncpus;
+		ngroups = max_t(unsigned, 1,
+				 numgrps * ncpus / remaining_ncpus);
+		WARN_ON_ONCE(ngroups > ncpus);
 
-		node_vectors[n].nvectors = nvectors;
+		node_groups[n].ngroups = ngroups;
 
 		remaining_ncpus -= ncpus;
-		numvecs -= nvectors;
+		numgrps -= ngroups;
 	}
 }
 
-static int __irq_build_affinity_masks(unsigned int startvec,
-				      unsigned int numvecs,
-				      cpumask_var_t *node_to_cpumask,
-				      const struct cpumask *cpu_mask,
-				      struct cpumask *nmsk,
-				      struct cpumask *masks)
+static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
+			       cpumask_var_t *node_to_cpumask,
+			       const struct cpumask *cpu_mask,
+			       struct cpumask *nmsk, struct cpumask *masks)
 {
-	unsigned int i, n, nodes, cpus_per_vec, extra_vecs, done = 0;
-	unsigned int last_affv = numvecs;
-	unsigned int curvec = startvec;
+	unsigned int i, n, nodes, cpus_per_grp, extra_grps, done = 0;
+	unsigned int last_grp = numgrps;
+	unsigned int curgrp = startgrp;
 	nodemask_t nodemsk = NODE_MASK_NONE;
-	struct node_vectors *node_vectors;
+	struct node_groups *node_groups;
 
 	if (!cpumask_weight(cpu_mask))
 		return 0;
@@ -264,33 +262,33 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 
 	/*
 	 * If the number of nodes in the mask is greater than or equal the
-	 * number of vectors we just spread the vectors across the nodes.
+	 * number of groups we just spread the groups across the nodes.
 	 */
-	if (numvecs <= nodes) {
+	if (numgrps <= nodes) {
 		for_each_node_mask(n, nodemsk) {
-			cpumask_or(&masks[curvec], &masks[curvec],
+			cpumask_or(&masks[curgrp], &masks[curgrp],
 				   node_to_cpumask[n]);
-			if (++curvec == last_affv)
-				curvec = 0;
+			if (++curgrp == last_grp)
+				curgrp = 0;
 		}
-		return numvecs;
+		return numgrps;
 	}
 
-	node_vectors = kcalloc(nr_node_ids,
-			       sizeof(struct node_vectors),
+	node_groups = kcalloc(nr_node_ids,
+			       sizeof(struct node_groups),
 			       GFP_KERNEL);
-	if (!node_vectors)
+	if (!node_groups)
 		return -ENOMEM;
 
-	/* allocate vector number for each node */
-	alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
-			    nodemsk, nmsk, node_vectors);
+	/* allocate group number for each node */
+	alloc_nodes_groups(numgrps, node_to_cpumask, cpu_mask,
+			   nodemsk, nmsk, node_groups);
 
 	for (i = 0; i < nr_node_ids; i++) {
 		unsigned int ncpus, v;
-		struct node_vectors *nv = &node_vectors[i];
+		struct node_groups *nv = &node_groups[i];
 
-		if (nv->nvectors == UINT_MAX)
+		if (nv->ngroups == UINT_MAX)
 			continue;
 
 		/* Get the cpus on this node which are in the mask */
@@ -299,44 +297,47 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 		if (!ncpus)
 			continue;
 
-		WARN_ON_ONCE(nv->nvectors > ncpus);
+		WARN_ON_ONCE(nv->ngroups > ncpus);
 
 		/* Account for rounding errors */
-		extra_vecs = ncpus - nv->nvectors * (ncpus / nv->nvectors);
+		extra_grps = ncpus - nv->ngroups * (ncpus / nv->ngroups);
 
-		/* Spread allocated vectors on CPUs of the current node */
-		for (v = 0; v < nv->nvectors; v++, curvec++) {
-			cpus_per_vec = ncpus / nv->nvectors;
+		/* Spread allocated groups on CPUs of the current node */
+		for (v = 0; v < nv->ngroups; v++, curgrp++) {
+			cpus_per_grp = ncpus / nv->ngroups;
 
-			/* Account for extra vectors to compensate rounding errors */
-			if (extra_vecs) {
-				cpus_per_vec++;
-				--extra_vecs;
+			/* Account for extra groups to compensate rounding errors */
+			if (extra_grps) {
+				cpus_per_grp++;
+				--extra_grps;
 			}
 
 			/*
-			 * wrapping has to be considered given 'startvec'
+			 * wrapping has to be considered given 'startgrp'
 			 * may start anywhere
 			 */
-			if (curvec >= last_affv)
-				curvec = 0;
-			irq_spread_init_one(&masks[curvec], nmsk,
-						cpus_per_vec);
+			if (curgrp >= last_grp)
+				curgrp = 0;
+			grp_spread_init_one(&masks[curgrp], nmsk,
+						cpus_per_grp);
 		}
-		done += nv->nvectors;
+		done += nv->ngroups;
 	}
-	kfree(node_vectors);
+	kfree(node_groups);
 	return done;
 }
 
 /*
- * build affinity in two stages:
- *	1) spread present CPU on these vectors
- *	2) spread other possible CPUs on these vectors
+ * build affinity in two stages for each group, and try to put close CPUs
+ * in viewpoint of CPU and NUMA locality into same group, and we run
+ * two-stage grouping:
+ *
+ *	1) allocate present CPUs on these groups evenly first
+ *	2) allocate other possible CPUs on these groups evenly
  */
-static struct cpumask *irq_build_affinity_masks(unsigned int numvecs)
+static struct cpumask *group_cpus_evenly(unsigned int numgrps)
 {
-	unsigned int curvec = 0, nr_present = 0, nr_others = 0;
+	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -352,7 +353,7 @@ static struct cpumask *irq_build_affinity_masks(unsigned int numvecs)
 	if (!node_to_cpumask)
 		goto fail_npresmsk;
 
-	masks = kcalloc(numvecs, sizeof(*masks), GFP_KERNEL);
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
 	if (!masks)
 		goto fail_node_to_cpumask;
 
@@ -360,26 +361,26 @@ static struct cpumask *irq_build_affinity_masks(unsigned int numvecs)
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
 
-	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, node_to_cpumask,
-					 cpu_present_mask, nmsk, masks);
+	/* grouping present CPUs first */
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  cpu_present_mask, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
 	nr_present = ret;
 
 	/*
-	 * Spread on non present CPUs starting from the next vector to be
-	 * handled. If the spreading of present CPUs already exhausted the
-	 * vector space, assign the non present CPUs to the already spread
-	 * out vectors.
+	 * Allocate non present CPUs starting from the next group to be
+	 * handled. If the grouping of present CPUs already exhausted the
+	 * group space, assign the non present CPUs to the already
+	 * allocated out groups.
 	 */
-	if (nr_present >= numvecs)
-		curvec = 0;
+	if (nr_present >= numgrps)
+		curgrp = 0;
 	else
-		curvec = nr_present;
+		curgrp = nr_present;
 	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
-	ret = __irq_build_affinity_masks(curvec, numvecs, node_to_cpumask,
-					 npresmsk, nmsk, masks);
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
@@ -387,7 +388,7 @@ static struct cpumask *irq_build_affinity_masks(unsigned int numvecs)
 	put_online_cpus();
 
 	if (ret >= 0)
-		WARN_ON(nr_present + nr_others < numvecs);
+		WARN_ON(nr_present + nr_others < numgrps);
 
  fail_node_to_cpumask:
 	free_node_to_cpumask(node_to_cpumask);
@@ -473,7 +474,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = irq_build_affinity_masks(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs);
 
 		if (!result) {
 			kfree(masks);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 5/7] genirq/affinity: move group_cpus_evenly() into lib/
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
                   ` (3 preceding siblings ...)
  2021-07-19  9:57 ` [RFC PATCH 4/7] genirq/affinity: rename irq_build_affinity_masks as group_cpus_evenly Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 6/7] lib/group_cpus: allow to group cpus in case of !CONFIG_SMP Ming Lei
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

group_cpus_evenly() has become one generic helper which can be used for
other subsystems, so move it into lib/.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/group_cpus.h |  14 ++
 kernel/irq/affinity.c      | 398 +----------------------------------
 lib/Makefile               |   2 +
 lib/group_cpus.c           | 413 +++++++++++++++++++++++++++++++++++++
 4 files changed, 430 insertions(+), 397 deletions(-)
 create mode 100644 include/linux/group_cpus.h
 create mode 100644 lib/group_cpus.c

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
new file mode 100644
index 000000000000..e42807ec61f6
--- /dev/null
+++ b/include/linux/group_cpus.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
+
+#ifndef __LINUX_GROUP_CPUS_H
+#define __LINUX_GROUP_CPUS_H
+#include <linux/kernel.h>
+#include <linux/cpu.h>
+
+struct cpumask *group_cpus_evenly(unsigned int numgrps);
+
+#endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 073200964110..40f55024fd44 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -7,403 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
-#include <linux/sort.h>
-
-static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
-				unsigned int cpus_per_grp)
-{
-	const struct cpumask *siblmsk;
-	int cpu, sibl;
-
-	for ( ; cpus_per_grp > 0; ) {
-		cpu = cpumask_first(nmsk);
-
-		/* Should not happen, but I'm too lazy to think about it */
-		if (cpu >= nr_cpu_ids)
-			return;
-
-		cpumask_clear_cpu(cpu, nmsk);
-		cpumask_set_cpu(cpu, irqmsk);
-		cpus_per_grp--;
-
-		/* If the cpu has siblings, use them first */
-		siblmsk = topology_sibling_cpumask(cpu);
-		for (sibl = -1; cpus_per_grp > 0; ) {
-			sibl = cpumask_next(sibl, siblmsk);
-			if (sibl >= nr_cpu_ids)
-				break;
-			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
-				continue;
-			cpumask_set_cpu(sibl, irqmsk);
-			cpus_per_grp--;
-		}
-	}
-}
-
-static cpumask_var_t *alloc_node_to_cpumask(void)
-{
-	cpumask_var_t *masks;
-	int node;
-
-	masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
-	if (!masks)
-		return NULL;
-
-	for (node = 0; node < nr_node_ids; node++) {
-		if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
-			goto out_unwind;
-	}
-
-	return masks;
-
-out_unwind:
-	while (--node >= 0)
-		free_cpumask_var(masks[node]);
-	kfree(masks);
-	return NULL;
-}
-
-static void free_node_to_cpumask(cpumask_var_t *masks)
-{
-	int node;
-
-	for (node = 0; node < nr_node_ids; node++)
-		free_cpumask_var(masks[node]);
-	kfree(masks);
-}
-
-static void build_node_to_cpumask(cpumask_var_t *masks)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
-}
-
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
-				const struct cpumask *mask, nodemask_t *nodemsk)
-{
-	int n, nodes = 0;
-
-	/* Calculate the number of nodes in the supplied affinity mask */
-	for_each_node(n) {
-		if (cpumask_intersects(mask, node_to_cpumask[n])) {
-			node_set(n, *nodemsk);
-			nodes++;
-		}
-	}
-	return nodes;
-}
-
-struct node_groups {
-	unsigned id;
-
-	union {
-		unsigned ngroups;
-		unsigned ncpus;
-	};
-};
-
-static int ncpus_cmp_func(const void *l, const void *r)
-{
-	const struct node_groups *ln = l;
-	const struct node_groups *rn = r;
-
-	return ln->ncpus - rn->ncpus;
-}
-
-/*
- * Allocate group number for each node, so that for each node:
- *
- * 1) the allocated number is >= 1
- *
- * 2) the allocated number is <= active CPU number of this node
- *
- * The actual allocated total groups may be less than @numgrps when
- * active total CPU number is less than @numgrps.
- *
- * Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
- * for each node.
- */
-static void alloc_nodes_groups(unsigned int numgrps,
-			       cpumask_var_t *node_to_cpumask,
-			       const struct cpumask *cpu_mask,
-			       const nodemask_t nodemsk,
-			       struct cpumask *nmsk,
-			       struct node_groups *node_groups)
-{
-	unsigned n, remaining_ncpus = 0;
-
-	for (n = 0; n < nr_node_ids; n++) {
-		node_groups[n].id = n;
-		node_groups[n].ncpus = UINT_MAX;
-	}
-
-	for_each_node_mask(n, nodemsk) {
-		unsigned ncpus;
-
-		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
-		ncpus = cpumask_weight(nmsk);
-
-		if (!ncpus)
-			continue;
-		remaining_ncpus += ncpus;
-		node_groups[n].ncpus = ncpus;
-	}
-
-	numgrps = min_t(unsigned, remaining_ncpus, numgrps);
-
-	sort(node_groups, nr_node_ids, sizeof(node_groups[0]),
-	     ncpus_cmp_func, NULL);
-
-	/*
-	 * Allocate groups for each node according to the ratio of this
-	 * node's nr_cpus to remaining un-assigned ncpus. 'numgrps' is
-	 * bigger than number of active numa nodes. Always start the
-	 * allocation from the node with minimized nr_cpus.
-	 *
-	 * This way guarantees that each active node gets allocated at
-	 * least one group, and the theory is simple: over-allocation
-	 * is only done when this node is assigned by one group, so
-	 * other nodes will be allocated >= 1 groups, since 'numgrps' is
-	 * bigger than number of numa nodes.
-	 *
-	 * One perfect invariant is that number of allocated groups for
-	 * each node is <= CPU count of this node:
-	 *
-	 * 1) suppose there are two nodes: A and B
-	 * 	ncpu(X) is CPU count of node X
-	 * 	grps(X) is the group count allocated to node X via this
-	 * 	algorithm
-	 *
-	 * 	ncpu(A) <= ncpu(B)
-	 * 	ncpu(A) + ncpu(B) = N
-	 * 	grps(A) + grps(B) = G
-	 *
-	 * 	grps(A) = max(1, round_down(G * ncpu(A) / N))
-	 * 	grps(B) = G - grps(A)
-	 *
-	 * 	both N and G are integer, and 2 <= G <= N, suppose
-	 * 	G = N - delta, and 0 <= delta <= N - 2
-	 *
-	 * 2) obviously grps(A) <= ncpu(A) because:
-	 *
-	 * 	if grps(A) is 1, then grps(A) <= ncpu(A) given
-	 * 	ncpu(A) >= 1
-	 *
-	 * 	otherwise,
-	 * 		grps(A) <= G * ncpu(A) / N <= ncpu(A), given G <= N
-	 *
-	 * 3) prove how grps(B) <= ncpu(B):
-	 *
-	 * 	if round_down(G * ncpu(A) / N) == 0, vecs(B) won't be
-	 * 	over-allocated, so grps(B) <= ncpu(B),
-	 *
-	 * 	otherwise:
-	 *
-	 * 	grps(A) =
-	 * 		round_down(G * ncpu(A) / N) =
-	 * 		round_down((N - delta) * ncpu(A) / N) =
-	 * 		round_down((N * ncpu(A) - delta * ncpu(A)) / N)	 >=
-	 * 		round_down((N * ncpu(A) - delta * N) / N)	 =
-	 * 		cpu(A) - delta
-	 *
-	 * 	then:
-	 *
-	 * 	grps(A) - G >= ncpu(A) - delta - G
-	 * 	=>
-	 * 	G - grps(A) <= G + delta - ncpu(A)
-	 * 	=>
-	 * 	grps(B) <= N - ncpu(A)
-	 * 	=>
-	 * 	grps(B) <= cpu(B)
-	 *
-	 * For nodes >= 3, it can be thought as one node and another big
-	 * node given that is exactly what this algorithm is implemented,
-	 * and we always re-calculate 'remaining_ncpus' & 'numgrps', and
-	 * finally for each node X: grps(X) <= ncpu(X).
-	 *
-	 */
-	for (n = 0; n < nr_node_ids; n++) {
-		unsigned ngroups, ncpus;
-
-		if (node_groups[n].ncpus == UINT_MAX)
-			continue;
-
-		WARN_ON_ONCE(numgrps == 0);
-
-		ncpus = node_groups[n].ncpus;
-		ngroups = max_t(unsigned, 1,
-				 numgrps * ncpus / remaining_ncpus);
-		WARN_ON_ONCE(ngroups > ncpus);
-
-		node_groups[n].ngroups = ngroups;
-
-		remaining_ncpus -= ncpus;
-		numgrps -= ngroups;
-	}
-}
-
-static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
-			       cpumask_var_t *node_to_cpumask,
-			       const struct cpumask *cpu_mask,
-			       struct cpumask *nmsk, struct cpumask *masks)
-{
-	unsigned int i, n, nodes, cpus_per_grp, extra_grps, done = 0;
-	unsigned int last_grp = numgrps;
-	unsigned int curgrp = startgrp;
-	nodemask_t nodemsk = NODE_MASK_NONE;
-	struct node_groups *node_groups;
-
-	if (!cpumask_weight(cpu_mask))
-		return 0;
-
-	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
-
-	/*
-	 * If the number of nodes in the mask is greater than or equal the
-	 * number of groups we just spread the groups across the nodes.
-	 */
-	if (numgrps <= nodes) {
-		for_each_node_mask(n, nodemsk) {
-			cpumask_or(&masks[curgrp], &masks[curgrp],
-				   node_to_cpumask[n]);
-			if (++curgrp == last_grp)
-				curgrp = 0;
-		}
-		return numgrps;
-	}
-
-	node_groups = kcalloc(nr_node_ids,
-			       sizeof(struct node_groups),
-			       GFP_KERNEL);
-	if (!node_groups)
-		return -ENOMEM;
-
-	/* allocate group number for each node */
-	alloc_nodes_groups(numgrps, node_to_cpumask, cpu_mask,
-			   nodemsk, nmsk, node_groups);
-
-	for (i = 0; i < nr_node_ids; i++) {
-		unsigned int ncpus, v;
-		struct node_groups *nv = &node_groups[i];
-
-		if (nv->ngroups == UINT_MAX)
-			continue;
-
-		/* Get the cpus on this node which are in the mask */
-		cpumask_and(nmsk, cpu_mask, node_to_cpumask[nv->id]);
-		ncpus = cpumask_weight(nmsk);
-		if (!ncpus)
-			continue;
-
-		WARN_ON_ONCE(nv->ngroups > ncpus);
-
-		/* Account for rounding errors */
-		extra_grps = ncpus - nv->ngroups * (ncpus / nv->ngroups);
-
-		/* Spread allocated groups on CPUs of the current node */
-		for (v = 0; v < nv->ngroups; v++, curgrp++) {
-			cpus_per_grp = ncpus / nv->ngroups;
-
-			/* Account for extra groups to compensate rounding errors */
-			if (extra_grps) {
-				cpus_per_grp++;
-				--extra_grps;
-			}
-
-			/*
-			 * wrapping has to be considered given 'startgrp'
-			 * may start anywhere
-			 */
-			if (curgrp >= last_grp)
-				curgrp = 0;
-			grp_spread_init_one(&masks[curgrp], nmsk,
-						cpus_per_grp);
-		}
-		done += nv->ngroups;
-	}
-	kfree(node_groups);
-	return done;
-}
-
-/*
- * build affinity in two stages for each group, and try to put close CPUs
- * in viewpoint of CPU and NUMA locality into same group, and we run
- * two-stage grouping:
- *
- *	1) allocate present CPUs on these groups evenly first
- *	2) allocate other possible CPUs on these groups evenly
- */
-static struct cpumask *group_cpus_evenly(unsigned int numgrps)
-{
-	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
-	cpumask_var_t *node_to_cpumask;
-	cpumask_var_t nmsk, npresmsk;
-	int ret = -ENOMEM;
-	struct cpumask *masks = NULL;
-
-	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-		return NULL;
-
-	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
-		goto fail_nmsk;
-
-	node_to_cpumask = alloc_node_to_cpumask();
-	if (!node_to_cpumask)
-		goto fail_npresmsk;
-
-	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
-	if (!masks)
-		goto fail_node_to_cpumask;
-
-	/* Stabilize the cpumasks */
-	get_online_cpus();
-	build_node_to_cpumask(node_to_cpumask);
-
-	/* grouping present CPUs first */
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  cpu_present_mask, nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
-
-	/*
-	 * Allocate non present CPUs starting from the next group to be
-	 * handled. If the grouping of present CPUs already exhausted the
-	 * group space, assign the non present CPUs to the already
-	 * allocated out groups.
-	 */
-	if (nr_present >= numgrps)
-		curgrp = 0;
-	else
-		curgrp = nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
-	if (ret >= 0)
-		nr_others = ret;
-
- fail_build_affinity:
-	put_online_cpus();
-
-	if (ret >= 0)
-		WARN_ON(nr_present + nr_others < numgrps);
-
- fail_node_to_cpumask:
-	free_node_to_cpumask(node_to_cpumask);
-
- fail_npresmsk:
-	free_cpumask_var(npresmsk);
-
- fail_nmsk:
-	free_cpumask_var(nmsk);
-	if (ret < 0) {
-		kfree(masks);
-		return NULL;
-	}
-	return masks;
-}
+#include <linux/group_cpus.h>
 
 static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
 {
diff --git a/lib/Makefile b/lib/Makefile
index 5efd1b435a37..ff1cbe4958a1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -338,6 +338,8 @@ obj-$(CONFIG_SBITMAP) += sbitmap.o
 
 obj-$(CONFIG_PARMAN) += parman.o
 
+obj-$(CONFIG_SMP) += group_cpus.o
+
 # GCC library routines
 obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o
 obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
new file mode 100644
index 000000000000..b9fd830ceec5
--- /dev/null
+++ b/lib/group_cpus.c
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/sort.h>
+
+static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
+				unsigned int cpus_per_grp)
+{
+	const struct cpumask *siblmsk;
+	int cpu, sibl;
+
+	for ( ; cpus_per_grp > 0; ) {
+		cpu = cpumask_first(nmsk);
+
+		/* Should not happen, but I'm too lazy to think about it */
+		if (cpu >= nr_cpu_ids)
+			return;
+
+		cpumask_clear_cpu(cpu, nmsk);
+		cpumask_set_cpu(cpu, irqmsk);
+		cpus_per_grp--;
+
+		/* If the cpu has siblings, use them first */
+		siblmsk = topology_sibling_cpumask(cpu);
+		for (sibl = -1; cpus_per_grp > 0; ) {
+			sibl = cpumask_next(sibl, siblmsk);
+			if (sibl >= nr_cpu_ids)
+				break;
+			if (!cpumask_test_and_clear_cpu(sibl, nmsk))
+				continue;
+			cpumask_set_cpu(sibl, irqmsk);
+			cpus_per_grp--;
+		}
+	}
+}
+
+static cpumask_var_t *alloc_node_to_cpumask(void)
+{
+	cpumask_var_t *masks;
+	int node;
+
+	masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+	if (!masks)
+		return NULL;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
+			goto out_unwind;
+	}
+
+	return masks;
+
+out_unwind:
+	while (--node >= 0)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+	return NULL;
+}
+
+static void free_node_to_cpumask(cpumask_var_t *masks)
+{
+	int node;
+
+	for (node = 0; node < nr_node_ids; node++)
+		free_cpumask_var(masks[node]);
+	kfree(masks);
+}
+
+static void build_node_to_cpumask(cpumask_var_t *masks)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
+}
+
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
+				const struct cpumask *mask, nodemask_t *nodemsk)
+{
+	int n, nodes = 0;
+
+	/* Calculate the number of nodes in the supplied affinity mask */
+	for_each_node(n) {
+		if (cpumask_intersects(mask, node_to_cpumask[n])) {
+			node_set(n, *nodemsk);
+			nodes++;
+		}
+	}
+	return nodes;
+}
+
+struct node_groups {
+	unsigned id;
+
+	union {
+		unsigned ngroups;
+		unsigned ncpus;
+	};
+};
+
+static int ncpus_cmp_func(const void *l, const void *r)
+{
+	const struct node_groups *ln = l;
+	const struct node_groups *rn = r;
+
+	return ln->ncpus - rn->ncpus;
+}
+
+/*
+ * Allocate group number for each node, so that for each node:
+ *
+ * 1) the allocated number is >= 1
+ *
+ * 2) the allocated number is <= active CPU number of this node
+ *
+ * The actual allocated total groups may be less than @numgrps when
+ * active total CPU number is less than @numgrps.
+ *
+ * Active CPUs means the CPUs in '@cpu_mask AND @node_to_cpumask[]'
+ * for each node.
+ */
+static void alloc_nodes_groups(unsigned int numgrps,
+			       cpumask_var_t *node_to_cpumask,
+			       const struct cpumask *cpu_mask,
+			       const nodemask_t nodemsk,
+			       struct cpumask *nmsk,
+			       struct node_groups *node_groups)
+{
+	unsigned n, remaining_ncpus = 0;
+
+	for (n = 0; n < nr_node_ids; n++) {
+		node_groups[n].id = n;
+		node_groups[n].ncpus = UINT_MAX;
+	}
+
+	for_each_node_mask(n, nodemsk) {
+		unsigned ncpus;
+
+		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
+		ncpus = cpumask_weight(nmsk);
+
+		if (!ncpus)
+			continue;
+		remaining_ncpus += ncpus;
+		node_groups[n].ncpus = ncpus;
+	}
+
+	numgrps = min_t(unsigned, remaining_ncpus, numgrps);
+
+	sort(node_groups, nr_node_ids, sizeof(node_groups[0]),
+	     ncpus_cmp_func, NULL);
+
+	/*
+	 * Allocate groups for each node according to the ratio of this
+	 * node's nr_cpus to remaining un-assigned ncpus. 'numgrps' is
+	 * bigger than number of active numa nodes. Always start the
+	 * allocation from the node with minimized nr_cpus.
+	 *
+	 * This way guarantees that each active node gets allocated at
+	 * least one group, and the theory is simple: over-allocation
+	 * is only done when this node is assigned by one group, so
+	 * other nodes will be allocated >= 1 groups, since 'numgrps' is
+	 * bigger than number of numa nodes.
+	 *
+	 * One perfect invariant is that number of allocated groups for
+	 * each node is <= CPU count of this node:
+	 *
+	 * 1) suppose there are two nodes: A and B
+	 * 	ncpu(X) is CPU count of node X
+	 * 	grps(X) is the group count allocated to node X via this
+	 * 	algorithm
+	 *
+	 * 	ncpu(A) <= ncpu(B)
+	 * 	ncpu(A) + ncpu(B) = N
+	 * 	grps(A) + grps(B) = G
+	 *
+	 * 	grps(A) = max(1, round_down(G * ncpu(A) / N))
+	 * 	grps(B) = G - grps(A)
+	 *
+	 * 	both N and G are integer, and 2 <= G <= N, suppose
+	 * 	G = N - delta, and 0 <= delta <= N - 2
+	 *
+	 * 2) obviously grps(A) <= ncpu(A) because:
+	 *
+	 * 	if grps(A) is 1, then grps(A) <= ncpu(A) given
+	 * 	ncpu(A) >= 1
+	 *
+	 * 	otherwise,
+	 * 		grps(A) <= G * ncpu(A) / N <= ncpu(A), given G <= N
+	 *
+	 * 3) prove how grps(B) <= ncpu(B):
+	 *
+	 * 	if round_down(G * ncpu(A) / N) == 0, vecs(B) won't be
+	 * 	over-allocated, so grps(B) <= ncpu(B),
+	 *
+	 * 	otherwise:
+	 *
+	 * 	grps(A) =
+	 * 		round_down(G * ncpu(A) / N) =
+	 * 		round_down((N - delta) * ncpu(A) / N) =
+	 * 		round_down((N * ncpu(A) - delta * ncpu(A)) / N)	 >=
+	 * 		round_down((N * ncpu(A) - delta * N) / N)	 =
+	 * 		cpu(A) - delta
+	 *
+	 * 	then:
+	 *
+	 * 	grps(A) - G >= ncpu(A) - delta - G
+	 * 	=>
+	 * 	G - grps(A) <= G + delta - ncpu(A)
+	 * 	=>
+	 * 	grps(B) <= N - ncpu(A)
+	 * 	=>
+	 * 	grps(B) <= cpu(B)
+	 *
+	 * For nodes >= 3, it can be thought as one node and another big
+	 * node given that is exactly what this algorithm is implemented,
+	 * and we always re-calculate 'remaining_ncpus' & 'numgrps', and
+	 * finally for each node X: grps(X) <= ncpu(X).
+	 *
+	 */
+	for (n = 0; n < nr_node_ids; n++) {
+		unsigned ngroups, ncpus;
+
+		if (node_groups[n].ncpus == UINT_MAX)
+			continue;
+
+		WARN_ON_ONCE(numgrps == 0);
+
+		ncpus = node_groups[n].ncpus;
+		ngroups = max_t(unsigned, 1,
+				 numgrps * ncpus / remaining_ncpus);
+		WARN_ON_ONCE(ngroups > ncpus);
+
+		node_groups[n].ngroups = ngroups;
+
+		remaining_ncpus -= ncpus;
+		numgrps -= ngroups;
+	}
+}
+
+static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
+			       cpumask_var_t *node_to_cpumask,
+			       const struct cpumask *cpu_mask,
+			       struct cpumask *nmsk, struct cpumask *masks)
+{
+	unsigned int i, n, nodes, cpus_per_grp, extra_grps, done = 0;
+	unsigned int last_grp = numgrps;
+	unsigned int curgrp = startgrp;
+	nodemask_t nodemsk = NODE_MASK_NONE;
+	struct node_groups *node_groups;
+
+	if (!cpumask_weight(cpu_mask))
+		return 0;
+
+	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
+
+	/*
+	 * If the number of nodes in the mask is greater than or equal the
+	 * number of groups we just spread the groups across the nodes.
+	 */
+	if (numgrps <= nodes) {
+		for_each_node_mask(n, nodemsk) {
+			cpumask_or(&masks[curgrp], &masks[curgrp],
+				   node_to_cpumask[n]);
+			if (++curgrp == last_grp)
+				curgrp = 0;
+		}
+		return numgrps;
+	}
+
+	node_groups = kcalloc(nr_node_ids,
+			       sizeof(struct node_groups),
+			       GFP_KERNEL);
+	if (!node_groups)
+		return -ENOMEM;
+
+	/* allocate group number for each node */
+	alloc_nodes_groups(numgrps, node_to_cpumask, cpu_mask,
+			   nodemsk, nmsk, node_groups);
+
+	for (i = 0; i < nr_node_ids; i++) {
+		unsigned int ncpus, v;
+		struct node_groups *nv = &node_groups[i];
+
+		if (nv->ngroups == UINT_MAX)
+			continue;
+
+		/* Get the cpus on this node which are in the mask */
+		cpumask_and(nmsk, cpu_mask, node_to_cpumask[nv->id]);
+		ncpus = cpumask_weight(nmsk);
+		if (!ncpus)
+			continue;
+
+		WARN_ON_ONCE(nv->ngroups > ncpus);
+
+		/* Account for rounding errors */
+		extra_grps = ncpus - nv->ngroups * (ncpus / nv->ngroups);
+
+		/* Spread allocated groups on CPUs of the current node */
+		for (v = 0; v < nv->ngroups; v++, curgrp++) {
+			cpus_per_grp = ncpus / nv->ngroups;
+
+			/* Account for extra groups to compensate rounding errors */
+			if (extra_grps) {
+				cpus_per_grp++;
+				--extra_grps;
+			}
+
+			/*
+			 * wrapping has to be considered given 'startgrp'
+			 * may start anywhere
+			 */
+			if (curgrp >= last_grp)
+				curgrp = 0;
+			grp_spread_init_one(&masks[curgrp], nmsk,
+						cpus_per_grp);
+		}
+		done += nv->ngroups;
+	}
+	kfree(node_groups);
+	return done;
+}
+
+/**
+ * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ *
+ * Return: cpumask array if successful, NULL otherwise. And each element
+ * includes CPUs assigned to this group
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group, and run two-stage grouping:
+ *	1) allocate present CPUs on these groups evenly first
+ *	2) allocate other possible CPUs on these groups evenly
+ *
+ * We guarantee in the resulted grouping that all CPUs are covered, and
+ * no same CPU is assigned to different groups
+ */
+struct cpumask *group_cpus_evenly(unsigned int numgrps)
+{
+	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk, npresmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+		goto fail_nmsk;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_npresmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	/* Stabilize the cpumasks */
+	get_online_cpus();
+	build_node_to_cpumask(node_to_cpumask);
+
+	/* grouping present CPUs first */
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  cpu_present_mask, nmsk, masks);
+	if (ret < 0)
+		goto fail_build_affinity;
+	nr_present = ret;
+
+	/*
+	 * Allocate non present CPUs starting from the next group to be
+	 * handled. If the grouping of present CPUs already exhausted the
+	 * group space, assign the non present CPUs to the already
+	 * allocated out groups.
+	 */
+	if (nr_present >= numgrps)
+		curgrp = 0;
+	else
+		curgrp = nr_present;
+	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  npresmsk, nmsk, masks);
+	if (ret >= 0)
+		nr_others = ret;
+
+ fail_build_affinity:
+	put_online_cpus();
+
+	if (ret >= 0)
+		WARN_ON(nr_present + nr_others < numgrps);
+
+ fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+ fail_npresmsk:
+	free_cpumask_var(npresmsk);
+
+ fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	return masks;
+}
+EXPORT_SYMBOL_GPL(group_cpus_evenly);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 6/7] lib/group_cpus: allow to group cpus in case of !CONFIG_SMP
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
                   ` (4 preceding siblings ...)
  2021-07-19  9:57 ` [RFC PATCH 5/7] genirq/affinity: move group_cpus_evenly() into lib/ Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-19  9:57 ` [RFC PATCH 7/7] blk-mq: build default queue map via group_cpus_evenly() Ming Lei
  2021-07-26 10:37 ` [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

Allows group_cpus_evenly() to be called in case of !CONFIG_SMP by simply
assigning all CPUs into the 1st group.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/group_cpus.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6..79e5cc15bd96 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,20 @@
 #include <linux/kernel.h>
 #include <linux/cpu.h>
 
+#ifdef CONFIG_SMP
 struct cpumask *group_cpus_evenly(unsigned int numgrps);
+#else
+static inline struct cpumask *group_cpus_evenly(unsigned int numgrps)
+{
+	struct cpumask *masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+
+	if (!masks)
+		return NULL;
+
+	/* assign all CPUs(cpu 0) to the 1st group only */
+	cpumask_copy(&masks[0], cpu_possible_mask);
+	return masks;
+}
+#endif
 
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 7/7] blk-mq: build default queue map via group_cpus_evenly()
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
                   ` (5 preceding siblings ...)
  2021-07-19  9:57 ` [RFC PATCH 6/7] lib/group_cpus: allow to group cpus in case of !CONFIG_SMP Ming Lei
@ 2021-07-19  9:57 ` Ming Lei
  2021-07-26 10:37 ` [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-19  9:57 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: linux-kernel, linux-block, Christoph Hellwig, Ming Lei

The default queue mapping builder of blk_mq_map_queues doesn't take NUMA
topo into account, so the built mapping is pretty bad, since CPUs
belonging to different NUMA node are assigned to same queue. It is
observed that IOPS drops by ~30% when running two jobs on same hctx
of null_blk from two CPUs belonging to two NUMA nodes compared with
from same NUMA node.

Address the issue by reusing group_cpus_evenly() for addressing the
issue since group_cpus_evenly() does group cpus according to CPU/NUMA
locality.

Lots of drivers may benefit from the change, such as nvme pci poll,
nvme tcp, ...

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-cpumap.c | 64 +++++++++----------------------------------
 1 file changed, 13 insertions(+), 51 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3db84d3197f1..5f183f52626c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -10,67 +10,29 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/group_cpus.h>
 
 #include <linux/blk-mq.h>
 #include "blk.h"
 #include "blk-mq.h"
 
-static int queue_index(struct blk_mq_queue_map *qmap,
-		       unsigned int nr_queues, const int q)
-{
-	return qmap->queue_offset + (q % nr_queues);
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
-	unsigned int *map = qmap->mq_map;
-	unsigned int nr_queues = qmap->nr_queues;
-	unsigned int cpu, first_sibling, q = 0;
-
-	for_each_possible_cpu(cpu)
-		map[cpu] = -1;
+	const struct cpumask *masks;
+	unsigned int queue, cpu;
 
-	/*
-	 * Spread queues among present CPUs first for minimizing
-	 * count of dead queues which are mapped by all un-present CPUs
-	 */
-	for_each_present_cpu(cpu) {
-		if (q >= nr_queues)
-			break;
-		map[cpu] = queue_index(qmap, nr_queues, q++);
-	}
+	masks = group_cpus_evenly(qmap->nr_queues);
+	if (!masks)
+		goto fallback;
 
-	for_each_possible_cpu(cpu) {
-		if (map[cpu] != -1)
-			continue;
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue
-		 * for performance optimizations.
-		 */
-		if (q < nr_queues) {
-			map[cpu] = queue_index(qmap, nr_queues, q++);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = queue_index(qmap, nr_queues, q++);
-			else
-				map[cpu] = map[first_sibling];
-		}
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		for_each_cpu(cpu, &masks[queue])
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
-
+	return 0;
+ fallback:
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = qmap->queue_offset;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread
  2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
                   ` (6 preceding siblings ...)
  2021-07-19  9:57 ` [RFC PATCH 7/7] blk-mq: build default queue map via group_cpus_evenly() Ming Lei
@ 2021-07-26 10:37 ` Ming Lei
  7 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2021-07-26 10:37 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe; +Cc: linux-kernel, linux-block, Christoph Hellwig

Hello Guys,

On Mon, Jul 19, 2021 at 05:57:22PM +0800, Ming Lei wrote:
> Hello,
> 
> irq_build_affinity_masks() actually grouping CPUs evenly into each managed
> irq vector according to NUMA and CPU locality, and it is reasonable to abstract
> one generic API for grouping CPUs evenly, the idea is suggested by Thomas
> Gleixner.
> 
> group_cpus_evenly() is abstracted and put into lib/, so blk-mq can re-use
> it to build default queue mapping.
> 
> Please comments!

Any comments on this patchset?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-07-26 10:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19  9:57 [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei
2021-07-19  9:57 ` [RFC PATCH 1/7] genirq/affinity: remove the 'firstvec' parameter from irq_build_affinity_masks Ming Lei
2021-07-19  9:57 ` [RFC PATCH 2/7] genirq/affinity: pass affinity managed mask array to irq_build_affinity_masks Ming Lei
2021-07-19  9:57 ` [RFC PATCH 3/7] genirq/affinity: don't pass irq_affinity_desc " Ming Lei
2021-07-19  9:57 ` [RFC PATCH 4/7] genirq/affinity: rename irq_build_affinity_masks as group_cpus_evenly Ming Lei
2021-07-19  9:57 ` [RFC PATCH 5/7] genirq/affinity: move group_cpus_evenly() into lib/ Ming Lei
2021-07-19  9:57 ` [RFC PATCH 6/7] lib/group_cpus: allow to group cpus in case of !CONFIG_SMP Ming Lei
2021-07-19  9:57 ` [RFC PATCH 7/7] blk-mq: build default queue map via group_cpus_evenly() Ming Lei
2021-07-26 10:37 ` [RFC PATCH 0/7] genirq/affinity: abstract new API from managed irq affinity spread Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).