linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 0/3] genirq/affinity: Make vector allocation fair
@ 2019-08-12  9:57 Ming Lei
  2019-08-12  9:57 ` [PATCH V2 1/3] genirq/affinity: Improve __irq_build_affinity_masks() Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ming Lei @ 2019-08-12  9:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Jens Axboe, Christoph Hellwig,
	Keith Busch, linux-nvme, Jon Derrick

Hi,

The 1st patch makes __irq_build_affinity_masks() more reliable, e.g.,
all nodes can be covered in the spread.

The 2nd patch spreads vectors across nodes according to the ratio of
each node's CPU count to the number of all remaining CPUs, so the
vector assignment becomes fairer. Meanwhile, the warning reported by
Jon Derrick is fixed.

The 3rd patch enhances one warning check.

Please review & comment!


V2:
	- add patch 3
	- start allocating vectors from the node with the smallest CPU
	  count, so every node is guaranteed to be allocated at least
	  one vector.
	- avoid cross-node spread

Ming Lei (3):
  genirq/affinity: Improve __irq_build_affinity_masks()
  genirq/affinity: Spread vectors on node according to nr_cpu ratio
  genirq/affinity: Enhance warning check

 kernel/irq/affinity.c | 140 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 121 insertions(+), 19 deletions(-)

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: Jon Derrick <jonathan.derrick@intel.com>
-- 
2.20.1



* [PATCH V2 1/3] genirq/affinity: Improve __irq_build_affinity_masks()
  2019-08-12  9:57 [PATCH V2 0/3] genirq/affinity: Make vector allocation fair Ming Lei
@ 2019-08-12  9:57 ` Ming Lei
  2019-08-12  9:57 ` [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio Ming Lei
  2019-08-12  9:57 ` [PATCH V2 3/3] genirq/affinity: Enhance warning check Ming Lei
  2 siblings, 0 replies; 8+ messages in thread
From: Ming Lei @ 2019-08-12  9:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Christoph Hellwig, Keith Busch,
	linux-nvme, Jon Derrick, Jens Axboe

One invariant of __irq_build_affinity_masks() is that all CPUs in the
specified masks (cpu_mask AND node_to_cpumask for each node) should be
covered during the spread. Even after all requested vectors have been
allocated, we still need to spread vectors among the remaining CPUs. A
similar policy is already applied in the 'numvecs <= nodes' case.

So remove the following check inside the loop:

	if (done >= numvecs)
		break;

Meanwhile, assign at least one vector to each remaining node once
'numvecs' vectors have already been handled.

Also, if the specified cpumask for a numa node is empty, simply do not
spread vectors on that node.
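
In other words (an illustrative sketch of the effect): if the nodes
handled first consume all 'numvecs' vectors, the old 'done >= numvecs'
break left the CPUs of the remaining nodes out of every mask. With
this change, each remaining node is still assigned one vector, with
'curvec' wrapping back to 'firstvec' so that already-allocated vectors
are reused, and the returned count is clamped to 'numvecs'.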

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6fef48033f96..c7cca942bd8a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -129,14 +129,26 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 	for_each_node_mask(n, nodemsk) {
 		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
 
-		/* Spread the vectors per node */
-		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
-
 		/* Get the cpus on this node which are in the mask */
 		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
-
-		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
+		if (!ncpus)
+			continue;
+
+		/*
+		 * Calculate the number of cpus per vector
+		 *
+		 * Spread the vectors evenly per node. If the requested
+		 * vector number has been reached, simply allocate one
+		 * vector for each remaining node so that all nodes can
+		 * be covered
+		 */
+		if (numvecs > done)
+			vecs_per_node = max_t(unsigned,
+					(numvecs - done) / nodes, 1);
+		else
+			vecs_per_node = 1;
+
 		vecs_to_assign = min(vecs_per_node, ncpus);
 
 		/* Account for rounding errors */
@@ -156,13 +168,11 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 		}
 
 		done += v;
-		if (done >= numvecs)
-			break;
 		if (curvec >= last_affv)
 			curvec = firstvec;
 		--nodes;
 	}
-	return done;
+	return done < numvecs ? done : numvecs;
 }
 
 /*
-- 
2.20.1



* [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio
  2019-08-12  9:57 [PATCH V2 0/3] genirq/affinity: Make vector allocation fair Ming Lei
  2019-08-12  9:57 ` [PATCH V2 1/3] genirq/affinity: Improve __irq_build_affinity_masks() Ming Lei
@ 2019-08-12  9:57 ` Ming Lei
  2019-08-12 15:27   ` Keith Busch
  2019-08-12  9:57 ` [PATCH V2 3/3] genirq/affinity: Enhance warning check Ming Lei
  2 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2019-08-12  9:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Christoph Hellwig, Keith Busch,
	linux-nvme, Jon Derrick, Jens Axboe

Currently __irq_build_affinity_masks() spreads vectors evenly per
node, but when numa nodes have different CPU counts, not all vectors
may get spread, and then the following warning in
irq_build_affinity_masks() can be triggered:

	if (nr_present < numvecs)
		WARN_ON(nr_present + nr_others < numvecs);

Improve the current spreading algorithm by assigning vectors according
to the ratio of each node's nr_cpu to nr_remaining_cpus, and by
running the assignment from smaller nodes to bigger nodes, so that
every active node is guaranteed to be allocated at least one vector
and cross-node spread can be avoided.

This also fixes the reported warning.

Another big benefit is that the spread becomes fairer when nodes have
different CPU counts.

For example, on the following machine:
	[root@ktest-01 ~]# lscpu
	...
	CPU(s):              16
	On-line CPU(s) list: 0-15
	Thread(s) per core:  1
	Core(s) per socket:  8
	Socket(s):           2
	NUMA node(s):        2
	...
	NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
	NUMA node1 CPU(s):   2,4,10,12

When a driver requests 8 vectors, the following spread results:
	irq 31, cpu list 2,4
	irq 32, cpu list 10,12
	irq 33, cpu list 0-1
	irq 34, cpu list 3,5
	irq 35, cpu list 6-7
	irq 36, cpu list 8-9
	irq 37, cpu list 11,13
	irq 38, cpu list 14-15

Without this patch, the kernel warning is triggered in the above
situation, and the allocation result would have been 4 vectors for
each node.
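
As a walk-through of the ratio computation on this machine: node1 has
4 CPUs and node0 has 12, with numvecs = 8 and remaining_ncpus = 16.
The smaller node1 is handled first and gets max(1, 8 * 4 / 16) = 2
vectors; then remaining_ncpus becomes 12 and numvecs becomes 6, so
node0 gets max(1, 6 * 12 / 12) = 6 vectors. This matches the irq list
above: 2 vectors on node1 (irq 31-32) and 6 on node0 (irq 33-38).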

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 141 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 117 insertions(+), 24 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index c7cca942bd8a..927dcbe80482 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/sort.h>
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_vec)
@@ -94,6 +95,87 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
 	return nodes;
 }
 
+struct node_nr_vectors {
+	unsigned n;
+
+	union {
+		unsigned nvectors;
+		unsigned ncpus;
+	};
+};
+
+static int ncpus_cmp_func(const void *l, const void *r)
+{
+	const struct node_nr_vectors *ln = l;
+	const struct node_nr_vectors *rn = r;
+
+	if (ln->ncpus < rn->ncpus)
+		return -1;
+	if (ln->ncpus > rn->ncpus)
+		return 1;
+	return 0;
+}
+
+static void alloc_nodes_vectors(unsigned int numvecs,
+				const cpumask_var_t *node_to_cpumask,
+				const struct cpumask *cpu_mask,
+				const nodemask_t nodemsk,
+				struct cpumask *nmsk,
+				struct node_nr_vectors *node_vectors)
+{
+	unsigned remaining_ncpus = 0;
+	unsigned n;
+
+	for (n = 0; n < nr_node_ids; n++) {
+		node_vectors[n].n = n;
+		node_vectors[n].ncpus = UINT_MAX;
+	}
+
+	for_each_node_mask(n, nodemsk) {
+		unsigned ncpus;
+
+		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
+		ncpus = cpumask_weight(nmsk);
+
+		if (!ncpus)
+			continue;
+		remaining_ncpus += ncpus;
+		node_vectors[n].ncpus = ncpus;
+	}
+
+	sort(node_vectors, nr_node_ids, sizeof(node_vectors[0]),
+	     ncpus_cmp_func, NULL);
+
+	/*
+	 * Allocate vectors for each node according to the ratio of this
+	 * node's nr_cpus to the remaining un-assigned ncpus. 'numvecs'
+	 * is bigger than the number of active numa nodes. Always start
+	 * the allocation from the node with the smallest nr_cpus.
+	 *
+	 * This approach guarantees that each active node is allocated
+	 * at least one vector, and the reasoning is simple:
+	 * over-allocation only happens when a node is assigned a
+	 * single vector, so every other node is still allocated >= 1
+	 * vector, since 'numvecs' is bigger than the number of numa
+	 * nodes.
+	 */
+	for (n = 0; n < nr_node_ids; n++) {
+		unsigned nvectors, ncpus;
+
+		if (node_vectors[n].ncpus == UINT_MAX)
+			continue;
+
+		WARN_ON_ONCE(numvecs == 0);
+
+		ncpus = node_vectors[n].ncpus;
+		nvectors = max_t(unsigned, 1,
+				 numvecs * ncpus / remaining_ncpus);
+
+		node_vectors[n].nvectors = nvectors;
+		remaining_ncpus -= ncpus;
+		numvecs -= nvectors;
+	}
+}
+
 static int __irq_build_affinity_masks(unsigned int startvec,
 				      unsigned int numvecs,
 				      unsigned int firstvec,
@@ -102,10 +184,11 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 				      struct cpumask *nmsk,
 				      struct irq_affinity_desc *masks)
 {
-	unsigned int n, nodes, cpus_per_vec, extra_vecs, done = 0;
+	unsigned int i, n, nodes, cpus_per_vec, extra_vecs, done = 0;
 	unsigned int last_affv = firstvec + numvecs;
 	unsigned int curvec = startvec;
 	nodemask_t nodemsk = NODE_MASK_NONE;
+	struct node_nr_vectors *node_vectors;
 
 	if (!cpumask_weight(cpu_mask))
 		return 0;
@@ -126,8 +209,23 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 		return numvecs;
 	}
 
-	for_each_node_mask(n, nodemsk) {
-		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
+	node_vectors = kcalloc(nr_node_ids,
+			       sizeof(struct node_nr_vectors),
+			       GFP_KERNEL);
+	if (!node_vectors)
+		return 0;
+
+	alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
+			    nodemsk, nmsk, node_vectors);
+
+	for (i = 0; i < nr_node_ids; i++) {
+		unsigned int ncpus, v, vecs_to_assign;
+		struct node_nr_vectors *nv = &node_vectors[i];
+
+		if (nv->nvectors == UINT_MAX)
+			continue;
+
+		n = nv->n;
 
 		/* Get the cpus on this node which are in the mask */
 		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
@@ -135,27 +233,14 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 		if (!ncpus)
 			continue;
 
-		/*
-		 * Calculate the number of cpus per vector
-		 *
-		 * Spread the vectors evenly per node. If the requested
-		 * vector number has been reached, simply allocate one
-		 * vector for each remaining node so that all nodes can
-		 * be covered
-		 */
-		if (numvecs > done)
-			vecs_per_node = max_t(unsigned,
-					(numvecs - done) / nodes, 1);
-		else
-			vecs_per_node = 1;
-
-		vecs_to_assign = min(vecs_per_node, ncpus);
+		WARN_ON_ONCE(nv->nvectors == UINT_MAX);
+
+		vecs_to_assign = min(nv->nvectors, ncpus);
 
 		/* Account for rounding errors */
 		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
 
-		for (v = 0; curvec < last_affv && v < vecs_to_assign;
-		     curvec++, v++) {
+		for (v = 0; v < vecs_to_assign; v++, curvec++) {
 			cpus_per_vec = ncpus / vecs_to_assign;
 
 			/* Account for extra vectors to compensate rounding errors */
@@ -165,13 +250,21 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 			}
 			irq_spread_init_one(&masks[curvec].mask, nmsk,
 						cpus_per_vec);
+			/*
+			 * alloc_nodes_vectors() is intelligent enough to
+			 * allocate vectors on all nodes, so wrapping
+			 * shouldn't usually be triggered. However, if it
+			 * happens because the allocated vectors outnumber
+			 * this node's CPUs due to round-down, wrap to the
+			 * first vector allocated for this node so that
+			 * cross-node spread is avoided.
+			 */
+			if (curvec >= last_affv)
+				curvec -= v;
 		}
-
 		done += v;
-		if (curvec >= last_affv)
-			curvec = firstvec;
-		--nodes;
 	}
+	kfree(node_vectors);
 	return done < numvecs ? done : numvecs;
 }
 
-- 
2.20.1



* [PATCH V2 3/3] genirq/affinity: Enhance warning check
  2019-08-12  9:57 [PATCH V2 0/3] genirq/affinity: Make vector allocation fair Ming Lei
  2019-08-12  9:57 ` [PATCH V2 1/3] genirq/affinity: Improve __irq_build_affinity_masks() Ming Lei
  2019-08-12  9:57 ` [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio Ming Lei
@ 2019-08-12  9:57 ` Ming Lei
  2 siblings, 0 replies; 8+ messages in thread
From: Ming Lei @ 2019-08-12  9:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Christoph Hellwig, Keith Busch,
	linux-nvme, Jon Derrick, Jens Axboe

The two-stage spread is done on the same irq vectors, and we need
either single stage to cover all vectors on its own; it is not enough
for the two stages to cover all vectors only in combination.

So enhance the warning check to make sure that at least one stage
spreads all vectors.
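
For example (hypothetical numbers): with numvecs == 8, nr_present == 5
and nr_others == 3, the old check passes because 5 + 3 == 8, even
though neither stage spread all 8 vectors on its own; the new check,
WARN_ON(max(5, 3) < 8), fires as expected.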

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Fixes: 6da4b3ab9a6 ("genirq/affinity: Add support for allocating interrupt sets")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/affinity.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 927dcbe80482..178dc7eb7b35 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -318,8 +318,7 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 					       npresmsk, nmsk, masks);
 	put_online_cpus();
 
-	if (nr_present < numvecs)
-		WARN_ON(nr_present + nr_others < numvecs);
+	WARN_ON(max(nr_present, nr_others) < numvecs);
 
 	free_node_to_cpumask(node_to_cpumask);
 
-- 
2.20.1



* Re: [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio
  2019-08-12  9:57 ` [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio Ming Lei
@ 2019-08-12 15:27   ` Keith Busch
  2019-08-13  7:41     ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2019-08-12 15:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, Jens Axboe, linux-kernel, linux-nvme,
	Christoph Hellwig, Jon Derrick

On Mon, Aug 12, 2019 at 05:57:08PM +0800, Ming Lei wrote:

[...]

> +static int ncpus_cmp_func(const void *l, const void *r)
> +{
> +	const struct node_nr_vectors *ln = l;
> +	const struct node_nr_vectors *rn = r;
> +
> +	if (ln->ncpus < rn->ncpus)
> +		return -1;
> +	if (ln->ncpus > rn->ncpus)
> +		return 1;
> +	return 0;

You can collapse these to one line:

	return ln->ncpus - rn->ncpus;

> +}

[...]

> +	sort(node_vectors, nr_node_ids, sizeof(node_vectors[0]),
> +	     ncpus_cmp_func, NULL);
> +
> +	/*
> +	 * Allocate vectors for each node according to the ratio of this
> +	 * node's nr_cpus to the remaining un-assigned ncpus. 'numvecs'
> +	 * is bigger than the number of active numa nodes. Always start
> +	 * the allocation from the node with the smallest nr_cpus.
> +	 *
> +	 * This approach guarantees that each active node is allocated
> +	 * at least one vector, and the reasoning is simple:
> +	 * over-allocation only happens when a node is assigned a
> +	 * single vector, so every other node is still allocated >= 1
> +	 * vector, since 'numvecs' is bigger than the number of numa
> +	 * nodes.
> +	 */
> +	for (n = 0; n < nr_node_ids; n++) {
> +		unsigned nvectors, ncpus;
> +
> +		if (node_vectors[n].ncpus == UINT_MAX)
> +			continue;
> +
> +		WARN_ON_ONCE(numvecs == 0);
> +
> +		ncpus = node_vectors[n].ncpus;
> +		nvectors = max_t(unsigned, 1,
> +				 numvecs * ncpus / remaining_ncpus);
> +
> +		node_vectors[n].nvectors = nvectors;
> +		remaining_ncpus -= ncpus;
> +		numvecs -= nvectors;
> +	}

This looks good to me.


[...]

> -	for_each_node_mask(n, nodemsk) {
> -		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
> +	node_vectors = kcalloc(nr_node_ids,
> +			       sizeof(struct node_nr_vectors),
> +			       GFP_KERNEL);
> +	if (!node_vectors)
> +		return 0;

I think we need to get this -ENOMEM condition back to the caller and
have that condition handled.

> @@ -165,13 +250,21 @@ static int __irq_build_affinity_masks(unsigned int startvec,
>  			}
>  			irq_spread_init_one(&masks[curvec].mask, nmsk,
>  						cpus_per_vec);
> +			/*
> +			 * alloc_nodes_vectors() is intelligent enough to
> +			 * allocate vectors on all nodes, so wrapping
> +			 * shouldn't usually be triggered. However, if it
> +			 * happens because the allocated vectors outnumber
> +			 * this node's CPUs due to round-down, wrap to the
> +			 * first vector allocated for this node so that
> +			 * cross-node spread is avoided.
> +			 */
> +			if (curvec >= last_affv)
> +				curvec -= v;

Could you explain again how this could happen? The round-down should mean we
apply a vector to more CPUs so that the number of vectors applied to a
node within the loop should never require wrapping to hit all those CPUs.
And if that's true, the check should probably be a warn because it
should never happen.

In any case, if you can hit that condition where curvec >= last_affv,
the assignment to masks[curvec] just above may be out-of-bounds.


* Re: [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio
  2019-08-12 15:27   ` Keith Busch
@ 2019-08-13  7:41     ` Ming Lei
  2019-08-13  9:26       ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2019-08-13  7:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-kernel, linux-nvme, Thomas Gleixner,
	Christoph Hellwig, Jon Derrick

On Mon, Aug 12, 2019 at 09:27:18AM -0600, Keith Busch wrote:
> On Mon, Aug 12, 2019 at 05:57:08PM +0800, Ming Lei wrote:

[...]

> > +static int ncpus_cmp_func(const void *l, const void *r)
> > +{
> > +	const struct node_nr_vectors *ln = l;
> > +	const struct node_nr_vectors *rn = r;
> > +
> > +	if (ln->ncpus < rn->ncpus)
> > +		return -1;
> > +	if (ln->ncpus > rn->ncpus)
> > +		return 1;
> > +	return 0;
> 
> You can collapse these to one line:
> 
> 	return ln->ncpus - rn->ncpus;

OK.

[...]

> > +	/*
> > +	 * Allocate vectors for each node according to the ratio of this
> > +	 * node's nr_cpus to the remaining un-assigned ncpus. 'numvecs'
> > +	 * is bigger than the number of active numa nodes. Always start
> > +	 * the allocation from the node with the smallest nr_cpus.
> > +	 *
> > +	 * This approach guarantees that each active node is allocated
> > +	 * at least one vector, and the reasoning is simple:
> > +	 * over-allocation only happens when a node is assigned a
> > +	 * single vector, so every other node is still allocated >= 1
> > +	 * vector, since 'numvecs' is bigger than the number of numa
> > +	 * nodes.
> > +	 */
> > +	for (n = 0; n < nr_node_ids; n++) {
> > +		unsigned nvectors, ncpus;
> > +
> > +		if (node_vectors[n].ncpus == UINT_MAX)
> > +			continue;
> > +
> > +		WARN_ON_ONCE(numvecs == 0);
> > +
> > +		ncpus = node_vectors[n].ncpus;
> > +		nvectors = max_t(unsigned, 1,
> > +				 numvecs * ncpus / remaining_ncpus);
> > +
> > +		node_vectors[n].nvectors = nvectors;
> > +		remaining_ncpus -= ncpus;
> > +		numvecs -= nvectors;
> > +	}
> 
> This looks good to me.
> 
[...]

> > +	node_vectors = kcalloc(nr_node_ids,
> > +			       sizeof(struct node_nr_vectors),
> > +			       GFP_KERNEL);
> > +	if (!node_vectors)
> > +		return 0;
> 
> I think we need to get this -ENOMEM condition back to the caller and
> have that condition handled.

Good point.

> 
> > @@ -165,13 +250,21 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> >  			}
> >  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> >  						cpus_per_vec);
> > +			/*
> > +			 * alloc_nodes_vectors() is intelligent enough to
> > +			 * allocate vectors on all nodes, so wrapping
> > +			 * shouldn't usually be triggered. However, if it
> > +			 * happens because the allocated vectors outnumber
> > +			 * this node's CPUs due to round-down, wrap to the
> > +			 * first vector allocated for this node so that
> > +			 * cross-node spread is avoided.
> > +			 */
> > +			if (curvec >= last_affv)
> > +				curvec -= v;
> 
> Could you explain again how this could happen? The round-down should mean we
> apply a vector to more CPUs so that the number of vectors applied to a
> node within the loop should never require wrapping to hit all those CPUs.
> And if that's true, the check should probably be a warn because it
> should never happen.

You are right.

We should simply spread from the 1st vector for this node if there are
more vectors left to handle.

> 
> In any case, if you can hit that condition where curvec >= last_affv,
> the assignment to masks[curvec] just above may be out-of-bounds.

Yeah, 'curvec >= last_affv' shouldn't be needed.

Will fix them in V3.


Thanks,
Ming


* Re: [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio
  2019-08-13  7:41     ` Ming Lei
@ 2019-08-13  9:26       ` Ming Lei
  2019-08-13 13:25         ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2019-08-13  9:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-kernel, linux-nvme, Thomas Gleixner,
	Christoph Hellwig, Jon Derrick

On Tue, Aug 13, 2019 at 03:41:12PM +0800, Ming Lei wrote:
> On Mon, Aug 12, 2019 at 09:27:18AM -0600, Keith Busch wrote:
> > On Mon, Aug 12, 2019 at 05:57:08PM +0800, Ming Lei wrote:

[...]

> > > @@ -165,13 +250,21 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> > >  			}
> > >  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> > >  						cpus_per_vec);
> > > +			/*
> > > +			 * alloc_nodes_vectors() is intelligent enough to
> > > +			 * allocate vectors on all nodes, so wrapping
> > > +			 * shouldn't usually be triggered. However, if it
> > > +			 * happens because the allocated vectors outnumber
> > > +			 * this node's CPUs due to round-down, wrap to the
> > > +			 * first vector allocated for this node so that
> > > +			 * cross-node spread is avoided.
> > > +			 */
> > > +			if (curvec >= last_affv)
> > > +				curvec -= v;
> > 
> > Could you explain again how this could happen? The round-down should mean we
> > apply a vector to more CPUs so that the number of vectors applied to a
> > node within the loop should never require wrapping to hit all those CPUs.
> > And if that's true, the check should probably be a warn because it
> > should never happen.
> 
> You are right.
> 
> We should simply spread from the 1st vector for this node if there are
> more vectors left to handle.

oops, the above approach is wrong, same as in V3.

It could only happen on the node with the max CPU count, and then we
can't go ahead: there would be more vectors than CPUs on this node,
all vectors have to be spread, and each CPU can be assigned to only
one vector.

However, I can't observe such a problem at all, so I need to think
about whether it is even possible from a math viewpoint.


thanks,
Ming


* Re: [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio
  2019-08-13  9:26       ` Ming Lei
@ 2019-08-13 13:25         ` Ming Lei
  0 siblings, 0 replies; 8+ messages in thread
From: Ming Lei @ 2019-08-13 13:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-kernel, linux-nvme, Thomas Gleixner,
	Christoph Hellwig, Jon Derrick

On Tue, Aug 13, 2019 at 05:26:51PM +0800, Ming Lei wrote:
> On Tue, Aug 13, 2019 at 03:41:12PM +0800, Ming Lei wrote:
> > On Mon, Aug 12, 2019 at 09:27:18AM -0600, Keith Busch wrote:

[...]

> > > > +			if (curvec >= last_affv)
> > > > +				curvec -= v;
> > > 
> > > Could you explain again how this could happen? The round-down should mean we
> > > apply a vector to more CPUs so that the number of vectors applied to a
> > > node within the loop should never require wrapping to hit all those CPUs.
> > > And if that's true, the check should probably be a warn because it
> > > should never happen.
> > 
> > You are right.
> > 
> > We should simply spread from the 1st vector for this node if there are
> > more vectors left to handle.
> 
> oops, the above approach is wrong, same as in V3.
> 
> It could only happen on the node with the max CPU count, and then we
> can't go ahead: there would be more vectors than CPUs on this node,
> all vectors have to be spread, and each CPU can be assigned to only
> one vector.
> 
> However, I can't observe such a problem at all, so I need to think
> about whether it is even possible from a math viewpoint.

It looks like we can prove that the vector count assigned to every
node is <= the nr_cpu of this node, see the following:

1) suppose there are two nodes: A and B
- ncpu(X) is the CPU count of node X
- vecs(X) is the vector count assigned to node X via this algorithm

	ncpu(A) + ncpu(B) = N
	vecs(A) + vecs(B) = V
	vecs(A) = round_down(ncpu(A) * V / N)
	vecs(B) = V - vecs(A)

N and V are integers, and 2 <= V <= N. Suppose V = N - delta, then
delta is also an integer, and 0 <= delta <= N - 2.

Let's prove:
	vecs(A) <= ncpu(A)
	vecs(B) <= ncpu(B)

2) prove vecs(A) <= ncpu(A)

Since ncpu(A) is an integer, pulling it out of the round_down turns
the round-down of the subtracted term into a round-up:

vecs(A) =
	round_down(ncpu(A) * (N - delta) / N) =
		ncpu(A) - round_up(ncpu(A) * delta / N)

so vecs(A) is <= ncpu(A).

3) prove vecs(B) <= ncpu(B)

vecs(B) = V - vecs(A) =
	N - delta - ncpu(A) + round_up(ncpu(A) * delta / N) =
	ncpu(B) - delta + round_up(ncpu(A) * delta / N)

Since ncpu(A) < N, we have ncpu(A) * delta / N <= delta, and because
delta is an integer, round_up(ncpu(A) * delta / N) <= delta as well,
so:

	vecs(B) <= ncpu(B) - delta + delta = ncpu(B)

For nodes >= 3, the nodes can be thought of as one node plus one big
combined node, given that is exactly how this algorithm is implemented
('remaining_ncpus' and 'numvecs' are updated in each loop), and
finally, for each node X, we can prove vecs(X) <= ncpu(X).
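
The two-node case above is also easy to sanity-check by brute force.
Below is a quick userspace sketch (not kernel code, just written for
this mail) that models the assignment order described above, where the
smaller node is handled first and gets max(1, round_down(ncpu * V / N))
vectors while the other node takes the rest, and asserts that
vecs(X) <= ncpu(X) holds for all small cases:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned int N, V, a;

	for (N = 2; N <= 256; N++) {
		for (V = 2; V <= N; V++) {
			for (a = 1; a < N; a++) {
				unsigned int b = N - a;
				/* the smaller node is handled first */
				unsigned int small = a < b ? a : b;
				unsigned int big = N - small;
				/* max(1, round_down(ncpu * V / N)) */
				unsigned int vs = V * small / N;

				if (vs < 1)
					vs = 1;
				assert(vs <= small);
				/* the bigger node takes the rest */
				assert(V - vs <= big);
			}
		}
	}
	printf("verified vecs(X) <= ncpu(X) for all N <= 256\n");
	return 0;
}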


thanks,
Ming


Thread overview: 8+ messages

2019-08-12  9:57 [PATCH V2 0/3] genirq/affinity: Make vector allocation fair Ming Lei
2019-08-12  9:57 ` [PATCH V2 1/3] genirq/affinity: Improve __irq_build_affinity_masks() Ming Lei
2019-08-12  9:57 ` [PATCH V2 2/3] genirq/affinity: Spread vectors on node according to nr_cpu ratio Ming Lei
2019-08-12 15:27   ` Keith Busch
2019-08-13  7:41     ` Ming Lei
2019-08-13  9:26       ` Ming Lei
2019-08-13 13:25         ` Ming Lei
2019-08-12  9:57 ` [PATCH V2 3/3] genirq/affinity: Enhance warning check Ming Lei
