[PATCH] genirq/affinity: report extra vectors on uneven nodes

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] genirq/affinity: report extra vectors on uneven nodes
@ 2019-08-07 20:10 Jon Derrick
  2019-08-08  7:04 ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Derrick @ 2019-08-07 20:10 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel; +Cc: linux-nvme, Jon Derrick

The current irq spreading algorithm spreads vectors amongst cpus evenly
per node. If a node has more cpus than another node, the extra vectors
being spread may not be reported back to the caller.

This is most apparent with the NVMe driver and nr_cpus < vectors, where
the underreporting results in the caller's WARN being triggered:

irq_build_affinity_masks()
...
	if (nr_present < numvecs)
		WARN_ON(nr_present + nr_others < numvecs);

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 kernel/irq/affinity.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4352b08ae48d..9beafb8c7e92 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -127,7 +127,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 	}
 
 	for_each_node_mask(n, nodemsk) {
-		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
+		unsigned int ncpus, v, vecs_to_assign, total_vecs_to_assign,
+			vecs_per_node;
 
 		/* Spread the vectors per node */
 		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
@@ -141,14 +142,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 
 		/* Account for rounding errors */
 		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
+		total_vecs_to_assign = vecs_to_assign + extra_vecs;
 
-		for (v = 0; curvec < last_affv && v < vecs_to_assign;
+		for (v = 0; curvec < last_affv && v < total_vecs_to_assign;
 		     curvec++, v++) {
 			cpus_per_vec = ncpus / vecs_to_assign;
 
 			/* Account for extra vectors to compensate rounding errors */
 			if (extra_vecs) {
 				cpus_per_vec++;
+				v++;
 				--extra_vecs;
 			}
 			irq_spread_init_one(&masks[curvec].mask, nmsk,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] genirq/affinity: report extra vectors on uneven nodes
  2019-08-07 20:10 [PATCH] genirq/affinity: report extra vectors on uneven nodes Jon Derrick
@ 2019-08-08  7:04 ` Thomas Gleixner
  2019-08-08 16:32   ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2019-08-08  7:04 UTC (permalink / raw)
  To: Jon Derrick; +Cc: LKML, linux-nvme, Ming Lei, Christoph Hellwig

On Wed, 7 Aug 2019, Jon Derrick wrote:

Cc+: Ming, Christoph.

Left context for reference.

> The current irq spreading algorithm spreads vectors amongst cpus evenly
> per node. If a node has more cpus than another node, the extra vectors
> being spread may not be reported back to the caller.
> 
> This is most apparent with the NVMe driver and nr_cpus < vectors, where
> the underreporting results in the caller's WARN being triggered:
> 
> irq_build_affinity_masks()
> ...
> 	if (nr_present < numvecs)
> 		WARN_ON(nr_present + nr_others < numvecs);
> 
> Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> ---
>  kernel/irq/affinity.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 4352b08ae48d..9beafb8c7e92 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -127,7 +127,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
>  	}
>  
>  	for_each_node_mask(n, nodemsk) {
> -		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
> +		unsigned int ncpus, v, vecs_to_assign, total_vecs_to_assign,
> +			vecs_per_node;
>  
>  		/* Spread the vectors per node */
>  		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
> @@ -141,14 +142,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
>  
>  		/* Account for rounding errors */
>  		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
> +		total_vecs_to_assign = vecs_to_assign + extra_vecs;
>  
> -		for (v = 0; curvec < last_affv && v < vecs_to_assign;
> +		for (v = 0; curvec < last_affv && v < total_vecs_to_assign;
>  		     curvec++, v++) {
>  			cpus_per_vec = ncpus / vecs_to_assign;
>  
>  			/* Account for extra vectors to compensate rounding errors */
>  			if (extra_vecs) {
>  				cpus_per_vec++;
> +				v++;
>  				--extra_vecs;
>  			}
>  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> -- 
> 2.20.1
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] genirq/affinity: report extra vectors on uneven nodes
  2019-08-08  7:04 ` Thomas Gleixner
@ 2019-08-08 16:32   ` Keith Busch
  2019-08-08 22:46     ` Derrick, Jonathan
  2019-08-09  3:04     ` Ming Lei
  0 siblings, 2 replies; 6+ messages in thread
From: Keith Busch @ 2019-08-08 16:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jon Derrick, LKML, linux-nvme, Ming Lei, Christoph Hellwig

On Thu, Aug 08, 2019 at 09:04:28AM +0200, Thomas Gleixner wrote:
> On Wed, 7 Aug 2019, Jon Derrick wrote:
> > The current irq spreading algorithm spreads vectors amongst cpus evenly
> > per node. If a node has more cpus than another node, the extra vectors
> > being spread may not be reported back to the caller.
> > 
> > This is most apparent with the NVMe driver and nr_cpus < vectors, where
> > the underreporting results in the caller's WARN being triggered:
> > 
> > irq_build_affinity_masks()
> > ...
> > 	if (nr_present < numvecs)
> > 		WARN_ON(nr_present + nr_others < numvecs);
> > 
> > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > ---
> >  kernel/irq/affinity.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > index 4352b08ae48d..9beafb8c7e92 100644
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -127,7 +127,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> >  	}
> >  
> >  	for_each_node_mask(n, nodemsk) {
> > -		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
> > +		unsigned int ncpus, v, vecs_to_assign, total_vecs_to_assign,
> > +			vecs_per_node;
> >  
> >  		/* Spread the vectors per node */
> >  		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
> > @@ -141,14 +142,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> >  
> >  		/* Account for rounding errors */
> >  		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
> > +		total_vecs_to_assign = vecs_to_assign + extra_vecs;
> >  
> > -		for (v = 0; curvec < last_affv && v < vecs_to_assign;
> > +		for (v = 0; curvec < last_affv && v < total_vecs_to_assign;
> >  		     curvec++, v++) {
> >  			cpus_per_vec = ncpus / vecs_to_assign;
> >  
> >  			/* Account for extra vectors to compensate rounding errors */
> >  			if (extra_vecs) {
> >  				cpus_per_vec++;
> > +				v++;
> >  				--extra_vecs;
> >  			}
> >  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> > -- 

This looks like it will break the spread to non-present CPUs since
it's not accurately reporting how many vectors were assigned for the
present spread.

I think the real problem is the spread's vecs_per_node doesn't account
which nodes contribute more CPUs than others. For example:

  Node 0 has 32 CPUs
  Node 1 has 8 CPUs
  Assign 32 vectors

The current algorithm assigns 16 vectors to node 0 because vecs_per_node
is calculated as 32 vectors / 2 nodes on the first iteration. The
subsequent iteration for node 1 gets 8 vectors because it has only 8
CPUs, leaving 8 vectors unassigned.

A more fair spread would give node 0 the remaining 8 vectors. This
optimization, however, is a bit more complex than the current algorithm,
which is probably why it wasn't done, so I think the warning should just
be removed.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] genirq/affinity: report extra vectors on uneven nodes
  2019-08-08 16:32   ` Keith Busch
@ 2019-08-08 22:46     ` Derrick, Jonathan
  2019-08-08 23:08       ` Keith Busch
  2019-08-09  3:04     ` Ming Lei
  1 sibling, 1 reply; 6+ messages in thread
From: Derrick, Jonathan @ 2019-08-08 22:46 UTC (permalink / raw)
  To: kbusch, tglx; +Cc: hch, linux-kernel, linux-nvme, ming.lei

[-- Attachment #1: Type: text/plain, Size: 3268 bytes --]

On Thu, 2019-08-08 at 10:32 -0600, Keith Busch wrote:
> On Thu, Aug 08, 2019 at 09:04:28AM +0200, Thomas Gleixner wrote:
> > On Wed, 7 Aug 2019, Jon Derrick wrote:
> > > The current irq spreading algorithm spreads vectors amongst cpus evenly
> > > per node. If a node has more cpus than another node, the extra vectors
> > > being spread may not be reported back to the caller.
> > > 
> > > This is most apparent with the NVMe driver and nr_cpus < vectors, where
> > > the underreporting results in the caller's WARN being triggered:
> > > 
> > > irq_build_affinity_masks()
> > > ...
> > > 	if (nr_present < numvecs)
> > > 		WARN_ON(nr_present + nr_others < numvecs);
> > > 
> > > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > > ---
> > >  kernel/irq/affinity.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > > index 4352b08ae48d..9beafb8c7e92 100644
> > > --- a/kernel/irq/affinity.c
> > > +++ b/kernel/irq/affinity.c
> > > @@ -127,7 +127,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> > >  	}
> > >  
> > >  	for_each_node_mask(n, nodemsk) {
> > > -		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
> > > +		unsigned int ncpus, v, vecs_to_assign, total_vecs_to_assign,
> > > +			vecs_per_node;
> > >  
> > >  		/* Spread the vectors per node */
> > >  		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
> > > @@ -141,14 +142,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> > >  
> > >  		/* Account for rounding errors */
> > >  		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
> > > +		total_vecs_to_assign = vecs_to_assign + extra_vecs;
> > >  
> > > -		for (v = 0; curvec < last_affv && v < vecs_to_assign;
> > > +		for (v = 0; curvec < last_affv && v < total_vecs_to_assign;
> > >  		     curvec++, v++) {
> > >  			cpus_per_vec = ncpus / vecs_to_assign;
> > >  
> > >  			/* Account for extra vectors to compensate rounding errors */
> > >  			if (extra_vecs) {
> > >  				cpus_per_vec++;
> > > +				v++;
> > >  				--extra_vecs;
> > >  			}
> > >  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> > > -- 
> 
> This looks like it will break the spread to non-present CPUs since
> it's not accurately reporting how many vectors were assigned for the
> present spread.
> 
> I think the real problem is the spread's vecs_per_node doesn't account
> which nodes contribute more CPUs than others. For example:
> 
>   Node 0 has 32 CPUs
>   Node 1 has 8 CPUs
>   Assign 32 vectors
> 
> The current algorithm assigns 16 vectors to node 0 because vecs_per_node
> is calculated as 32 vectors / 2 nodes on the first iteration. The
> subsequent iteration for node 1 gets 8 vectors because it has only 8
> CPUs, leaving 8 vectors unassigned.
> 
> A more fair spread would give node 0 the remaining 8 vectors. This
> optimization, however, is a bit more complex than the current algorithm,
> which is probably why it wasn't done, so I think the warning should just
> be removed.

It does get a bit complex for the rare scenario in this case
Maybe just an informational warning rather than a stackdumping warning

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3278 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] genirq/affinity: report extra vectors on uneven nodes
  2019-08-08 22:46     ` Derrick, Jonathan
@ 2019-08-08 23:08       ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2019-08-08 23:08 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: tglx, ming.lei, hch, linux-nvme, linux-kernel

On Thu, Aug 08, 2019 at 10:46:06PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-08-08 at 10:32 -0600, Keith Busch wrote:
> > 
> > I think the real problem is the spread's vecs_per_node doesn't account
> > which nodes contribute more CPUs than others. For example:
> > 
> >   Node 0 has 32 CPUs
> >   Node 1 has 8 CPUs
> >   Assign 32 vectors
> > 
> > The current algorithm assigns 16 vectors to node 0 because vecs_per_node
> > is calculated as 32 vectors / 2 nodes on the first iteration. The
> > subsequent iteration for node 1 gets 8 vectors because it has only 8
> > CPUs, leaving 8 vectors unassigned.
> > 
> > A more fair spread would give node 0 the remaining 8 vectors. This
> > optimization, however, is a bit more complex than the current algorithm,
> > which is probably why it wasn't done, so I think the warning should just
> > be removed.
> 
> It does get a bit complex for the rare scenario in this case
> Maybe just an informational warning rather than a stackdumping warning

I think the easiest way to ensure all vectors are assigned is iterate
the nodes in a sorted order from fewest CPUs to most. That should fix
the warning, though it may not have the best possible assignment ratio
(but better than what we're currently doing).

Unfortunately the kernel's sort() doesn't take a 'void *priv' for the
compare callback, so we wouldn't have all the information needed to weigh
each node, but maybe we can fix that if there's agreement to iterate
the nodes this way.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] genirq/affinity: report extra vectors on uneven nodes
  2019-08-08 16:32   ` Keith Busch
  2019-08-08 22:46     ` Derrick, Jonathan
@ 2019-08-09  3:04     ` Ming Lei
  1 sibling, 0 replies; 6+ messages in thread
From: Ming Lei @ 2019-08-09  3:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Thomas Gleixner, linux-nvme, Christoph Hellwig, LKML, Jon Derrick

On Thu, Aug 08, 2019 at 10:32:24AM -0600, Keith Busch wrote:
> On Thu, Aug 08, 2019 at 09:04:28AM +0200, Thomas Gleixner wrote:
> > On Wed, 7 Aug 2019, Jon Derrick wrote:
> > > The current irq spreading algorithm spreads vectors amongst cpus evenly
> > > per node. If a node has more cpus than another node, the extra vectors
> > > being spread may not be reported back to the caller.
> > > 
> > > This is most apparent with the NVMe driver and nr_cpus < vectors, where
> > > the underreporting results in the caller's WARN being triggered:
> > > 
> > > irq_build_affinity_masks()
> > > ...
> > > 	if (nr_present < numvecs)
> > > 		WARN_ON(nr_present + nr_others < numvecs);
> > > 
> > > Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
> > > ---
> > >  kernel/irq/affinity.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > > index 4352b08ae48d..9beafb8c7e92 100644
> > > --- a/kernel/irq/affinity.c
> > > +++ b/kernel/irq/affinity.c
> > > @@ -127,7 +127,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> > >  	}
> > >  
> > >  	for_each_node_mask(n, nodemsk) {
> > > -		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
> > > +		unsigned int ncpus, v, vecs_to_assign, total_vecs_to_assign,
> > > +			vecs_per_node;
> > >  
> > >  		/* Spread the vectors per node */
> > >  		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
> > > @@ -141,14 +142,16 @@ static int __irq_build_affinity_masks(unsigned int startvec,
> > >  
> > >  		/* Account for rounding errors */
> > >  		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
> > > +		total_vecs_to_assign = vecs_to_assign + extra_vecs;
> > >  
> > > -		for (v = 0; curvec < last_affv && v < vecs_to_assign;
> > > +		for (v = 0; curvec < last_affv && v < total_vecs_to_assign;
> > >  		     curvec++, v++) {
> > >  			cpus_per_vec = ncpus / vecs_to_assign;
> > >  
> > >  			/* Account for extra vectors to compensate rounding errors */
> > >  			if (extra_vecs) {
> > >  				cpus_per_vec++;
> > > +				v++;
> > >  				--extra_vecs;
> > >  			}
> > >  			irq_spread_init_one(&masks[curvec].mask, nmsk,
> > > -- 
> 
> This looks like it will break the spread to non-present CPUs since
> it's not accurately reporting how many vectors were assigned for the
> present spread.
> 
> I think the real problem is the spread's vecs_per_node doesn't account
> which nodes contribute more CPUs than others. For example:
> 
>   Node 0 has 32 CPUs
>   Node 1 has 8 CPUs
>   Assign 32 vectors
> 
> The current algorithm assigns 16 vectors to node 0 because vecs_per_node
> is calculated as 32 vectors / 2 nodes on the first iteration. The
> subsequent iteration for node 1 gets 8 vectors because it has only 8
> CPUs, leaving 8 vectors unassigned.
> 
> A more fair spread would give node 0 the remaining 8 vectors. This
> optimization, however, is a bit more complex than the current algorithm,
> which is probably why it wasn't done, so I think the warning should just
> be removed.

Another policy is to assign vectors among nodes according to the
following ratio:

	ncpus in this node / total ncpus in un-assigned nodes  

I have tried the following patch, looks it works fine:

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6fef48033f96..a598f20701a3 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,6 +94,28 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
 	return nodes;
 }
 
+static int nodes_cpus(unsigned start_node, const nodemask_t nodemsk,
+		const cpumask_var_t *node_to_cpumask,
+		const struct cpumask *cpu_mask, struct cpumask *nmsk)
+{
+	unsigned n, ncpus, total_cpus = 0;
+
+	for_each_node_mask(n, nodemsk) {
+		if (n < start_node)
+			continue;
+
+		/* Get the cpus on this node which are in the mask */
+		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
+
+		/* Calculate the number of cpus per vector */
+		ncpus = cpumask_weight(nmsk);
+
+		total_cpus += ncpus;
+	}
+
+	return total_cpus;
+}
+
 static int __irq_build_affinity_masks(unsigned int startvec,
 				      unsigned int numvecs,
 				      unsigned int firstvec,
@@ -128,15 +150,25 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 
 	for_each_node_mask(n, nodemsk) {
 		unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
-
-		/* Spread the vectors per node */
-		vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
+		unsigned int ncpus_left = nodes_cpus(n, nodemsk,
+				node_to_cpumask, cpu_mask, nmsk);
 
 		/* Get the cpus on this node which are in the mask */
 		cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
 
 		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
+
+		/*
+		 * Spread the vectors per node, and node with more CPUs will be
+		 * assigned to more vectors
+		 */
+		vecs_per_node = (numvecs - (curvec - firstvec)) * ncpus / ncpus_left;
+
+		/* at least assign one vector for this node */
+		if (!vecs_per_node)
+			vecs_per_node = 1;
+
 		vecs_to_assign = min(vecs_per_node, ncpus);
 
 		/* Account for rounding errors */
@@ -160,7 +192,6 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 			break;
 		if (curvec >= last_affv)
 			curvec = firstvec;
-		--nodes;
 	}
 	return done;
 }


thanks,
Ming

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-08-09  3:04 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-07 20:10 [PATCH] genirq/affinity: report extra vectors on uneven nodes Jon Derrick
2019-08-08  7:04 ` Thomas Gleixner
2019-08-08 16:32   ` Keith Busch
2019-08-08 22:46     ` Derrick, Jonathan
2019-08-08 23:08       ` Keith Busch
2019-08-09  3:04     ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).