linux-kernel.vger.kernel.org archive mirror
* [PATCH][RFC] x86/irq: Spread vectors on different CPUs
@ 2017-05-13 12:40 Chen Yu
  2017-06-28 19:03 ` Thomas Gleixner
  0 siblings, 1 reply; 3+ messages in thread
From: Chen Yu @ 2017-05-13 12:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Mika Westerberg, Rui Zhang, Chen Yu,
	Thomas Gleixner, Ingo Molnar, Anvin, H Peter, Van De Ven, Arjan,
	Brown, Len, Wysocki, Rafael J

We recently encountered a CPU offline problem on a 16-core server
when doing hibernation:

CPU 31 disable failed: CPU has 62 vectors assigned and there are only 0 available.

This is because:
1. One of the drivers has declared many vector resources via
   pci_enable_msix_range(); say, this driver might want
   to reserve 6 per logical CPU, so there would be 192 of them.
2. Besides, most of the vectors for this driver are allocated
   on CPU0 due to the current code strategy, so there would be
   insufficient slots left on CPU0 to receive any migrated
   IRQs from the other CPUs during CPU offline.
3. Furthermore, many vectors this driver reserved do not have
   any IRQ handler attached.

As a result, all vectors on CPU0 were used up and the last alive
CPU (31) failed to migrate its IRQs to CPU0.

As it might be difficult to reduce the number of vectors reserved
by that driver, a compromise solution could be to spread
vector allocation across different CPUs rather than always choosing
the *first* CPU in the cpumask. In this way, there would be a balanced
vector distribution. Because the many vectors reserved but unused (point 3
above) are not counted during CPU offline, and they would now be
on nonboot CPUs, this problem would be solved.
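The effect of spreading can be sketched with a small userspace toy (not kernel code; the per-CPU vector budget, CPU count, and the function name cpu0_free_after are illustrative round numbers and names, not the exact values from the machine above):

```c
#include <assert.h>

#define NR_DEMO_CPUS     32
#define USABLE_VECTORS  200

/* Free slots left on CPU0 after allocating 'n' vectors, either all on
 * CPU0 (spread == 0) or round-robin across all CPUs (spread == 1). */
static int cpu0_free_after(int n, int spread)
{
	int used[NR_DEMO_CPUS] = { 0 };
	int i;

	for (i = 0; i < n; i++)
		used[spread ? i % NR_DEMO_CPUS : 0]++;
	return USABLE_VECTORS - used[0];
}
```

With 192 vectors pinned on CPU0, almost no slots remain there for IRQs migrated during CPU offline; spread round-robin, each CPU carries only 6 of them.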

Here's a trial version of this proposal, and it works in my case.
It simply tries to find the target CPU with the fewest vectors allocated
(AKA, the 'idlest' CPU). The algorithm can be optimized, but
first I'd like to get suggestions from you experts on whether this is
the right direction, or what the proper solution for this kind
of problem would be. Any comments/suggestions are appreciated.

Reported-by: Xiang Li <xiang.z.li@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Anvin, H Peter" <h.peter.anvin@intel.com>
Cc: "Van De Ven, Arjan" <arjan.van.de.ven@intel.com>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: "Wysocki, Rafael J" <rafael.j.wysocki@intel.com>
Cc: x86@kernel.org
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 arch/x86/kernel/apic/vector.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f3557a1..d220365 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -102,6 +102,27 @@ static void free_apic_chip_data(struct apic_chip_data *data)
 	}
 }
 
+static int pick_leisure_cpu(const struct cpumask *mask)
+{
+	int cpu, vector;
+	int min_nr_vector = NR_VECTORS;
+	int target_cpu = cpumask_first_and(mask, cpu_online_mask);
+
+	for_each_cpu_and(cpu, mask, cpu_online_mask) {
+		int nr_vectors = 0;
+
+		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
+			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
+				nr_vectors++;
+		}
+		if (nr_vectors < min_nr_vector) {
+			min_nr_vector = nr_vectors;
+			target_cpu = cpu;
+		}
+	}
+	return target_cpu;
+}
+
 static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 			       const struct cpumask *mask)
 {
@@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	/* Only try and allocate irqs on cpus that are present */
 	cpumask_clear(d->old_domain);
 	cpumask_clear(searched_cpumask);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = pick_leisure_cpu(mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, offset;
 
-- 
2.7.4


* Re: [PATCH][RFC] x86/irq: Spread vectors on different CPUs
  2017-05-13 12:40 [PATCH][RFC] x86/irq: Spread vectors on different CPUs Chen Yu
@ 2017-06-28 19:03 ` Thomas Gleixner
  2017-07-05 14:02   ` Chen Yu
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gleixner @ 2017-06-28 19:03 UTC (permalink / raw)
  To: Chen Yu
  Cc: x86, LKML, Mika Westerberg, Rui Zhang, Ingo Molnar, Anvin,
	H Peter, Van De Ven, Arjan, Brown, Len, Wysocki, Rafael J,
	Christoph Hellwig

On Sat, 13 May 2017, Chen Yu wrote:
> This is because:
> 1. One of the drivers has declared many vector resources via
>    pci_enable_msix_range(); say, this driver might want
>    to reserve 6 per logical CPU, so there would be 192 of them.

This has been solved with the managed interrupt mechanism for multi-queue
devices, which no longer migrates these interrupts. It simply shuts
them down and restarts them when the CPU comes back online.

> 2. Besides, most of the vectors for this driver are allocated
>    on CPU0 due to the current code strategy, so there would be
>    insufficient slots left on CPU0 to receive any migrated
>    IRQs from the other CPUs during CPU offline.

That's driver silliness. I assume it's a multi-queue device, so it should
use the managed interrupt affinity code.

> 3. Furthermore, many vectors this driver reserved do not have
>    any IRQ handler attached.

That's silly. Why does a driver allocate more resources than required? Just
because it can?

> As a result, all vectors on CPU0 were used up and the last alive
> CPU (31) failed to migrate its IRQs to CPU0.
> 
> As it might be difficult to reduce the number of vectors reserved
> by that driver,

Why so? If the vectors are not used, then why are they allocated in the
first place? Can you point at the driver source please?

> a compromise solution could be to spread
> vector allocation across different CPUs rather than always choosing
> the *first* CPU in the cpumask. In this way, there would be a balanced
> vector distribution. Because the many vectors reserved but unused (point 3
> above) are not counted during CPU offline, and they would now be
> on nonboot CPUs, this problem would be solved.

You are tackling the problem at the wrong end to some extent. I'm not
against sanely spreading interrupts, but we only want to do that, if there
is a compelling reason. A broken driver does not count.
  
> +static int pick_leisure_cpu(const struct cpumask *mask)
> +{
> +	int cpu, vector;
> +	int min_nr_vector = NR_VECTORS;
> +	int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> +
> +	for_each_cpu_and(cpu, mask, cpu_online_mask) {
> +		int nr_vectors = 0;
> +
> +		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> +			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> +				nr_vectors++;
> +		}
> +		if (nr_vectors < min_nr_vector) {
> +			min_nr_vector = nr_vectors;
> +			target_cpu = cpu;
> +		}

That's beyond silly. Why would we have to count the available vectors over
and over? We can keep track of the vectors available per cpu continuously
when we allocate and free them.
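A minimal userspace sketch of the bookkeeping suggested here: the counter is updated at allocation/free time, so picking the least-loaded CPU becomes a single scan instead of walking vector_irq[] for every CPU. All names (vectors_in_use, demo_vector_alloc, pick_leisure_cpu) are illustrative, not actual kernel symbols:

```c
#include <assert.h>
#include <limits.h>

#define NR_DEMO_CPUS 4

/* Per-CPU count of vectors in use, kept current at alloc/free time. */
static int vectors_in_use[NR_DEMO_CPUS];

static void demo_vector_alloc(int cpu) { vectors_in_use[cpu]++; }
static void demo_vector_free(int cpu)  { vectors_in_use[cpu]--; }

/* O(nr_cpus) scan, no per-vector loop needed. */
static int pick_leisure_cpu(void)
{
	int cpu, best = 0, min = INT_MAX;

	for (cpu = 0; cpu < NR_DEMO_CPUS; cpu++) {
		if (vectors_in_use[cpu] < min) {
			min = vectors_in_use[cpu];
			best = cpu;
		}
	}
	return best;
}
```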

> +	}
> +	return target_cpu;
> +}
> +
>  static int __assign_irq_vector(int irq, struct apic_chip_data *d,
>  			       const struct cpumask *mask)
>  {
> @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
>  	/* Only try and allocate irqs on cpus that are present */
>  	cpumask_clear(d->old_domain);
>  	cpumask_clear(searched_cpumask);
> -	cpu = cpumask_first_and(mask, cpu_online_mask);
> +	cpu = pick_leisure_cpu(mask);
>  	while (cpu < nr_cpu_ids) {
>  		int new_cpu, offset;

I'm really not fond of this extra loop. That function is a huge nested loop
mess already. The cpu we select upfront gets fed into
apic->vector_allocation_domain() which might just say no because the CPU
does not belong to it. So yes, it kinda works for your purpose, but it's
just a bolted on 'solution'.

We really need to sit down and rewrite that allocation code from scratch,
use per cpu bitmaps, so we can do fast search over masks and keep track of
the vectors which are used/free per cpu.

That needs some thought because we need to respect the allocation domains,
but there must be a saner way than looping with interrupts disabled for a
gazillion hoops.
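A userspace sketch of the per-CPU bitmap idea, for one CPU's bitmap: counting used vectors becomes a popcount and finding a free one a linear bit scan, with no need to inspect vector_irq[] entries. Names and the use of a GCC builtin are illustrative; the kernel would use its own bitmap helpers:

```c
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS 256
#define VEC_WORDS  (NR_VECTORS / 64)

/* One CPU's vector bitmap: bit set means the vector is in use. */
static uint64_t used_vectors[VEC_WORDS];

static void vec_set_used(int vec)
{
	used_vectors[vec / 64] |= (uint64_t)1 << (vec % 64);
}

/* Popcount over the bitmap words gives the per-CPU usage count. */
static int vec_count_used(void)
{
	int i, n = 0;

	for (i = 0; i < VEC_WORDS; i++)
		n += __builtin_popcountll(used_vectors[i]);
	return n;
}

/* First free vector at or above 'from', or -1 if none. */
static int vec_find_free(int from)
{
	int vec;

	for (vec = from; vec < NR_VECTORS; vec++)
		if (!(used_vectors[vec / 64] & ((uint64_t)1 << (vec % 64))))
			return vec;
	return -1;
}
```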

What the current code misses completely is to set a preference for the node
on which the interrupt was allocated. We just blindly take something out of
the default affinity mask which is all online cpus.
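The node-preference idea could look roughly like this in a userspace sketch: try CPUs on the interrupt's home NUMA node first, and only fall back to other nodes when the home node has no vector space left. The cpu-to-node table and all function names are made up for illustration:

```c
#include <assert.h>

#define NR_DEMO_CPUS  4
#define NR_DEMO_NODES 2

static const int cpu_node[NR_DEMO_CPUS] = { 0, 0, 1, 1 }; /* cpu -> node */
static int free_vecs[NR_DEMO_CPUS];       /* free vector slots per CPU */

/* CPU with the most free vectors on 'node', or -1 if the node is full. */
static int pick_on_node(int node)
{
	int cpu, best = -1, most = 0;

	for (cpu = 0; cpu < NR_DEMO_CPUS; cpu++) {
		if (cpu_node[cpu] != node)
			continue;
		if (free_vecs[cpu] > most) {
			most = free_vecs[cpu];
			best = cpu;
		}
	}
	return best;
}

/* Prefer the interrupt's home node; fall back to the other nodes. */
static int pick_cpu_for_irq(int irq_node)
{
	int node, cpu = pick_on_node(irq_node);

	if (cpu >= 0)
		return cpu;
	for (node = 0; node < NR_DEMO_NODES; node++) {
		if (node == irq_node)
			continue;
		cpu = pick_on_node(node);
		if (cpu >= 0)
			return cpu;
	}
	return -1;
}
```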

So there are a lot of shortcomings in the current implementation and we
should be able to replace it with something better, faster and while at it
make that spreading stuff work.

That does not mean, that we blindly support drivers allocating a gazillion
vectors just for nothing.

Thanks,

	tglx


* Re: [PATCH][RFC] x86/irq: Spread vectors on different CPUs
  2017-06-28 19:03 ` Thomas Gleixner
@ 2017-07-05 14:02   ` Chen Yu
  0 siblings, 0 replies; 3+ messages in thread
From: Chen Yu @ 2017-07-05 14:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Williams, Mitch A, Ramamurthy, Harshitha, Ingo Molnar,
	Van De Ven, Arjan, Anvin, H Peter, Rafael J. Wysocki, Len Brown,
	Rui Zhang, Mika Westerberg, linux-kernel, x86

Hi Thomas,
On Wed, Jun 28, 2017 at 09:03:19PM +0200, Thomas Gleixner wrote:
> On Sat, 13 May 2017, Chen Yu wrote:
> > This is because:
> > 1. One of the drivers has declared many vector resources via
> >    pci_enable_msix_range(); say, this driver might want
> >    to reserve 6 per logical CPU, so there would be 192 of them.
> 
> This has been solved with the managed interrupt mechanism for multi-queue
> devices, which no longer migrates these interrupts. It simply shuts
> them down and restarts them when the CPU comes back online.
> 
I think the driver has already enabled multi-queue:
ethtool -l enp187s0f0
Channel parameters for enp187s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		1
Combined:	128
Current hardware settings:
RX:		0
TX:		0
Other:		1
Combined:	32
> > 2. Besides, most of the vectors for this driver are allocated
> >    on CPU0 due to the current code strategy, so there would be
> >    insufficient slots left on CPU0 to receive any migrated
> >    IRQs from the other CPUs during CPU offline.
> 
> That's driver silliness. I assume it's a multi-queue device, so it should
> use the managed interrupt affinity code.
> 
Let me Cc my colleagues Mitch and Harshitha, who are currently working
on this driver.
> > 3. Furthermore, many vectors this driver reserved do not have
> >    any IRQ handler attached.
> 
> That's silly. Why does a driver allocate more resources than required? Just
> because it can?
> 
The reason why the driver reserved the resources without using them
might be that (well, it is just my guess) no network cable is plugged
into the interface at the moment, although the driver has been
loaded. I will confirm whether this is the case first.
> > As a result, all vectors on CPU0 were used up and the last alive
> > CPU (31) failed to migrate its IRQs to CPU0.
> > 
> > As it might be difficult to reduce the number of vectors reserved
> > by that driver,
> 
> Why so? If the vectors are not used, then why are they allocated in the
> first place? Can you point at the driver source please?
> 
We are testing the net/ethernet/intel/i40e driver now. Previously we
introduced a mechanism to release some vectors before S4 to work
around this issue, but as you mentioned before, if these are multi-queue
interrupts, the release will be accomplished automatically during CPU
hotplug. I wonder whether the reserved vectors will also be released
even without a handler attached?
> > a compromise solution could be to spread
> > vector allocation across different CPUs rather than always choosing
> > the *first* CPU in the cpumask. In this way, there would be a balanced
> > vector distribution. Because the many vectors reserved but unused (point 3
> > above) are not counted during CPU offline, and they would now be
> > on nonboot CPUs, this problem would be solved.
> 
> You are tackling the problem at the wrong end to some extent. I'm not
> against sanely spreading interrupts, but we only want to do that, if there
> is a compelling reason. A broken driver does not count.
>   
Understand.
> > +static int pick_leisure_cpu(const struct cpumask *mask)
> > +{
> > +	int cpu, vector;
> > +	int min_nr_vector = NR_VECTORS;
> > +	int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> > +
> > +	for_each_cpu_and(cpu, mask, cpu_online_mask) {
> > +		int nr_vectors = 0;
> > +
> > +		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> > +			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> > +				nr_vectors++;
> > +		}
> > +		if (nr_vectors < min_nr_vector) {
> > +			min_nr_vector = nr_vectors;
> > +			target_cpu = cpu;
> > +		}
> 
> That's beyond silly. Why would we have to count the available vectors over
> and over? We can keep track of the vectors available per cpu continuously
> when we allocate and free them.
> 
Yes, that would be much more efficient:)
> > +	}
> > +	return target_cpu;
> > +}
> > +
> >  static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >  			       const struct cpumask *mask)
> >  {
> > @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >  	/* Only try and allocate irqs on cpus that are present */
> >  	cpumask_clear(d->old_domain);
> >  	cpumask_clear(searched_cpumask);
> > -	cpu = cpumask_first_and(mask, cpu_online_mask);
> > +	cpu = pick_leisure_cpu(mask);
> >  	while (cpu < nr_cpu_ids) {
> >  		int new_cpu, offset;
> 
> I'm really not fond of this extra loop. That function is a huge nested loop
> mess already. The cpu we select upfront gets fed into
> apic->vector_allocation_domain() which might just say no because the CPU
> does not belong to it. So yes, it kinda works for your purpose, but it's
> just a bolted on 'solution'.
> 
> We really need to sit down and rewrite that allocation code from scratch,
> use per cpu bitmaps, so we can do fast search over masks and keep track of
> the vectors which are used/free per cpu.
> 
> That needs some thought because we need to respect the allocation domains,
> but there must be a saner way than looping with interrupts disabled for a
> gazillion hoops.
> 
> What the current code misses completely is to set a preference for the node
> on which the interrupt was allocated. We just blindly take something out of
> the default affinity mask which is all online cpus.
> 
> So there are a lot of shortcomings in the current implementation and we
> should be able to replace it with something better, faster and while at it
> make that spreading stuff work.
Agree. That sounds like a good improvement; let me check the i40e driver
first and then see if this can be revised according to this design.
> 
> That does not mean, that we blindly support drivers allocating a gazillion
> vectors just for nothing.
> 
Thanks,
	Yu
> Thanks,
> 
> 	tglx
> 

