linux-kernel.vger.kernel.org archive mirror
* vector space exhaustion on 4.14 LTS kernels
@ 2018-11-19 22:35 Josh Hunt
  2018-11-21 13:26 ` Thomas Gleixner
From: Josh Hunt @ 2018-11-19 22:35 UTC (permalink / raw)
  To: tglx; +Cc: saeedm, linux-kernel, Ozen, Gurhan

Hi Thomas,

We have a class of machines that appear to be exhausting the vector 
space on CPUs 0 and 1, which causes breakage later on when trying to 
set the affinity. The boxes are running the 4.14 LTS kernel.

I instrumented 4.14 and here's what I see:

[   28.328849] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff 
onlinemask:ff,ffffffff vector:0
[   28.329847] __assign_irq_vector: irq:512 cpu:2 vector:222 cfgvect:0 
off:14 old_domain:00,00000000 domain:00,00000000 
vector_search:00,00000004 update
[   28.329847] default_cpu_mask_to_apicid: irq:512 mask:00,00000004
...
[   31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff 
onlinemask:ff,ffffffff vector:222
[   31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff 
vector_cpumask:00,00000001 vector:222
...
[   31.729154] __assign_irq_vector: irq:512 cpu:2 vector:00,00000004 
domain:00,00000004 success
[   31.729154] default_cpu_mask_to_apicid: irq:512 hwirq:512 
mask:00,00000004
[   31.729154] apic_set_affinity: irq:512 mask:ff,ffffffff err:0
...
[   32.818152] mlx5_irq_set_affinity_hint: 0: irq:512 mask:00,00000001
...
[   39.531242] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001 
onlinemask:ff,ffffffff vector:222
[   39.531244] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001 
vector_cpumask:00,00000001 vector:222
[   39.531245] __assign_irq_vector: irq:512 cpu:0 vector:00,00000001 
domain:00,00000004
...
[   39.531384] __assign_irq_vector: irq:512 cpu:0 vector:37 
current_vector:37 next_cpu2
[   39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,00000001 
vector:00,00000000 continue
[   39.531386] apic_set_affinity: irq:512 mask:00,00000001 err:-28

The affinity values:

root@172.25.48.208:/proc/irq/512# grep . *
affinity_hint:00,00000001
effective_affinity:00,00000004
effective_affinity_list:2
grep: mlx5_comp0@pci:0000:65:00.1: Is a directory
node:0
smp_affinity:ff,ffffffff
smp_affinity_list:0-39
spurious:count 3
spurious:unhandled 0
spurious:last_unhandled 0 ms

I noticed your change, a0c9259dc4e1 "irq/matrix: Spread interrupts on 
allocation", and this sounds like what we're hitting. The problem does 
not occur when booting 4.19. I haven't booted 4.15 yet, but can do so 
to confirm the above commit is what resolves this.

Since 4.14 doesn't have the matrix allocator it's not a trivial 
backport. I was wondering a) if you agree with my assessment and b) if 
there are any plans to resolve this in the 4.14 allocator? If not I can 
attempt to backport the idea to 4.14 and spread the interrupts around 
on allocation.

Thanks
Josh


* Re: vector space exhaustion on 4.14 LTS kernels
  2018-11-19 22:35 vector space exhaustion on 4.14 LTS kernels Josh Hunt
@ 2018-11-21 13:26 ` Thomas Gleixner
From: Thomas Gleixner @ 2018-11-21 13:26 UTC (permalink / raw)
  To: Josh Hunt; +Cc: saeedm, linux-kernel, Ozen, Gurhan

Josh,

On Mon, 19 Nov 2018, Josh Hunt wrote:
> We have a class of machines that appear to be exhausting the vector space on
> CPUs 0 and 1, which causes breakage later on when trying to set the
> affinity. The boxes are running the 4.14 LTS kernel.
> 
> [   39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,00000001
> vector:00,00000000 continue
> [   39.531386] apic_set_affinity: irq:512 mask:00,00000001 err:-28
> 
> The affinity values:
> 
> root@172.25.48.208:/proc/irq/512# grep . *
> affinity_hint:00,00000001
> effective_affinity:00,00000004
> effective_affinity_list:2
> grep: mlx5_comp0@pci:0000:65:00.1: Is a directory
> node:0
> smp_affinity:ff,ffffffff
> smp_affinity_list:0-39
> spurious:count 3
> spurious:unhandled 0
> spurious:last_unhandled 0 ms
> 
> I noticed your change, a0c9259dc4e1 "irq/matrix: Spread interrupts on
> allocation", and this sounds like what we're hitting. The problem does not
> occur when booting 4.19. I haven't booted 4.15 yet, but can do so to confirm
> the above commit is what resolves this.

Might be, but in 4.15 the whole vector allocation got rewritten. One of
the reasons was the exhaustion issue. Some of that is caused by massive
over-allocation by certain device drivers. The new allocator mechanism
handles that much better.

> Since 4.14 doesn't have the matrix allocator it's not a trivial backport. I
> was wondering a) if you agree with my assessment and b) if there's any plans
> on resolving this on the 4.14 allocator? If not I can attempt to backport the
> idea to 4.14 to spread the interrupts around on allocation.

No plans. Good luck with trying to fix that in the 4.14 code. I'd
recommend switching to 4.19 LTS :)

Thanks,

	tglx

