All of lore.kernel.org
 help / color / mirror / Atom feed
* WARNING and PANIC in irq_matrix_free
@ 2018-02-20 12:07 Tariq Toukan
  2018-02-20 18:11 ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2018-02-20 12:07 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel; +Cc: Maor Gottlieb

Hi Thomas,

We started seeing new issues in our net-device daily regression tests.
They are related to patch [1] introduced in kernel 4.15-rc1.

We frequently see a warning in dmesg [2]. Repro is not consistent, we 
tried to narrow it down to a smaller run but couldn't.

In addition, sometimes (less frequent) the warning is followed by a 
panic [3].

I can share all needed details to help analyze this bug.
If you suspect specific flows, we can do an educated narrow down.

Regards,
Tariq


[1] 2f75d9e1c905 genirq: Implement bitmap matrix allocator

[2]
[ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 
irq_matrix_free+0x30/0xd0
[ 8664.891905] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core 
mlx4_en mlx4_core devlink macvlan vxlan ip6_udp_tunnel udp_tunnel 8021q 
garp mrp stp llc mst_pciconf(OE) nfsv3 nfs fscache netconsole dm_mirror 
dm_region_hash dm_log dm_mod dax kvm_intel kvm irqbypass pcspkr 
i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables 
ata_generic cirrus drm_kms_helper syscopyarea sysfillrect pata_acpi 
sysimgblt fb_sys_fops ttm drm e1000 serio_raw virtio_console i2c_core 
floppy ata_piix [last unloaded: mst_pci]
[ 8664.905117] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           OE 
4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1
[ 8664.907613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
[ 8664.912624] RSP: 0018:ffff88023fd43f70 EFLAGS: 00010002
[ 8664.915149] RAX: 0000000000026318 RBX: ffff880157a77ec0 RCX: 
0000000000000000
[ 8664.917679] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 
ffff880237038400
[ 8664.920244] RBP: ffff880237038400 R08: 00000000e8ba3c69 R09: 
0000000000000000
[ 8664.922813] R10: 00000000000003ff R11: 0000000000000ad9 R12: 
ffff88023fc40000
[ 8664.925345] R13: 0000000000000000 R14: 0000000000000001 R15: 
000000000000002b
[ 8664.927872] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) 
knlGS:0000000000000000
[ 8664.930455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8664.932996] CR2: 0000000000f2c030 CR3: 000000000220a000 CR4: 
00000000000006e0
[ 8664.935557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 8664.938051] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[ 8664.940541] Call Trace:
[ 8664.942980]  <IRQ>
[ 8664.945399]  free_moved_vector+0x4e/0x100
[ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
[ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
[ 8664.952480]  </IRQ>
[ 8664.954800] RIP: 0010:native_safe_halt+0x2/0x10
[ 8664.957052] RSP: 0018:ffffc90000ccfee0 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffffdf
[ 8664.959186] RAX: ffffffff818ab6e0 RBX: ffff880236233f00 RCX: 
0000000000000000
[ 8664.960499] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[ 8664.961774] RBP: 0000000000000005 R08: 0000000000000000 R09: 
0000000000000000
[ 8664.963048] R10: 00000000000003ff R11: 0000000000000ad9 R12: 
ffff880236233f00
[ 8664.964345] R13: ffff880236233f00 R14: 0000000000000000 R15: 
0000000000000000
[ 8664.965579]  ? __cpuidle_text_start+0x8/0x8
[ 8664.966808]  default_idle+0x18/0xf0
[ 8664.968040]  do_idle+0x150/0x1d0
[ 8664.969249]  cpu_startup_entry+0x19/0x20
[ 8664.970477]  start_secondary+0x133/0x170
[ 8664.971700]  secondary_startup_64+0xa5/0xb0
[ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 
d5 60 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 
72 0b <0f> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74
[ 8664.975420] ---[ end trace 8be4ba51cd83f4bd ]---


[3]
[ 8943.038767] BUG: unable to handle kernel paging request at 
000000037a6b561b
[ 8943.040114] IP: free_moved_vector+0x61/0x100
[ 8943.041531] PGD 0 P4D 0
[ 8943.042855] Oops: 0002 [#1] SMP PTI
[ 8943.044128] Modules linked in: bonding rdma_ucm ib_ucm rdma_cm iw_cm 
ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core mlxfw mlx4_ib ib_core 
mlx4_en mlx4_core devlink iptable_filter fuse btrfs xor zstd_decompress 
zstd_compress xxhash raid6_pq vfat msdos fat binfmt_misc bridge macvlan 
vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc mst_pciconf(OE) 
nfsv3 nfs fscache netconsole dm_mirror dm_region_hash dm_log dm_mod dax 
kvm_intel kvm irqbypass pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd 
grace sunrpc ip_tables ata_generic cirrus drm_kms_helper syscopyarea 
sysfillrect pata_acpi sysimgblt fb_sys_fops ttm drm e1000 serio_raw 
virtio_console i2c_core floppy ata_piix [last unloaded: mst_pci]
[ 8943.052038] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W  OE 
4.15.0-for-upstream-perf-2018-02-08_07-00-42-18 #1
[ 8943.053350] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
[ 8943.055940] RSP: 0018:ffff88023fd43fa0 EFLAGS: 00010007
[ 8943.057233] RAX: 000000037a6b561b RBX: ffff880157a77ec0 RCX: 
0000000000000001
[ 8943.058506] RDX: 00000000000155a8 RSI: 00000000000155a8 RDI: 
ffff880237038400
[ 8943.059784] RBP: ffff880157a77ec0 R08: 00000000e8ba3c69 R09: 
0000000000000000
[ 8943.061051] R10: 0000000000000000 R11: 0000000000000000 R12: 
000000007f0c0001
[ 8943.062462] R13: 00000000000155a8 R14: 0000000000000001 R15: 
0000000000cc620d
[ 8943.063726] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) 
knlGS:0000000000000000
[ 8943.064993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8943.066253] CR2: 000000037a6b561b CR3: 000000010badc000 CR4: 
00000000000006e0
[ 8943.067522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 8943.068771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[ 8943.070029] Call Trace:
[ 8943.071273]  <IRQ>
[ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
[ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
[ 8943.075048]  </IRQ>
[ 8943.076288] RIP: 0010:native_safe_halt+0x2/0x10
[ 8943.077530] RSP: 0018:ffffc90000ccfee0 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffffdf
[ 8943.078795] RAX: ffffffff818ab6e0 RBX: ffff880236233f00 RCX: 
0000000000000000
[ 8943.080077] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[ 8943.081435] RBP: 0000000000000005 R08: 00000000e8ba3c69 R09: 
0000000000000000
[ 8943.082683] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff880236233f00
[ 8943.083932] R13: ffff880236233f00 R14: 0000000000000000 R15: 
0000000000000000
[ 8943.085185]  ? __cpuidle_text_start+0x8/0x8
[ 8943.086438]  default_idle+0x18/0xf0
[ 8943.087694]  do_idle+0x150/0x1d0
[ 8943.088921]  cpu_startup_entry+0x19/0x20
[ 8943.090163]  start_secondary+0x133/0x170
[ 8943.091402]  secondary_startup_64+0xa5/0xb0
[ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 
89 ee e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 
18 82 <48> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48
[ 8943.095371] RIP: free_moved_vector+0x61/0x100 RSP: ffff88023fd43fa0
[ 8943.096685] CR2: 000000037a6b561b
[ 8943.098120] ---[ end trace 8be4ba51cd83f4c0 ]---
[ 8943.099387] Kernel panic - not syncing: Fatal exception in interrupt
[ 8943.101170] Kernel Offset: disabled
[ 8943.102410] ---[ end Kernel panic - not syncing: Fatal exception in 
interrupt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-20 12:07 WARNING and PANIC in irq_matrix_free Tariq Toukan
@ 2018-02-20 18:11 ` Thomas Gleixner
  2018-02-20 18:18   ` Thomas Gleixner
  2018-05-18 22:41   ` Dmitry Safonov
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2018-02-20 18:11 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: linux-kernel, Maor Gottlieb

On Tue, 20 Feb 2018, Tariq Toukan wrote:

> Hi Thomas,
> 
> We started seeing new issues in our net-device daily regression tests.
> They are related to patch [1] introduced in kernel 4.15-rc1.
> 
> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to
> narrow it down to a smaller run but couldn't.
> 
> In addition, sometimes (less frequent) the warning is followed by a panic [3].

Which might be just a consequence of the issue which triggers the warning.

> I can share all needed details to help analyze this bug.
> If you suspect specific flows, we can do an educated narrow down.

> [2]
> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 irq_matrix_free+0x30/0xd0
> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
> [ 8664.912624] RSP: 0018:ffff88023fd43f70 EFLAGS: 00010002
> [ 8664.915149] RAX: 0000000000026318 RBX: ffff880157a77ec0 RCX: 0000000000000000
> [ 8664.917679] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff880237038400
> [ 8664.920244] RBP: ffff880237038400 R08: 00000000e8ba3c69 R09: 0000000000000000
> [ 8664.922813] R10: 00000000000003ff R11: 0000000000000ad9 R12: ffff88023fc40000
> [ 8664.925345] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000002b
> [ 8664.927872] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
> [ 8664.930455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8664.932996] CR2: 0000000000f2c030 CR3: 000000000220a000 CR4: 00000000000006e0
> [ 8664.935557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8664.938051] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8664.940541] Call Trace:
> [ 8664.942980]  <IRQ>
> [ 8664.945399]  free_moved_vector+0x4e/0x100
> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8664.952480]  </IRQ>

So thats:

        if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
                return;

and the disassembly of

> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74

gives:

  1f:	44 39 6f 04          	cmp    %r13d,0x4(%rdi)
  23:	77 06                	ja     0x2b
  25:	44 3b 6f 08          	cmp    0x8(%rdi),%r13d
  29:	72 0b                	jb     0x36
  2b:*	0f ff                	(bad)  		<-- trapping instruction

So the vector to release should be in r13d, which is 0. /me scratches head.

> [ 8943.038767] BUG: unable to handle kernel paging request at 000000037a6b561b
> [ 8943.040114] IP: free_moved_vector+0x61/0x100
> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
> [ 8943.055940] RSP: 0018:ffff88023fd43fa0 EFLAGS: 00010007
> [ 8943.057233] RAX: 000000037a6b561b RBX: ffff880157a77ec0 RCX: 0000000000000001
> [ 8943.058506] RDX: 00000000000155a8 RSI: 00000000000155a8 RDI: ffff880237038400
> [ 8943.059784] RBP: ffff880157a77ec0 R08: 00000000e8ba3c69 R09: 0000000000000000
> [ 8943.061051] R10: 0000000000000000 R11: 0000000000000000 R12: 000000007f0c0001
> [ 8943.062462] R13: 00000000000155a8 R14: 0000000000000001 R15: 0000000000cc620d
> [ 8943.063726] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
> [ 8943.064993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8943.066253] CR2: 000000037a6b561b CR3: 000000010badc000 CR4: 00000000000006e0
> [ 8943.067522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8943.068771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8943.070029] Call Trace:
> [ 8943.071273]  <IRQ>
> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
> [ 8943.075048]  </IRQ>
...
> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48

 17:*	48 c7 00 00 00 00 00 	movq   $0x0,(%rax)		<-- trapping instruction

I can't tell exactly which part of free_moved_vector() that is and which
vector it tries to free, but its more or less the same crap just with a
fatal vector number.

Is there CPU hotplugging in play?

I'll come back to you tomorrow with a plan how to debug that after staring
into the code some more.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-20 18:11 ` Thomas Gleixner
@ 2018-02-20 18:18   ` Thomas Gleixner
  2018-02-21 11:27     ` Tariq Toukan
  2018-05-18 22:41   ` Dmitry Safonov
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-02-20 18:18 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: linux-kernel, Maor Gottlieb

On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
> 
> Is there CPU hotplugging in play?
> 
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

Do you have a rough idea what the test case is doing?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-20 18:18   ` Thomas Gleixner
@ 2018-02-21 11:27     ` Tariq Toukan
  2018-02-22 21:38       ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2018-02-21 11:27 UTC (permalink / raw)
  To: Thomas Gleixner, Tariq Toukan; +Cc: linux-kernel, Maor Gottlieb



On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Thomas Gleixner wrote:
>> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>>
>> Is there CPU hotplugging in play?

No.

>>
>> I'll come back to you tomorrow with a plan how to debug that after staring
>> into the code some more.
> 
> Do you have a rough idea what the test case is doing?
> 

It arbitrary appears in different flows, like sending traffic or 
interface configuration changes.

Thanks,
Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-21 11:27     ` Tariq Toukan
@ 2018-02-22 21:38       ` Thomas Gleixner
  2018-02-25  9:50         ` Tariq Toukan
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-02-22 21:38 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: linux-kernel, Maor Gottlieb

On Wed, 21 Feb 2018, Tariq Toukan wrote:
> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
> > On Tue, 20 Feb 2018, Thomas Gleixner wrote:
> > > On Tue, 20 Feb 2018, Tariq Toukan wrote:
> > > 
> > > Is there CPU hotplugging in play?
> 
> No.

Ok.

> > > 
> > > I'll come back to you tomorrow with a plan how to debug that after staring
> > > into the code some more.
> > 
> > Do you have a rough idea what the test case is doing?
> > 
> 
> It arbitrary appears in different flows, like sending traffic or interface
> configuration changes.

Hmm. Looks like memory corruption, but I can't pin point it.

Find below a debug patch which should prevent the crash and might give us
some insight into the type of corruption.

Please enable the irq_matrix and vector allocation trace points. 

echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable

When the problem triggers the bogus vector is printed and the trace is
frozen. Please provide dmesg and the tracebuffer output.

Thanks,

	tglx

8<--------------
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -822,6 +822,12 @@ static void free_moved_vector(struct api
 	unsigned int cpu = apicd->prev_cpu;
 	bool managed = apicd->is_managed;
 
+	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
+		tracing_off();
+		pr_err("Trying to clear prev_vector: %u\n", vector);
+		goto out;
+	}
+
 	/*
 	 * This should never happen. Managed interrupts are not
 	 * migrated except on CPU down, which does not involve the
@@ -833,6 +839,7 @@ static void free_moved_vector(struct api
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+out:
 	hlist_del_init(&apicd->clist);
 	apicd->prev_vector = 0;
 	apicd->move_in_progress = 0;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-22 21:38       ` Thomas Gleixner
@ 2018-02-25  9:50         ` Tariq Toukan
  0 siblings, 0 replies; 25+ messages in thread
From: Tariq Toukan @ 2018-02-25  9:50 UTC (permalink / raw)
  To: Thomas Gleixner, Tariq Toukan; +Cc: linux-kernel, Maor Gottlieb



On 22/02/2018 11:38 PM, Thomas Gleixner wrote:
> On Wed, 21 Feb 2018, Tariq Toukan wrote:
>> On 20/02/2018 8:18 PM, Thomas Gleixner wrote:
>>> On Tue, 20 Feb 2018, Thomas Gleixner wrote:
>>>> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>>>>
>>>> Is there CPU hotplugging in play?
>>
>> No.
> 
> Ok.
> 
>>>>
>>>> I'll come back to you tomorrow with a plan how to debug that after staring
>>>> into the code some more.
>>>
>>> Do you have a rough idea what the test case is doing?
>>>
>>
>> It arbitrary appears in different flows, like sending traffic or interface
>> configuration changes.
> 
> Hmm. Looks like memory corruption, but I can't pin point it.
> 
> Find below a debug patch which should prevent the crash and might give us
> some insight into the type of corruption.
> 
> Please enable the irq_matrix and vector allocation trace points.
> 
> echo 1 >/sys/kernel/debug/tracing/events/irq_matrix/enable
> echo 1 >/sys/kernel/debug/tracing/events/irq_vectors/vector*/enable
> 
> When the problem triggers the bogus vector is printed and the trace is
> frozen. Please provide dmesg and the tracebuffer output.
> 

OK, I'm temporarily adding this to the regression internal branch. I'll 
let you know once we have a repro.

Thanks,
Tariq

> Thanks,
> 
> 	tglx
> 
> 8<--------------
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -822,6 +822,12 @@ static void free_moved_vector(struct api
>   	unsigned int cpu = apicd->prev_cpu;
>   	bool managed = apicd->is_managed;
>   
> +	if (vector < FIRST_EXTERNAL_VECTOR || vector >= FIRST_SYSTEM_VECTOR) {
> +		tracing_off();
> +		pr_err("Trying to clear prev_vector: %u\n", vector);
> +		goto out;
> +	}
> +
>   	/*
>   	 * This should never happen. Managed interrupts are not
>   	 * migrated except on CPU down, which does not involve the
> @@ -833,6 +839,7 @@ static void free_moved_vector(struct api
>   	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>   	irq_matrix_free(vector_matrix, cpu, vector, managed);
>   	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
> +out:
>   	hlist_del_init(&apicd->clist);
>   	apicd->prev_vector = 0;
>   	apicd->move_in_progress = 0;
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-02-20 18:11 ` Thomas Gleixner
  2018-02-20 18:18   ` Thomas Gleixner
@ 2018-05-18 22:41   ` Dmitry Safonov
  2018-05-18 23:43     ` Dmitry Safonov
  2018-05-19 11:20     ` Thomas Gleixner
  1 sibling, 2 replies; 25+ messages in thread
From: Dmitry Safonov @ 2018-05-18 22:41 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Tariq Toukan, open list, Maor Gottlieb

Hi Thomas, Tariq,

2018-02-20 18:11 GMT+00:00 Thomas Gleixner <tglx@linutronix.de>:
> On Tue, 20 Feb 2018, Tariq Toukan wrote:
>
>> Hi Thomas,
>>
>> We started seeing new issues in our net-device daily regression tests.
>> They are related to patch [1] introduced in kernel 4.15-rc1.
>>
>> We frequently see a warning in dmesg [2]. Repro is not consistent, we tried to
>> narrow it down to a smaller run but couldn't.
>>
>> In addition, sometimes (less frequent) the warning is followed by a panic [3].
>
> Which might be just a consequence of the issue which triggers the warning.
>
>> I can share all needed details to help analyze this bug.
>> If you suspect specific flows, we can do an educated narrow down.
>
>> [2]
>> [ 8664.868564] WARNING: CPU: 5 PID: 0 at kernel/irq/matrix.c:370 irq_matrix_free+0x30/0xd0
>> [ 8664.910144] RIP: 0010:irq_matrix_free+0x30/0xd0
>> [ 8664.912624] RSP: 0018:ffff88023fd43f70 EFLAGS: 00010002
>> [ 8664.915149] RAX: 0000000000026318 RBX: ffff880157a77ec0 RCX: 0000000000000000
>> [ 8664.917679] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff880237038400
>> [ 8664.920244] RBP: ffff880237038400 R08: 00000000e8ba3c69 R09: 0000000000000000
>> [ 8664.922813] R10: 00000000000003ff R11: 0000000000000ad9 R12: ffff88023fc40000
>> [ 8664.925345] R13: 0000000000000000 R14: 0000000000000001 R15: 000000000000002b
>> [ 8664.927872] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
>> [ 8664.930455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8664.932996] CR2: 0000000000f2c030 CR3: 000000000220a000 CR4: 00000000000006e0
>> [ 8664.935557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8664.938051] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8664.940541] Call Trace:
>> [ 8664.942980]  <IRQ>
>> [ 8664.945399]  free_moved_vector+0x4e/0x100
>> [ 8664.947787]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8664.950134]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8664.952480]  </IRQ>
>
> So thats:
>
>         if (WARN_ON_ONCE(bit < m->alloc_start || bit >= m->alloc_end))
>                 return;
>
> and the disassembly of
>
>> [ 8664.972909] Code: 41 56 41 89 f6 41 55 41 89 d5 89 f2 41 54 4c 8b 24 d5 60
>> 24 18 82 55 48 89 fd 53 48 8b 47 28 44 39 6f 04 77 06 44 3b 6f 08 72 0b <0f>
>> ff 5b 5d 41 5c 41 5d 41 5e c3 49 01 c4 41 80 7c 24 0c 00 74
>
> gives:
>
>   1f:   44 39 6f 04             cmp    %r13d,0x4(%rdi)
>   23:   77 06                   ja     0x2b
>   25:   44 3b 6f 08             cmp    0x8(%rdi),%r13d
>   29:   72 0b                   jb     0x36
>   2b:*  0f ff                   (bad)           <-- trapping instruction
>
> So the vector to release should be in r13d, which is 0. /me scratches head.
>
>> [ 8943.038767] BUG: unable to handle kernel paging request at 000000037a6b561b
>> [ 8943.040114] IP: free_moved_vector+0x61/0x100
>> [ 8943.054654] RIP: 0010:free_moved_vector+0x61/0x100
>> [ 8943.055940] RSP: 0018:ffff88023fd43fa0 EFLAGS: 00010007
>> [ 8943.057233] RAX: 000000037a6b561b RBX: ffff880157a77ec0 RCX: 0000000000000001
>> [ 8943.058506] RDX: 00000000000155a8 RSI: 00000000000155a8 RDI: ffff880237038400
>> [ 8943.059784] RBP: ffff880157a77ec0 R08: 00000000e8ba3c69 R09: 0000000000000000
>> [ 8943.061051] R10: 0000000000000000 R11: 0000000000000000 R12: 000000007f0c0001
>> [ 8943.062462] R13: 00000000000155a8 R14: 0000000000000001 R15: 0000000000cc620d
>> [ 8943.063726] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
>> [ 8943.064993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8943.066253] CR2: 000000037a6b561b CR3: 000000010badc000 CR4: 00000000000006e0
>> [ 8943.067522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8943.068771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8943.070029] Call Trace:
>> [ 8943.071273]  <IRQ>
>> [ 8943.072503]  smp_irq_move_cleanup_interrupt+0x89/0x9e
>> [ 8943.073794]  irq_move_cleanup_interrupt+0x95/0xa0
>> [ 8943.075048]  </IRQ>
> ...
>> [ 8943.092659] Code: 44 00 00 48 8b 3d c8 f7 9f 01 44 89 f1 44 89 e2 44 89 ee
>> e8 e2 05 0b 00 48 c7 c0 20 50 01 00 4a 8d 04 e0 4a 03 04 ed 60 24 18 82 <48>
>> c7 00 00 00 00 00 48 8b 45 28 48 85 c0 74 20 48 8b 55 20 48
>
>  17:*   48 c7 00 00 00 00 00    movq   $0x0,(%rax)              <-- trapping instruction
>
> I can't tell exactly which part of free_moved_vector() that is and which
> vector it tries to free, but its more or less the same crap just with a
> fatal vector number.
>
> Is there CPU hotplugging in play?
>
> I'll come back to you tomorrow with a plan how to debug that after staring
> into the code some more.

I'm not entirely sure that it's the same fault, but at least backtrace
looks resembling.
I've got it in VM on the last master: v4.17-rc5-110-g2c71d338bef2
Let me know if you need .config or CPU/qemu details or anything.
I'm not sure if it'll reproduce, but might still worth to paste it here:

[root@localhost self]# [  125.240084] WARNING: CPU: 1 PID: 2198 at
kernel/irq/matrix.c:371 irq_matrix_free+0x20/0xc0
[  125.242306] Modules linked in:
[  125.242709] CPU: 1 PID: 2198 Comm: sshd Not tainted 4.17.0-rc5+ #11
[  125.242949] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1.fc26 04/01/2014
[  125.243342] RIP: 0010:irq_matrix_free+0x20/0xc0
[  125.243517] RSP: 0000:ffff926e7fd03e30 EFLAGS: 00000002
[  125.243737] RAX: 0000000000000001 RBX: ffff926e7fd24700 RCX: 0000000000000000
[  125.243993] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff926e7d108400
[  125.244249] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  125.244503] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  125.244761] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  125.245049] FS:  00007ffb4f9e74c0(0000) GS:ffff926e7fd00000(0000)
knlGS:0000000000000000
[  125.245331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  125.245540] CR2: 0000564a977e9658 CR3: 00000000717fe000 CR4: 00000000000006e0
[  125.245839] Call Trace:
[  125.246308]  <IRQ>
[  125.246571]  free_moved_vector+0x43/0x100
[  125.246765]  smp_irq_move_cleanup_interrupt+0xa4/0xb9
[  125.246959]  irq_move_cleanup_interrupt+0xc/0x20
[  125.247218] RIP: 0010:__do_softirq+0xa3/0x324
[  125.247382] RSP: 0000:ffff926e7fd03f78 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffffdf
[  125.247655] RAX: 00000000ffffffff RBX: ffff926e6ca6b300 RCX: 0000000000000020
[  125.247911] RDX: 0000000000000000 RSI: 0000000080000000 RDI: ffffffff9566211b
[  125.248156] RBP: 0000000000000000 R08: 00000000000f4200 R09: 0000001d27c0d340
[  125.248403] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  125.248650] R13: 0000000000000000 R14: 0000000000000200 R15: 0000000000000000
[  125.248929]  ? irq_move_cleanup_interrupt+0x7/0x20
[  125.249129]  ? irq_exit+0xbb/0xc0
[  125.249285]  ? __do_softirq+0x9d/0x324
[  125.249432]  ? hrtimer_interrupt+0x118/0x240
[  125.249606]  irq_exit+0xbb/0xc0
[  125.249751]  smp_apic_timer_interrupt+0x6c/0x120
[  125.249932]  apic_timer_interrupt+0xf/0x20
[  125.250122]  </IRQ>
[  125.250318] RIP: 0033:0x7ffb4cdb2cdd
[  125.250456] RSP: 002b:00007fff031177e0 EFLAGS: 00000206 ORIG_RAX:
ffffffffffffff13
[  125.250722] RAX: 0000564a975a13ee RBX: 0000564a975a13e7 RCX: 0000000000000004
[  125.250966] RDX: 0000000000000074 RSI: 0000000000000025 RDI: 0000564a975a13ee
[  125.251211] RBP: 00007fff03117d50 R08: 0000564a975a13ee R09: 00000000fffffffc
[  125.251457] R10: 0000564a98c7a4f0 R11: 0000000000000004 R12: 00007fff03117ea0
[  125.251700] R13: 0000000000000009 R14: 00007fff03117d60 R15: 00007ffb4d12f4c0
[  125.251977] Code: 98 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
89 f0 41 56 41 55 41 54 55 53 48 8b 5f 28 48 03 1c c5 20 14 6b 96 39
57 04 76 0b <0f> 0b 5b 5d 41 5c 41 5d 41 5e c3 39 57 08 76 f0 89 d0 f0
48 0f
[  125.253132] ---[ end trace 47657dc74f8ba53a ]---

What I did - was running x86 selftests and ssh'ing at the same moment.
Will try it again if it'll fire by any chance.

Thanks,
             Dmitry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-18 22:41   ` Dmitry Safonov
@ 2018-05-18 23:43     ` Dmitry Safonov
  2018-05-19 11:20     ` Thomas Gleixner
  1 sibling, 0 replies; 25+ messages in thread
From: Dmitry Safonov @ 2018-05-18 23:43 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Tariq Toukan, open list, Maor Gottlieb

2018-05-18 23:41 GMT+01:00 Dmitry Safonov <0x7f454c46@gmail.com>:
> What I did - was running x86 selftests and ssh'ing at the same moment.
> Will try it again if it'll fire by any chance.

No, I've tried a couple of times (after reboot as it's ONCE), but it looks
hard to reproduce by hands without some automation or something.

Thanks,
             Dmitry

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-18 22:41   ` Dmitry Safonov
  2018-05-18 23:43     ` Dmitry Safonov
@ 2018-05-19 11:20     ` Thomas Gleixner
  2018-05-23  7:16       ` Tariq Toukan
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-19 11:20 UTC (permalink / raw)
  To: Dmitry Safonov; +Cc: Tariq Toukan, open list, Maor Gottlieb

On Fri, 18 May 2018, Dmitry Safonov wrote:
> I'm not entirely sure that it's the same fault, but at least backtrace
> looks resembling.

Yes, it's similar, but not the same issue. I'll stare are the code ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-19 11:20     ` Thomas Gleixner
@ 2018-05-23  7:16       ` Tariq Toukan
  2018-05-23  8:49         ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2018-05-23  7:16 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Safonov; +Cc: Tariq Toukan, open list, Maor Gottlieb



On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> On Fri, 18 May 2018, Dmitry Safonov wrote:
>> I'm not entirely sure that it's the same fault, but at least backtrace
>> looks resembling.
> 
> Yes, it's similar, but not the same issue. I'll stare are the code ...
> 
> Thanks,
> 
> 	tglx
> 

We still see the issue in our daily regression runs.
I have your patch merged into my internal branch, it prints the following:

[ 4898.226258] Trying to clear prev_vector: 0
[ 4898.226439] Trying to clear prev_vector: 0

i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-23  7:16       ` Tariq Toukan
@ 2018-05-23  8:49         ` Thomas Gleixner
  2018-05-25 20:10           ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-23  8:49 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: Dmitry Safonov, open list, Maor Gottlieb

On Wed, 23 May 2018, Tariq Toukan wrote:
> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
> > On Fri, 18 May 2018, Dmitry Safonov wrote:
> > > I'm not entirely sure that it's the same fault, but at least backtrace
> > > looks resembling.
> > 
> > Yes, it's similar, but not the same issue. I'll stare are the code ...
> > 
> > Thanks,
> > 
> > 	tglx
> > 
> 
> We still see the issue in our daily regression runs.
> I have your patch merged into my internal branch, it prints the following:
> 
> [ 4898.226258] Trying to clear prev_vector: 0
> [ 4898.226439] Trying to clear prev_vector: 0
> 
> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.

Could you please enable the vector and irq matrix trace points and capture
the trace when this happens?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-23  8:49         ` Thomas Gleixner
@ 2018-05-25 20:10           ` Song Liu
  2018-05-25 21:29             ` Song Liu
  2018-05-28 10:53             ` Thomas Gleixner
  0 siblings, 2 replies; 25+ messages in thread
From: Song Liu @ 2018-05-25 20:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb, kernel-team

Hi,

We are seeing something probably related.

We run ethtool on a system with Broadcom NIC to increase number of
combined queues.


    [root@ ~]# ethtool -l eth0
    Channel parameters for eth0:
    Pre-set maximums:
    RX:             9
    TX:             8
    Other:          0
    Combined:       17
    Current hardware settings:
    RX:             0
    TX:             0
    Other:          0
    Combined:       8

    [root@ ~]# ethtool -L eth0 combined 16

The last command PANIC the kernel easily (5 out of 5 in my tests).

I haven't got luck to catch much console output, the only line I got is:

    [ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
irq_matrix_free+0x32/0xd0

The NIC we have is

    Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller

Thanks,
Song

On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 23 May 2018, Tariq Toukan wrote:
>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>> > > looks resembling.
>> >
>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>> >
>> > Thanks,
>> >
>> >     tglx
>> >
>>
>> We still see the issue in our daily regression runs.
>> I have your patch merged into my internal branch, it prints the following:
>>
>> [ 4898.226258] Trying to clear prev_vector: 0
>> [ 4898.226439] Trying to clear prev_vector: 0
>>
>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>
> Could you please enable the vector and irq matrix trace points and capture
> the trace when this happens?
>
> Thanks,
>
>         tglx
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-25 20:10           ` Song Liu
@ 2018-05-25 21:29             ` Song Liu
  2018-05-28 10:53             ` Thomas Gleixner
  1 sibling, 0 replies; 25+ messages in thread
From: Song Liu @ 2018-05-25 21:29 UTC (permalink / raw)
  To: Thomas Gleixner, michael.chan, gospo, vasundhara-v.volam
  Cc: Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb, kernel-team

Adding Broadcom developers to the thread.

Hi Michael, Vasundhara, and Andy,

Could you please help look into the case I found? ethtool crashes the system
for both net/master and net-next/master.

Thanks,
Song

On Fri, May 25, 2018 at 1:10 PM, Song Liu <liu.song.a23@gmail.com> wrote:
> Hi,
>
> We are seeing something probably related.
>
> We run ethtool on a system with Broadcom NIC to increase number of
> combined queues.
>
>
>     [root@ ~]# ethtool -l eth0
>     Channel parameters for eth0:
>     Pre-set maximums:
>     RX:             9
>     TX:             8
>     Other:          0
>     Combined:       17
>     Current hardware settings:
>     RX:             0
>     TX:             0
>     Other:          0
>     Combined:       8
>
>     [root@ ~]# ethtool -L eth0 combined 16
>
> The last command PANIC the kernel easily (5 out of 5 in my tests).
>
> I haven't got luck to catch much console output, the only line I got is:
>
>     [ 504.727865] WARNING: CPU: 10 PID: 0 at kernel/irq/matrix.c:371
> irq_matrix_free+0x32/0xd0
>
> The NIC we have is
>
>     Broadcom Limited BCM57302 NetXtreme-C 10Gb/25Gb Ethernet Controller
>
> Thanks,
> Song
>
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>> On 19/05/2018 2:20 PM, Thomas Gleixner wrote:
>>> > On Fri, 18 May 2018, Dmitry Safonov wrote:
>>> > > I'm not entirely sure that it's the same fault, but at least backtrace
>>> > > looks resembling.
>>> >
>>> > Yes, it's similar, but not the same issue. I'll stare are the code ...
>>> >
>>> > Thanks,
>>> >
>>> >     tglx
>>> >
>>>
>>> We still see the issue in our daily regression runs.
>>> I have your patch merged into my internal branch, it prints the following:
>>>
>>> [ 4898.226258] Trying to clear prev_vector: 0
>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>
>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>
>> Could you please enable the vector and irq matrix trace points and capture
>> the trace when this happens?
>>
>> Thanks,
>>
>>         tglx
>>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-25 20:10           ` Song Liu
  2018-05-25 21:29             ` Song Liu
@ 2018-05-28 10:53             ` Thomas Gleixner
  2018-05-28 11:17               ` Tariq Toukan
  2018-05-28 18:34               ` Song Liu
  1 sibling, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-28 10:53 UTC (permalink / raw)
  To: Song Liu
  Cc: Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb, kernel-team

On Fri, 25 May 2018, Song Liu wrote:
> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 23 May 2018, Tariq Toukan wrote:
> >> I have your patch merged into my internal branch, it prints the following:
> >>
> >> [ 4898.226258] Trying to clear prev_vector: 0
> >> [ 4898.226439] Trying to clear prev_vector: 0
> >>
> >> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> >
> > Could you please enable the vector and irq matrix trace points and capture
> > the trace when this happens?

Does the patch below fix it?

Thanks,

	tglx

8<-------------------
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb6f7a2148d7..54af3d4884b1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
 	 * prev_vector for this and the offlined target case.
 	 */
 	apicd->prev_vector = 0;
+	apicd->move_in_progress = false;
 	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
 		goto setnew;
 	/*

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 10:53             ` Thomas Gleixner
@ 2018-05-28 11:17               ` Tariq Toukan
  2018-05-28 14:27                 ` Thomas Gleixner
  2018-05-28 18:34               ` Song Liu
  1 sibling, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2018-05-28 11:17 UTC (permalink / raw)
  To: Thomas Gleixner, Song Liu
  Cc: Dmitry Safonov, open list, Maor Gottlieb, kernel-team



On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
> On Fri, 25 May 2018, Song Liu wrote:
>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>>> I have your patch merged into my internal branch, it prints the following:
>>>>
>>>> [ 4898.226258] Trying to clear prev_vector: 0
>>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>>
>>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>>
>>> Could you please enable the vector and irq matrix trace points and capture
>>> the trace when this happens?
> 
> Does the patch below fix it?
> 
> Thanks,
> 
> 	tglx
> 
> 8<-------------------
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index bb6f7a2148d7..54af3d4884b1 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
>   	 * prev_vector for this and the offlined target case.
>   	 */
>   	apicd->prev_vector = 0;
> +	apicd->move_in_progress = false;
>   	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>   		goto setnew;
>   	/*
> 

I took it into my internal branch. Will let you know.

Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 11:17               ` Tariq Toukan
@ 2018-05-28 14:27                 ` Thomas Gleixner
  2018-05-28 18:36                   ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-28 14:27 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Song Liu, Dmitry Safonov, open list, Maor Gottlieb, kernel-team

On Mon, 28 May 2018, Tariq Toukan wrote:
> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
> > On Fri, 25 May 2018, Song Liu wrote:
> > > On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de>
> > > wrote:
> > > > On Wed, 23 May 2018, Tariq Toukan wrote:
> > > > > I have your patch merged into my internal branch, it prints the
> > > > > following:
> > > > > 
> > > > > [ 4898.226258] Trying to clear prev_vector: 0
> > > > > [ 4898.226439] Trying to clear prev_vector: 0
> > > > > 
> > > > > i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
> > > > 
> > > > Could you please enable the vector and irq matrix trace points and
> > > > capture
> > > > the trace when this happens?
> > 
> > Does the patch below fix it?
> > 
> > Thanks,
> > 
> > 	tglx
> > 
> > 8<-------------------
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index bb6f7a2148d7..54af3d4884b1 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
> > unsigned int newvec,
> >   	 * prev_vector for this and the offlined target case.
> >   	 */
> >   	apicd->prev_vector = 0;
> > +	apicd->move_in_progress = false;
> >   	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
> >   		goto setnew;
> >   	/*
> > 
> 
> I took it into my internal branch. Will let you know.

Note, I'd still like to see a trace w/o the patch which shows which order
of events leads to this. Even if the patch cures it, it might just paper
over it and not fixing the root cause.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 10:53             ` Thomas Gleixner
  2018-05-28 11:17               ` Tariq Toukan
@ 2018-05-28 18:34               ` Song Liu
  2018-05-28 20:09                 ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Song Liu @ 2018-05-28 18:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team



> On May 28, 2018, at 3:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, 25 May 2018, Song Liu wrote:
>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>>> I have your patch merged into my internal branch, it prints the following:
>>>> 
>>>> [ 4898.226258] Trying to clear prev_vector: 0
>>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>> 
>>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>> 
>>> Could you please enable the vector and irq matrix trace points and capture
>>> the trace when this happens?
> 
> Does the patch below fix it?
> 
> Thanks,
> 
> 	tglx
> 
> 8<-------------------
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index bb6f7a2148d7..54af3d4884b1 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
> 	 * prev_vector for this and the offlined target case.
> 	 */
> 	apicd->prev_vector = 0;
> +	apicd->move_in_progress = false;
> 	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
> 		goto setnew;
> 	/*

This doesn't fix the issue with bnxt. Here is a trace with this patch:

[  569.222495] WARNING: CPU: 20 PID: 0 at kernel/irq/matrix.c:373 irq_matrix_free+0x32/0xd0
[  569.238811] BUG: unable to handle kernel
[  569.238811] NULL pointer dereference
[  569.238812]  at 0000000000000000
[  569.238812] IP: bnxt_poll+0x163/0x830
[  569.238812] PGD 0
[  569.238812] P4D 0
[  569.238813] Oops: 0002 [#1] SMP PTI
[  569.238813] Modules linked in:
[  569.238813]  nfsv3
[  569.238814]  nfs
[  569.238814]  fscache
[  569.238814]  ip6table_raw
[  569.238814]  ip6table_mangle
[  569.238815]  iptable_raw
[  569.238815]  iptable_mangle
[  569.238815]  ip6table_filter
[  569.238816]  xt_NFLOG
[  569.238816]  xt_comment
[  569.238816]  xt_statistic
[  569.238816]  iptable_filter
[  569.238817]  nfnetlink_log
[  569.238817]  tcp_diag
[  569.238817]  inet_diag
[  569.238817]  sb_edac
[  569.238818]  x86_pkg_temp_thermal
[  569.238818]  intel_powerclamp
[  569.238818]  coretemp
[  569.238818]  kvm_intel
[  569.238818]  kvm
[  569.238819]  irqbypass
[  569.238819]  iTCO_wdt
[  569.238819]  iTCO_vendor_support
[  569.238819]  lpc_ich
[  569.238819]  efivars
[  569.238820]  mfd_core
[  569.238820]  i2c_i801
[  569.238820]  ipmi_si
[  569.238820]  ipmi_devintf
[  569.238820]  ipmi_msghandler
[  569.238821]  button
[  569.238821]  acpi_cpufreq
[  569.238821]  sch_fq_codel
[  569.238821]  nfsd
[  569.238821]  nfs_acl
[  569.238822]  lockd
[  569.238822]  auth_rpcgss
[  569.238822]  oid_registry
[  569.238822]  grace
[  569.238822]  sunrpc
[  569.238823]  fuse
[  569.238823]  loop
[  569.238823]  efivarfs
[  569.238823]  autofs4
[  569.238824] CPU: 20 PID: 0 Comm: swapper/20 Not tainted 4.16.0-00391-g3742c6a #813
[  569.238824] Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B12 08/17/2017
[  569.238824] RIP: 0010:bnxt_poll+0x163/0x830
[  569.238825] RSP: 0018:ffff883ffef83b18 EFLAGS: 00010006
[  569.238825] RAX: 000000002c000000 RBX: 0000000000000000 RCX: 0000000000000000
[  569.238825] RDX: 0000000000000000 RSI: ffff881fdc53c000 RDI: ffff881f55a1da80
[  569.238826] RBP: ffff883ffef83b60 R08: 0000000000000011 R09: 000000004fbf4c8d
[  569.238826] R10: ffff883ffef83ab0 R11: ffff883fec6eee02 R12: ffff881ff6be2780
[  569.238826] R13: 0000000000000000 R14: ffff881f55a1da80 R15: 0000000000000000
[  569.238827] FS:  0000000000000000(0000) GS:ffff883ffef80000(0000) knlGS:0000000000000000
[  569.238827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  569.238827] CR2: 0000000000000000 CR3: 000000000220a003 CR4: 00000000003606e0
[  569.238827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  569.238828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  569.238828] Call Trace:
[  569.238828]  <IRQ>
[  569.238828]  ? __skb_tx_hash+0x94/0xb0
[  569.238829]  netpoll_poll_dev+0xc5/0x1a0
[  569.238829]  netpoll_send_skb_on_dev+0x12c/0x200
[  569.238829]  netpoll_send_udp+0x2d5/0x410
[  569.238829]  write_ext_msg+0x1e7/0x200
[  569.238830]  ? scnprintf+0x3a/0x70
[  569.238830]  console_unlock+0x35c/0x530
[  569.238830]  vprintk_emit+0x225/0x340
[  569.238830]  ? irq_matrix_free+0x32/0xd0
[  569.238831]  vprintk_default+0x1f/0x30
[  569.238831]  vprintk_func+0x35/0x70
[  569.238831]  printk+0x43/0x4b
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  __warn+0x6f/0x150
[  569.238832]  ? irq_matrix_free+0x32/0xd0
[  569.238832]  report_bug+0x83/0xe0
[  569.238833]  do_invalid_op+0x2c/0x70
[  569.238833]  invalid_op+0x1b/0x40
[  569.238833] RIP: 0010:irq_matrix_free+0x32/0xd0
[  569.238834] RSP: 0018:ffff883ffef83f58 EFLAGS: 00010002
[  569.238834] RAX: 0000000000000014 RBX: 0000000000014140 RCX: 0000000000000000
[  569.238835] RDX: 0000000000000000 RSI: 0000000000000014 RDI: ffff881fff028c00
[  569.238835] RBP: ffff883ffef83fc0 R08: ffff883ffefa1c40 R09: 0000000000000000
[  569.238836] R10: 0000000000000000 R11: 0000000000000000 R12: ffff883ffefa4e00
[  569.238836] R13: 0000000000000000 R14: 0000000000000014 R15: ffff881fff028c00
[  569.238836]  ? free_moved_vector+0x60/0x110
[  569.238836]  smp_irq_move_cleanup_interrupt+0x91/0xa9
[  569.238837]  irq_move_cleanup_interrupt+0xc/0x20
[  569.238837]  </IRQ>
[  569.238837] RIP: 0010:cpuidle_enter_state+0xb6/0x2c0
[  569.238837] RSP: 0018:ffffc9000c5d7e80 EFLAGS: 00000246
[  569.238838]  ORIG_RAX: ffffffffffffffdf
[  569.238838] RAX: ffff883ffefa1080 RBX: ffffe8ffff780d00 RCX: 000000000000001f
[  569.238838] RDX: 20c49ba5e353f7cf RSI: 0000000000000004 RDI: 0000000000000000
[  569.238838] RBP: ffffc9000c5d7eb8 R08: fffa7b7805587a32 R09: ffffffff822ec140
[  569.238839] R10: ffffc9000c5d7e60 R11: 0000000000004817 R12: 0000000000000004
[  569.238839] R13: 0000000000000004 R14: 0000000000000014 R15: 0000008487066592
[  569.238839]  cpuidle_enter+0x17/0x20
[  569.238839]  call_cpuidle+0x2d/0x40
[  569.238840]  do_idle+0x109/0x1a0
[  569.238840]  cpu_startup_entry+0x1d/0x20
[  569.238840]  start_secondary+0x10e/0x120
[  569.238840]  secondary_startup_64+0xa5/0xb0
[  569.238841] Code:
[  569.238841] 89
[  569.238841] 02
[  569.238841] 41
[  569.238841] f6
[  569.238842] 44
[  569.238842] 24
[  569.238842] 36
[  569.238842] 40
[  569.238842] 74
[  569.238842] 02
[  569.238843] 89
[  569.238843] 02
[  569.238843] 8b
[  569.238843] 44
[  569.238843] 24
[  569.238843] 1c
[  569.238844] 49
[  569.238844] 8b
[  569.238844] 96
[  569.238844] c8
[  569.238844] 00
[  569.238844] 00
[  569.238845] 00
[  569.238845] 41
[  569.238845] 89
[  569.238845] 86
[  569.238845] c0
[  569.238845] 00
[  569.238846] 00
[  569.238846] 00
[  569.238846] 41
[  569.238846] 23
[  569.238846] 84
[  569.238846] 24
[  569.238847] d4
[  569.238847] 00
[  569.238847] 00
[  569.238847] 00
[  569.238847] 0d
[  569.238847] 00
[  569.238848] 00
[  569.238848] 00
[  569.238848] 2c
[  569.238848] <89>
[  569.238848] 02
[  569.238848] 85
[  569.238849] db
[  569.238849] 0f
[  569.238849] 85
[  569.238849] e7
[  569.238849] 00
[  569.238849] 00
[  569.238850] 00
[  569.238850] f6
[  569.238850] 44
[  569.238850] 24
[  569.238850] 1b
[  569.238850] 01
[  569.238851] 74
[  569.238851] 40
[  569.238851] 49
[  569.238851] 8b
[  569.238851] 96
[  569.238851] a8
[  569.238852] RIP: bnxt_poll+0x163/0x830 RSP: ffff883ffef83b18
[  569.238852] CR2: 0000000000000000

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 14:27                 ` Thomas Gleixner
@ 2018-05-28 18:36                   ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2018-05-28 18:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tariq Toukan, Song Liu, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team



> On May 28, 2018, at 7:27 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, 28 May 2018, Tariq Toukan wrote:
>> On 28/05/2018 1:53 PM, Thomas Gleixner wrote:
>>> On Fri, 25 May 2018, Song Liu wrote:
>>>> On Wed, May 23, 2018 at 1:49 AM, Thomas Gleixner <tglx@linutronix.de>
>>>> wrote:
>>>>> On Wed, 23 May 2018, Tariq Toukan wrote:
>>>>>> I have your patch merged into my internal branch, it prints the
>>>>>> following:
>>>>>> 
>>>>>> [ 4898.226258] Trying to clear prev_vector: 0
>>>>>> [ 4898.226439] Trying to clear prev_vector: 0
>>>>>> 
>>>>>> i.e. vector(0) is lower than FIRST_EXTERNAL_VECTOR.
>>>>> 
>>>>> Could you please enable the vector and irq matrix trace points and
>>>>> capture
>>>>> the trace when this happens?
>>> 
>>> Does the patch below fix it?
>>> 
>>> Thanks,
>>> 
>>> 	tglx
>>> 
>>> 8<-------------------
>>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>>> index bb6f7a2148d7..54af3d4884b1 100644
>>> --- a/arch/x86/kernel/apic/vector.c
>>> +++ b/arch/x86/kernel/apic/vector.c
>>> @@ -148,6 +148,7 @@ static void apic_update_vector(struct irq_data *irqd,
>>> unsigned int newvec,
>>>  	 * prev_vector for this and the offlined target case.
>>>  	 */
>>>  	apicd->prev_vector = 0;
>>> +	apicd->move_in_progress = false;
>>>  	if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
>>>  		goto setnew;
>>>  	/*
>>> 
>> 
>> I took it into my internal branch. Will let you know.
> 
> Note, I'd still like to see a trace w/o the patch which shows which order
> of events leads to this. Even if the patch cures it, it might just paper
> over it and not fixing the root cause.
> 
> Thanks,
> 
> 	tglx

I just checked trace without the patch. It looks the same as the one with
the patch. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 18:34               ` Song Liu
@ 2018-05-28 20:09                 ` Thomas Gleixner
       [not found]                   ` <3F47F523-64C5-422B-B9B0-73B8D105CF71@fb.com>
  2018-06-04  7:56                   ` Dou Liyang
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-28 20:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team

On Mon, 28 May 2018, Song Liu wrote:
> This doesn't fix the issue with bnxt. Here is a trace with this patch:

That's just the backtrace which is not really helpful. The real question is
what leads to this scenaria.

What I was asking for is to enable the irq_vector and irq_matrix trace
points along with ftrace_dump_on_oops. The latter dumps the trace buffer
when the problem happens and that should give us a hint what actually
causes that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
       [not found]                   ` <3F47F523-64C5-422B-B9B0-73B8D105CF71@fb.com>
@ 2018-05-29  8:35                     ` Thomas Gleixner
  2018-05-29 16:54                       ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-29  8:35 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team

On Mon, 28 May 2018, Song Liu wrote:
> > On May 28, 2018, at 1:09 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Mon, 28 May 2018, Song Liu wrote:
> >> This doesn't fix the issue with bnxt. Here is a trace with this patch:
> >
> > That's just the backtrace which is not really helpful. The real question is
> > what leads to this scenaria.
> >
> > What I was asking for is to enable the irq_vector and irq_matrix trace
> > points along with ftrace_dump_on_oops. The latter dumps the trace buffer
> > when the problem happens and that should give us a hint what actually
> > causes that.
> 
> Attached is the dump I got. It seems pretty noisy, with many message lost.
> 
> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.

Right. Sorry, I forgot to say that we only need the irq_vectors ones which
are related to vector allocation, i.e.: irq_vectors/vector_* 

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-29  8:35                     ` Thomas Gleixner
@ 2018-05-29 16:54                       ` Song Liu
  2018-05-30 21:56                         ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2018-05-29 16:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team



> On May 29, 2018, at 1:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, 28 May 2018, Song Liu wrote:
>>> On May 28, 2018, at 1:09 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Mon, 28 May 2018, Song Liu wrote:
>>>> This doesn't fix the issue with bnxt. Here is a trace with this patch:
>>> 
>>> That's just the backtrace which is not really helpful. The real question is
>>> what leads to this scenaria.
>>> 
>>> What I was asking for is to enable the irq_vector and irq_matrix trace
>>> points along with ftrace_dump_on_oops. The latter dumps the trace buffer
>>> when the problem happens and that should give us a hint what actually
>>> causes that.
>> 
>> Attached is the dump I got. It seems pretty noisy, with many message lost.
>> 
>> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> 
> Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> are related to vector allocation, i.e.: irq_vectors/vector_* 
> 
> Thanks,
> 
> 	tglx

Here is the ftrace dump:

[ 1609.472697]    <...>-53363  19d... 1610359225us : vector_deactivate: irq=29 is_managed=0 can_reserve=1 reserve=0
[ 1609.472698]    <...>-53363  19d... 1610359229us : vector_clear: irq=29 vector=33 cpu=21 prev_vector=0 prev_cpu=0
[ 1609.472699]    <...>-53363  19d... 1610359230us : irq_matrix_free: bit=33 cpu=21 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11239, global_rsvd=23, total_alloc=17
[ 1609.472699]    <...>-53363  19d... 1610359231us : irq_matrix_reserve: online_maps=56 global_avl=11239, global_rsvd=24, total_alloc=17
[ 1609.472700]    <...>-53363  19d... 1610359232us : vector_reserve: irq=29 ret=0
[ 1609.472700]    <...>-53363  19d... 1610359233us : vector_config: irq=29 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472701]    <...>-53363  19d... 1610359242us : vector_deactivate: irq=30 is_managed=0 can_reserve=1 reserve=0
[ 1609.472701]    <...>-53363  19d... 1610359242us : vector_clear: irq=30 vector=33 cpu=11 prev_vector=0 prev_cpu=1
[ 1609.472701]    <...>-53363  19d... 1610359242us : irq_matrix_free: bit=33 cpu=11 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=24, total_alloc=16
[ 1609.472702]    <...>-53363  19d... 1610359243us : irq_matrix_reserve: online_maps=56 global_avl=11240, global_rsvd=25, total_alloc=16
[ 1609.472702]    <...>-53363  19d... 1610359243us : vector_reserve: irq=30 ret=0
[ 1609.472703]    <...>-53363  19d... 1610359243us : vector_config: irq=30 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472703]    <...>-53363  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0
[ 1609.472703]    <...>-53363  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2
[ 1609.472704]    <...>-53363  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15
[ 1609.472704]    <...>-53363  19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
[ 1609.472705]    <...>-53363  19d... 1610359249us : vector_reserve: irq=31 ret=0
[ 1609.472705]    <...>-53363  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472705]    <...>-53363  19d... 1610359255us : vector_deactivate: irq=32 is_managed=0 can_reserve=1 reserve=0
[ 1609.472706]    <...>-53363  19d... 1610359255us : vector_clear: irq=32 vector=33 cpu=9 prev_vector=0 prev_cpu=3
[ 1609.472707]    <...>-53363  19d... 1610359255us : irq_matrix_free: bit=33 cpu=9 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11242, global_rsvd=26, total_alloc=14
[ 1609.472707]    <...>-53363  19d... 1610359256us : irq_matrix_reserve: online_maps=56 global_avl=11242, global_rsvd=27, total_alloc=14
[ 1609.472708]    <...>-53363  19d... 1610359256us : vector_reserve: irq=32 ret=0
[ 1609.472708]    <...>-53363  19d... 1610359256us : vector_config: irq=32 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472709]    <...>-53363  19d... 1610359261us : vector_deactivate: irq=33 is_managed=0 can_reserve=1 reserve=0
[ 1609.472709]    <...>-53363  19d... 1610359261us : vector_clear: irq=33 vector=33 cpu=18 prev_vector=0 prev_cpu=4
[ 1609.472710]    <...>-53363  19d... 1610359262us : irq_matrix_free: bit=33 cpu=18 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11243, global_rsvd=27, total_alloc=13
[ 1609.472710]    <...>-53363  19d... 1610359262us : irq_matrix_reserve: online_maps=56 global_avl=11243, global_rsvd=28, total_alloc=13
[ 1609.472711]    <...>-53363  19d... 1610359262us : vector_reserve: irq=33 ret=0
[ 1609.472711]    <...>-53363  19d... 1610359262us : vector_config: irq=33 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472711]    <...>-53363  19d... 1610359268us : vector_deactivate: irq=34 is_managed=0 can_reserve=1 reserve=0
[ 1609.472712]    <...>-53363  19d... 1610359268us : vector_clear: irq=34 vector=34 cpu=7 prev_vector=0 prev_cpu=5
[ 1609.472712]    <...>-53363  19d... 1610359268us : irq_matrix_free: bit=34 cpu=7 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11244, global_rsvd=28, total_alloc=12
[ 1609.472713]    <...>-53363  19d... 1610359268us : irq_matrix_reserve: online_maps=56 global_avl=11244, global_rsvd=29, total_alloc=12
[ 1609.472713]    <...>-53363  19d... 1610359269us : vector_reserve: irq=34 ret=0
[ 1609.472713]    <...>-53363  19d... 1610359269us : vector_config: irq=34 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472714]    <...>-53363  19d... 1610359279us : vector_deactivate: irq=35 is_managed=0 can_reserve=1 reserve=0
[ 1609.472714]    <...>-53363  19d... 1610359279us : vector_clear: irq=35 vector=33 cpu=26 prev_vector=0 prev_cpu=6
[ 1609.472715]    <...>-53363  19d... 1610359279us : irq_matrix_free: bit=33 cpu=26 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11245, global_rsvd=29, total_alloc=11
[ 1609.472715]    <...>-53363  19d... 1610359279us : irq_matrix_reserve: online_maps=56 global_avl=11245, global_rsvd=30, total_alloc=11
[ 1609.472715]    <...>-53363  19d... 1610359280us : vector_reserve: irq=35 ret=0
[ 1609.472716]    <...>-53363  19d... 1610359280us : vector_config: irq=35 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472716]    <...>-53363  19d... 1610359285us : vector_deactivate: irq=36 is_managed=0 can_reserve=1 reserve=0
[ 1609.472717]    <...>-53363  19d... 1610359285us : vector_clear: irq=36 vector=33 cpu=10 prev_vector=0 prev_cpu=7
[ 1609.472717]    <...>-53363  19d... 1610359286us : irq_matrix_free: bit=33 cpu=10 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11246, global_rsvd=30, total_alloc=10
[ 1609.472718]    <...>-53363  19d... 1610359286us : irq_matrix_reserve: online_maps=56 global_avl=11246, global_rsvd=31, total_alloc=10
[ 1609.472718]    <...>-53363  19d... 1610359286us : vector_reserve: irq=36 ret=0
[ 1609.472718]    <...>-53363  19d... 1610359286us : vector_config: irq=36 vector=239 cpu=0 apicdest=0x00000000
[ 1609.472719]    <...>-53363  20d... 1610366617us : vector_activate: irq=29 is_managed=0 can_reserve=1 reserve=0
[ 1609.472719]    <...>-53363  20d... 1610366619us : vector_alloc: irq=29 vector=4294967268 reserved=1 ret=0
[ 1609.472719]    <...>-53363  20d... 1610366621us : irq_matrix_alloc: bit=33 cpu=7 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11245, global_rsvd=30, total_alloc=11
[ 1609.472720]    <...>-53363  20d... 1610366623us : vector_update: irq=29 vector=33 cpu=7 prev_vector=0 prev_cpu=21
[ 1609.472720]    <...>-53363  20d... 1610366623us : vector_alloc: irq=29 vector=33 reserved=1 ret=0
[ 1609.472721]    <...>-53363  20d... 1610366623us : vector_config: irq=29 vector=33 cpu=7 apicdest=0x00000010
[ 1609.472721]    <...>-53363  20d... 1610366629us : irq_matrix_alloc: bit=33 cpu=21 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11244, global_rsvd=30, total_alloc=12
[ 1609.472722]    <...>-53363  20d... 1610366629us : vector_update: irq=29 vector=33 cpu=21 prev_vector=33 prev_cpu=7
[ 1609.472723]    <...>-53363  20d... 1610366629us : vector_alloc: irq=29 vector=33 reserved=1 ret=0
[ 1609.472723]    <...>-53363  20d... 1610366630us : vector_config: irq=29 vector=33 cpu=21 apicdest=0x00000030
[ 1609.472724]    <...>-53363  20d... 1610366637us : irq_matrix_alloc: bit=35 cpu=0 online=1 avl=197 alloc=4 managed=0 online_maps=56 global_avl=11243, global_rsvd=30, total_alloc=13
[ 1609.472724]    <...>-53363  20d... 1610366637us : vector_update: irq=29 vector=35 cpu=0 prev_vector=33 prev_cpu=21
[ 1609.472725]    <...>-53363  20d... 1610366637us : vector_alloc: irq=29 vector=35 reserved=1 ret=0
[ 1609.472725]    <...>-53363  20d... 1610366638us : vector_config: irq=29 vector=35 cpu=0 apicdest=0x00000000
[ 1609.472725]    <...>-53363  20dN.. 1610366643us : vector_activate: irq=30 is_managed=0 can_reserve=1 reserve=0
[ 1609.472726]    <...>-53363  20dN.. 1610366644us : irq_matrix_alloc: bit=33 cpu=11 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11242, global_rsvd=29, total_alloc=14
[ 1609.472726]    <...>-53363  20dN.. 1610366644us : vector_update: irq=30 vector=33 cpu=11 prev_vector=0 prev_cpu=11
[ 1609.472727]    <...>-53363  20dN.. 1610366644us : vector_alloc: irq=30 vector=33 reserved=1 ret=0
[ 1609.472727]    <...>-53363  20dN.. 1610366644us : vector_config: irq=30 vector=33 cpu=11 apicdest=0x00000018
[ 1609.472727]    <...>-53363  20dN.. 1610366647us : vector_config: irq=30 vector=33 cpu=11 apicdest=0x00000018
[ 1609.472728]    <...>-53363  20dN.. 1610366650us : irq_matrix_alloc: bit=34 cpu=1 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11241, global_rsvd=29, total_alloc=15
[ 1609.472728]    <...>-53363  20dN.. 1610366651us : vector_update: irq=30 vector=34 cpu=1 prev_vector=33 prev_cpu=11
[ 1609.472729]    <...>-53363  20dN.. 1610366651us : vector_alloc: irq=30 vector=34 reserved=1 ret=0
[ 1609.472729]    <...>-53363  20dN.. 1610366651us : vector_config: irq=30 vector=34 cpu=1 apicdest=0x00000002
[ 1609.472729]    <...>-53363  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 reserve=0
[ 1609.472730]    <...>-53363  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
[ 1609.472730]    <...>-53363  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
[ 1609.472731]    <...>-53363  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 prev_cpu=20
[ 1609.472731]    <...>-53363  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
[ 1609.472731]    <...>-53363  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 apicdest=0x00000014
[ 1609.472732]   <idle>-0       7d.h. 1610366658us : vector_free_moved: irq=29 cpu=21 vector=33 is_managed=0
[ 1609.472732]   <idle>-0       7d.h. 1610366660us : irq_matrix_free: bit=33 cpu=21 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=28, total_alloc=15
[ 1609.472733]    <...>-53363  20dN.. 1610366662us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
[ 1609.472733]    <...>-53363  20dN.. 1610366662us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 prev_cpu=9
[ 1609.472734]    <...>-53363  20dN.. 1610366662us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
[ 1609.472734]    <...>-53363  20dN.. 1610366662us : vector_config: irq=31 vector=33 cpu=20 apicdest=0x0000002c
[ 1609.472735]    <...>-53363  20dN.. 1610366666us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, total_alloc=17
[ 1609.472735]    <...>-53363  20dN.. 1610366666us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 prev_cpu=20
[ 1609.472736]    <...>-53363  20dN.. 1610366666us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
[ 1609.472736]    <...>-53363  20dN.. 1610366666us : vector_config: irq=31 vector=34 cpu=2 apicdest=0x00000004
[ 1609.472737]    <...>-53363  20dNh. 1610366669us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
[ 1609.472737]    <...>-53363  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
[ 1609.472738]    <...>-53363  20dN.. 1610366671us : vector_activate: irq=32 is_managed=0 can_reserve=1 reserve=0
[ 1609.472739]    <...>-53363  20dN.. 1610366671us : irq_matrix_alloc: bit=34 cpu=9 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=27, total_alloc=17
[ 1609.472739]    <...>-53363  20dN.. 1610366671us : vector_update: irq=32 vector=34 cpu=9 prev_vector=0 prev_cpu=9
[ 1609.472739]    <...>-53363  20dN.. 1610366671us : vector_alloc: irq=32 vector=34 reserved=1 ret=0
[ 1609.472740]    <...>-53363  20dN.. 1610366672us : vector_config: irq=32 vector=34 cpu=9 apicdest=0x00000014
[ 1609.472740]    <...>-53363  20dN.. 1610366674us : vector_config: irq=32 vector=34 cpu=9 apicdest=0x00000014
[ 1609.472741]    <...>-53363  20dN.. 1610366678us : irq_matrix_alloc: bit=34 cpu=3 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11238, global_rsvd=27, total_alloc=18
[ 1609.472741]    <...>-53363  20dN.. 1610366678us : vector_update: irq=32 vector=34 cpu=3 prev_vector=34 prev_cpu=9
[ 1609.472741]    <...>-53363  20dN.. 1610366679us : vector_alloc: irq=32 vector=34 reserved=1 ret=0
[ 1609.472742]    <...>-53363  20dN.. 1610366679us : vector_config: irq=32 vector=34 cpu=3 apicdest=0x00000006
[ 1609.472742]   <idle>-0      11d.h. 1610366680us : vector_free_moved: irq=30 cpu=11 vector=33 is_managed=0
[ 1609.472743]   <idle>-0      11d.h. 1610366681us : irq_matrix_free: bit=33 cpu=11 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11239, global_rsvd=27, total_alloc=17
[ 1609.472743]    <...>-53363  20dN.. 1610366683us : vector_activate: irq=33 is_managed=0 can_reserve=1 reserve=0
[ 1609.472743]    <...>-53363  20dN.. 1610366683us : vector_alloc: irq=33 vector=4294967268 reserved=1 ret=0
[ 1609.472744]    <...>-53363  20dN.. 1610366684us : irq_matrix_alloc: bit=33 cpu=10 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11238, global_rsvd=26, total_alloc=18
[ 1609.472744]    <...>-53363  20dN.. 1610366684us : vector_update: irq=33 vector=33 cpu=10 prev_vector=0 prev_cpu=18
[ 1609.472745]    <...>-53363  20dN.. 1610366684us : vector_alloc: irq=33 vector=33 reserved=1 ret=0
[ 1609.472745]    <...>-53363  20dN.. 1610366685us : vector_config: irq=33 vector=33 cpu=10 apicdest=0x00000016
[ 1609.472745]    <...>-53363  20dN.. 1610366688us : irq_matrix_alloc: bit=33 cpu=18 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11237, global_rsvd=26, total_alloc=19
[ 1609.472746]    <...>-53363  20dN.. 1610366688us : vector_update: irq=33 vector=33 cpu=18 prev_vector=33 prev_cpu=10
[ 1609.472746]    <...>-53363  20dN.. 1610366688us : vector_alloc: irq=33 vector=33 reserved=1 ret=0
[ 1609.472747]    <...>-53363  20dN.. 1610366688us : vector_config: irq=33 vector=33 cpu=18 apicdest=0x00000028
[ 1609.472747]    <...>-53363  20dN.. 1610366692us : irq_matrix_alloc: bit=34 cpu=4 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11236, global_rsvd=26, total_alloc=20
[ 1609.472747]    <...>-53363  20dN.. 1610366692us : vector_update: irq=33 vector=34 cpu=4 prev_vector=33 prev_cpu=18
[ 1609.472748]    <...>-53363  20dN.. 1610366692us : vector_alloc: irq=33 vector=34 reserved=1 ret=0
[ 1609.472748]    <...>-53363  20dN.. 1610366693us : vector_config: irq=33 vector=34 cpu=4 apicdest=0x00000008
[ 1609.472749]   <idle>-0       9d.h. 1610366694us : vector_free_moved: irq=32 cpu=9 vector=34 is_managed=0
[ 1609.472749]   <idle>-0       9d.h. 1610366695us : irq_matrix_free: bit=34 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11237, global_rsvd=26, total_alloc=19
[ 1609.472750]   <idle>-0       9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0


Thanks,
Song

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-29 16:54                       ` Song Liu
@ 2018-05-30 21:56                         ` Thomas Gleixner
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2018-05-30 21:56 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team

Song,

On Tue, 29 May 2018, Song Liu wrote:
> > On May 29, 2018, at 1:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Maybe we cannot enable all trace points under irq_vectors/ and irq_matrix.
> > 
> > Right. Sorry, I forgot to say that we only need the irq_vectors ones which
> > are related to vector allocation, i.e.: irq_vectors/vector_*
>
> Here is the ftrace dump:

Thanks for providing the data!

>  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 can_reserve=1 reserve=0
>  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 prev_vector=0 prev_cpu=2
>  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, global_rsvd=25, total_alloc=15

Here IRQ 31 is shutdown and the vector freed.

>  19d... 1610359249us : irq_matrix_reserve: online_maps=56 global_avl=11241, global_rsvd=26, total_alloc=15
>  19d... 1610359249us : vector_reserve: irq=31 ret=0
>  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 apicdest=0x00000000

And set to the magic reservation vector 239 to catch spurious interrupts.

>  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 can_reserve=1 reserve=0
>  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 reserved=1 ret=0
>  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
>  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 prev_vector=0 prev_cpu=20
>  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 apicdest=0x00000014

So here it gets initialized again and targets CPU9 now.

>  20dN.. 1610366662us : irq_matrix_alloc: bit=33 cpu=20 online=1 avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16
>  20dN.. 1610366662us : vector_update: irq=31 vector=33 cpu=20 prev_vector=33 prev_cpu=9
>  20dN.. 1610366662us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  20dN.. 1610366662us : vector_config: irq=31 vector=33 cpu=20 apicdest=0x0000002c

Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
CPU9 for cleanup.

>  20dN.. 1610366666us : irq_matrix_alloc: bit=34 cpu=2 online=1 avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, global_rsvd=28, total_alloc=17
>  20dN.. 1610366666us : vector_update: irq=31 vector=34 cpu=2 prev_vector=33 prev_cpu=20
>  20dN.. 1610366666us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
>  20dN.. 1610366666us : vector_config: irq=31 vector=34 cpu=2 apicdest=0x00000004

So here the shit hits the fan because that update schedules vector 33 on
CPU20 for cleanup while the previous cleanup for CPU9 has not been done
yet. Cute. or not so cute....

>  20dNh. 1610366669us : vector_free_moved: irq=31 cpu=20 vector=33 is_managed=0
>  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, global_rsvd=28, total_alloc=16

And frees the CPU 20 vector

>   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 is_managed=0

And then CPU9 claims that it's queued for cleanup. Bah.

I'm still working on a fix as the elegant solution refused to work.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-05-28 20:09                 ` Thomas Gleixner
       [not found]                   ` <3F47F523-64C5-422B-B9B0-73B8D105CF71@fb.com>
@ 2018-06-04  7:56                   ` Dou Liyang
  2018-06-04 11:17                     ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Dou Liyang @ 2018-06-04  7:56 UTC (permalink / raw)
  To: Thomas Gleixner, Song Liu
  Cc: Song Liu, Tariq Toukan, Dmitry Safonov, open list, Maor Gottlieb,
	Kernel Team

Hi Thomas,

Sorry to ask the questions at this series, my mailbox was kicked out of
the mailing list a few days ago, and didn't receive the e-mail.

please see below

At 05/29/2018 04:09 AM, Thomas Gleixner wrote:
> On Mon, 28 May 2018, Song Liu wrote:
>> This doesn't fix the issue with bnxt. Here is a trace with this patch:

[...]
> 
> Thanks for providing the data!
> 
>  >  19d... 1610359248us : vector_deactivate: irq=31 is_managed=0 
> can_reserve=1 reserve=0
>  >  19d... 1610359248us : vector_clear: irq=31 vector=33 cpu=20 
> prev_vector=0 prev_cpu=2
>  >  19d... 1610359249us : irq_matrix_free: bit=33 cpu=20 online=1 
> avl=201 alloc=0 managed=0 online_maps=56 global_avl=11241, 
> global_rsvd=25, total_alloc=15
> 
> Here IRQ 31 is shutdown and the vector freed.
> 
>  >  19d... 1610359249us : irq_matrix_reserve: online_maps=56 
> global_avl=11241, global_rsvd=26, total_alloc=15
>  >  19d... 1610359249us : vector_reserve: irq=31 ret=0
>  >  19d... 1610359249us : vector_config: irq=31 vector=239 cpu=0 
> apicdest=0x00000000
> 
> And set to the magic reservation vector 239 to catch spurious interrupts.
> 
>  >  20dN.. 1610366654us : vector_activate: irq=31 is_managed=0 
> can_reserve=1 reserve=0
>  >  20dN.. 1610366654us : vector_alloc: irq=31 vector=4294967268 
> reserved=1 ret=0
>  >  20dN.. 1610366655us : irq_matrix_alloc: bit=33 cpu=9 online=1 
> avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
> global_rsvd=28, total_alloc=16
>  >  20dN.. 1610366655us : vector_update: irq=31 vector=33 cpu=9 
> prev_vector=0 prev_cpu=20
   ^^^^^^^^^^^^
    this means there is no associated previous vector.

>  >  20dN.. 1610366656us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  >  20dN.. 1610366656us : vector_config: irq=31 vector=33 cpu=9 
> apicdest=0x00000014
> 
> So here it gets initialized again and targets CPU9 now.
> 
>  >  20dN.. 1610366662us : irq_matrix_alloc: bit=33 cpu=20 online=1 
> avl=200 alloc=1 managed=0 online_maps=56 global_avl=11240, 
> global_rsvd=28, total_alloc=16
>  >  20dN.. 1610366662us : vector_update: irq=31 vector=33 cpu=20 
> prev_vector=33 prev_cpu=9
>  >  20dN.. 1610366662us : vector_alloc: irq=31 vector=33 reserved=1 ret=0
>  >  20dN.. 1610366662us : vector_config: irq=31 vector=33 cpu=20 
> apicdest=0x0000002c
> 
> Here it is reconfigured to CPU 20. Now that update schedules vector 33 on
> CPU9 for cleanup.
> 
>  >  20dN.. 1610366666us : irq_matrix_alloc: bit=34 cpu=2 online=1 
> avl=199 alloc=2 managed=0 online_maps=56 global_avl=11239, 
> global_rsvd=28, total_alloc=17
>  >  20dN.. 1610366666us : vector_update: irq=31 vector=34 cpu=2 
> prev_vector=33 prev_cpu=20
>  >  20dN.. 1610366666us : vector_alloc: irq=31 vector=34 reserved=1 ret=0
>  >  20dN.. 1610366666us : vector_config: irq=31 vector=34 cpu=2 
> apicdest=0x00000004
> 
> So here the shit hits the fan because that update schedules vector 33 on
> CPU20 for cleanup while the previous cleanup for CPU9 has not been done
> yet. Cute. or not so cute....
> 
>  >  20dNh. 1610366669us : vector_free_moved: irq=31 cpu=20 vector=33 
> is_managed=0
>  >  20dNh. 1610366670us : irq_matrix_free: bit=33 cpu=20 online=1 
> avl=201 alloc=0 managed=0 online_maps=56 global_avl=11240, 
> global_rsvd=28, total_alloc=16
> 
> And frees the CPU 20 vector
> 
>  >   9d.h. 1610366696us : vector_free_moved: irq=31 cpu=20 vector=0 
> is_managed=0
> 

Here, why didn't we avoid this cleanup by

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a75de0792942..0cc59646755f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data 
*apicd)
          */
         WARN_ON_ONCE(managed);

+       if (!vector)
+               return;
+
         trace_vector_free_moved(apicd->irq, cpu, vector, managed);
         irq_matrix_free(vector_matrix, cpu, vector, managed);
         per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

Is there something I didn't consider with? ;-)

Thanks,
	dou.

> And then CPU9 claims that it's queued for cleanup. Bah.
> 
> I'm still working on a fix as the elegant solution refused to work.
> 
> Thanks,
> 
>      tglx

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-06-04  7:56                   ` Dou Liyang
@ 2018-06-04 11:17                     ` Thomas Gleixner
  2018-06-04 11:59                       ` Dou Liyang
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2018-06-04 11:17 UTC (permalink / raw)
  To: Dou Liyang
  Cc: Song Liu, Song Liu, Tariq Toukan, Dmitry Safonov, open list,
	Maor Gottlieb, Kernel Team

On Mon, 4 Jun 2018, Dou Liyang wrote:
> Here, why didn't we avoid this cleanup by
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index a75de0792942..0cc59646755f 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
> *apicd)
>          */
>         WARN_ON_ONCE(managed);
> 
> +       if (!vector)
> +               return;
> +
>         trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>         irq_matrix_free(vector_matrix, cpu, vector, managed);
>         per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
> 
> Is there something I didn't consider with? ;-)

Well, that just prevents the warning, but the hlist is already
corrupted. So you'd just cure the symptom ...

I'm about to send a patch series which addresses that. Just need to finish
writing changelogs.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: WARNING and PANIC in irq_matrix_free
  2018-06-04 11:17                     ` Thomas Gleixner
@ 2018-06-04 11:59                       ` Dou Liyang
  0 siblings, 0 replies; 25+ messages in thread
From: Dou Liyang @ 2018-06-04 11:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, Song Liu, Tariq Toukan, Dmitry Safonov, open list,
	Maor Gottlieb, Kernel Team

Hi Thomas,

At 06/04/2018 07:17 PM, Thomas Gleixner wrote:
> On Mon, 4 Jun 2018, Dou Liyang wrote:
>> Here, why didn't we avoid this cleanup by
>>
>> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
>> index a75de0792942..0cc59646755f 100644
>> --- a/arch/x86/kernel/apic/vector.c
>> +++ b/arch/x86/kernel/apic/vector.c
>> @@ -821,6 +821,9 @@ static void free_moved_vector(struct apic_chip_data
>> *apicd)
>>           */
>>          WARN_ON_ONCE(managed);
>>
>> +       if (!vector)
>> +               return;
>> +
>>          trace_vector_free_moved(apicd->irq, cpu, vector, managed);
>>          irq_matrix_free(vector_matrix, cpu, vector, managed);
>>          per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
>>
>> Is there something I didn't consider with? ;-)
> 
> Well, that just prevents the warning, but the hlist is already
> corrupted. So you'd just cure the symptom ...
> 

I see.

> I'm about to send a patch series which addresses that. Just need to finish
> writing changelogs.
> 

Thank you for telling me that.

Thanks,
	dou

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-06-04 11:59 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-20 12:07 WARNING and PANIC in irq_matrix_free Tariq Toukan
2018-02-20 18:11 ` Thomas Gleixner
2018-02-20 18:18   ` Thomas Gleixner
2018-02-21 11:27     ` Tariq Toukan
2018-02-22 21:38       ` Thomas Gleixner
2018-02-25  9:50         ` Tariq Toukan
2018-05-18 22:41   ` Dmitry Safonov
2018-05-18 23:43     ` Dmitry Safonov
2018-05-19 11:20     ` Thomas Gleixner
2018-05-23  7:16       ` Tariq Toukan
2018-05-23  8:49         ` Thomas Gleixner
2018-05-25 20:10           ` Song Liu
2018-05-25 21:29             ` Song Liu
2018-05-28 10:53             ` Thomas Gleixner
2018-05-28 11:17               ` Tariq Toukan
2018-05-28 14:27                 ` Thomas Gleixner
2018-05-28 18:36                   ` Song Liu
2018-05-28 18:34               ` Song Liu
2018-05-28 20:09                 ` Thomas Gleixner
     [not found]                   ` <3F47F523-64C5-422B-B9B0-73B8D105CF71@fb.com>
2018-05-29  8:35                     ` Thomas Gleixner
2018-05-29 16:54                       ` Song Liu
2018-05-30 21:56                         ` Thomas Gleixner
2018-06-04  7:56                   ` Dou Liyang
2018-06-04 11:17                     ` Thomas Gleixner
2018-06-04 11:59                       ` Dou Liyang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.