* IPMI related kernel panics since v4.19.286
@ 2023-06-19 11:11 Janne Huttunen (Nokia)
  2023-06-19 12:23 ` Greg KH
  2023-06-19 12:52 ` Corey Minyard
  0 siblings, 2 replies; 6+ messages in thread
From: Janne Huttunen (Nokia) @ 2023-06-19 11:11 UTC (permalink / raw)
  To: minyard, stable


We recently updated an internal test server from kernel v4.19.273
to v4.19.286, and since then it has triggered a kernel panic due
to a hard lockup multiple times. The lockups look, for example,
like this:

[29397.950589] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
[29397.950590] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
[29397.950591] RSP: 0000:ffff93d07f703de8 EFLAGS: 00000002
[29397.950591] RAX: 0000000000000101 RBX: ffff93cf90087220 RCX: 0000000000000006
[29397.950592] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff93cf90087220
[29397.950592] RBP: ffff93d07f703de8 R08: 0000000000000000 R09: ffff93d07f71fb70
[29397.950593] R10: ffff93d07f71fb28 R11: ffff93d07f703ee8 R12: 0000000000000002
[29397.950593] R13: ffff93cf852cde58 R14: ffff93cf852cc000 R15: 0000000000000003
[29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
[29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
[29397.950594]  </NMI>
[29397.950595]  <IRQ>
[29397.950595]  _raw_spin_lock_irqsave+0x46/0x50
[29397.950595]  set_need_watch+0x2d/0x70 [ipmi_si]
[29397.950596]  ? _raw_spin_lock_irqsave+0x25/0x50
[29397.950596]  ipmi_timeout+0x2b4/0x530 [ipmi_msghandler]
[29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
[29397.950597]  call_timer_fn+0x30/0x130
[29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
[29397.950598]  run_timer_softirq+0x1ce/0x3f0
[29397.950598]  ? ktime_get+0x40/0xa0
[29397.950598]  ? sched_clock+0x9/0x10
[29397.950599]  ? sched_clock_cpu+0x11/0xc0
[29397.950599]  __do_softirq+0x104/0x32d
[29397.950600]  ? sched_clock_cpu+0x11/0xc0
[29397.950600]  irq_exit+0x11b/0x120
[29397.950600]  smp_apic_timer_interrupt+0x79/0x140
[29397.950601]  apic_timer_interrupt+0xf/0x20
[29397.950601]  </IRQ>


And like this:

[16944.269585] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
[16944.269586] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
[16944.269587] RSP: 0018:ffff9a4e40b03dd0 EFLAGS: 00000002
[16944.269588] RAX: 0000000000580101 RBX: ffff9a4ac5089e98 RCX: ffff9a2b9574b538
[16944.269588] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a4ac5089e98
[16944.269589] RBP: ffff9a4e40b03dd0 R08: ffff9a4ac5089e58 R09: ffff9a4e40b1fb70
[16944.269589] R10: ffff9a4e40b1fb28 R11: ffff9a4e40b03ee8 R12: 0000000000000002
[16944.269590] R13: ffff9a4ac5089e98 R14: ffff9a4ac5089e58 R15: ffff9a4ac5089e54
[16944.269590]  ? native_queued_spin_lock_slowpath+0x57/0x190
[16944.269591]  ? native_queued_spin_lock_slowpath+0x57/0x190
[16944.269591]  </NMI>
[16944.269592]  <IRQ>
[16944.269592]  _raw_spin_lock_irqsave+0x46/0x50
[16944.269592]  ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
[16944.269593]  smi_event_handler+0x26c/0x650 [ipmi_si]
[16944.269593]  smi_timeout+0x46/0xc0 [ipmi_si]
[16944.269594]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
[16944.269594]  call_timer_fn+0x30/0x130
[16944.269595]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
[16944.269595]  run_timer_softirq+0x1ce/0x3f0
[16944.269595]  ? ktime_get+0x40/0xa0
[16944.269596]  ? sched_clock+0x9/0x10
[16944.269596]  ? sched_clock_cpu+0x11/0xc0
[16944.269597]  __do_softirq+0x104/0x32d
[16944.269597]  ? sched_clock_cpu+0x11/0xc0
[16944.269597]  irq_exit+0x11b/0x120
[16944.269598]  smp_apic_timer_interrupt+0x79/0x140
[16944.269598]  apic_timer_interrupt+0xf/0x20
[16944.269599]  </IRQ>


To me these look like a lock ordering violation between
"smi_info->si_lock" and "intf->xmit_msgs_lock", probably
introduced by this commit:

commit b4a34aa6dfbca67610e56ad84a3595f537c85af9
Author: Corey Minyard <cminyard@mvista.com>
Date:   Tue Oct 23 11:29:02 2018 -0500

    ipmi: Fix how the lower layers are told to watch for messages
    
    [ Upstream commit c65ea996595005be470fbfa16711deba414fd33b ]
    

In order to test the theory further I built and booted
a kernel with lockdep and this happened:

[  215.679605] kipmi0/1465 is trying to acquire lock:
[  215.684490] 00000000fc1528d3 (&(&new_smi->si_lock)->rlock){..-.}, at: set_need_watch+0x2d/0x70 [ipmi_si]
[  215.694073] 
               but task is already holding lock:
[  215.699995] 00000000e2eea01c (&(&intf->xmit_msgs_lock)->rlock){..-.}, at: smi_recv_tasklet+0x170/0x260 [ipmi_msghandler]
[  215.710966] 
               which lock already depends on the new lock.

[  215.719243] 
               the existing dependency chain (in reverse order) is:
[  215.726824] 
               -> #1 (&(&intf->xmit_msgs_lock)->rlock){..-.}:
[  215.733890]        lock_acquire+0xae/0x180
[  215.738111]        _raw_spin_lock_irqsave+0x4d/0x8a
[  215.743113]        ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
[  215.749766]        smi_event_handler+0x26c/0x660 [ipmi_si]
[  215.755378]        ipmi_thread+0x5d/0x200 [ipmi_si]
[  215.760377]        kthread+0x13c/0x160
[  215.764247]        ret_from_fork+0x24/0x50
[  215.768457] 
               -> #0 (&(&new_smi->si_lock)->rlock){..-.}:
[  215.775220]        __lock_acquire+0xa61/0xfb0
[  215.779700]        lock_acquire+0xae/0x180
[  215.783912]        _raw_spin_lock_irqsave+0x4d/0x8a
[  215.788915]        set_need_watch+0x2d/0x70 [ipmi_si]
[  215.794091]        smi_tell_to_watch.constprop.39+0x4a/0x50 [ipmi_msghandler]
[  215.801353]        smi_recv_tasklet+0xe8/0x260 [ipmi_msghandler]
[  215.807484]        tasklet_action_common.isra.14+0x83/0x1a0
[  215.813177]        tasklet_action+0x22/0x30
[  215.817479]        __do_softirq+0xd4/0x3ef
[  215.821698]        irq_exit+0x120/0x130
[  215.825649]        smp_irq_work_interrupt+0x8c/0x190
[  215.830732]        irq_work_interrupt+0xf/0x20
[  215.835296]        _raw_spin_unlock_irqrestore+0x56/0x60
[  215.840725]        ipmi_thread+0x13e/0x200 [ipmi_si]
[  215.845811]        kthread+0x13c/0x160
[  215.849683]        ret_from_fork+0x24/0x50
[  215.853894] 
               other info that might help us debug this:

[  215.862075]  Possible unsafe locking scenario:

[  215.868139]        CPU0                    CPU1
[  215.872788]        ----                    ----
[  215.877433]   lock(&(&intf->xmit_msgs_lock)->rlock);
[  215.882512]                                lock(&(&new_smi->si_lock)->rlock);
[  215.889776]                                lock(&(&intf->xmit_msgs_lock)->rlock);
[  215.897413]   lock(&(&new_smi->si_lock)->rlock);
[  215.902151] 
                *** DEADLOCK ***

[  215.908250] 2 locks held by kipmi0/1465:


This also seems to implicate the same two locks, although the
backtraces differ from those in the actual crashes.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: IPMI related kernel panics since v4.19.286
  2023-06-19 11:11 IPMI related kernel panics since v4.19.286 Janne Huttunen (Nokia)
@ 2023-06-19 12:23 ` Greg KH
  2023-06-19 13:32   ` Corey Minyard
  2023-06-19 12:52 ` Corey Minyard
  1 sibling, 1 reply; 6+ messages in thread
From: Greg KH @ 2023-06-19 12:23 UTC (permalink / raw)
  To: Janne Huttunen (Nokia); +Cc: minyard, stable

On Mon, Jun 19, 2023 at 11:11:16AM +0000, Janne Huttunen (Nokia) wrote:
> 
> We recently updated an internal test server from kernel v4.19.273
> to v4.19.286 and since then it has already multiple times triggered
> a kernel panic due to a hard lockup. The lockups look e.g. like
> this:

Does this also happen on newer 5.4 and 6.4-rc7 releases?

thanks,

greg k-h


* Re: IPMI related kernel panics since v4.19.286
  2023-06-19 11:11 IPMI related kernel panics since v4.19.286 Janne Huttunen (Nokia)
  2023-06-19 12:23 ` Greg KH
@ 2023-06-19 12:52 ` Corey Minyard
  2023-06-20  9:56   ` Janne Huttunen (Nokia)
  1 sibling, 1 reply; 6+ messages in thread
From: Corey Minyard @ 2023-06-19 12:52 UTC (permalink / raw)
  To: Janne Huttunen (Nokia); +Cc: stable

On Mon, Jun 19, 2023 at 11:11:16AM +0000, Janne Huttunen (Nokia) wrote:
> 
> We recently updated an internal test server from kernel v4.19.273
> to v4.19.286 and since then it has already multiple times triggered
> a kernel panic due to a hard lockup. The lockups look e.g. like
> this:

It looks like

  b4a34aa6d "ipmi: Fix how the lower layers are told to watch for messages"

was backported to fulfill a dependency for another backport, but there
was another change:

  e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately when not needed"

That is needed to avoid calling a lower layer function with
xmit_msgs_lock held.  It doesn't apply completely cleanly because of
other changes, but you just need to leave in the free_user_work()
function and delete the other function in the conflict.  In addition to
that, you will also need:

  383035211c79 "ipmi: move message error checking to avoid deadlock"

to fix a bug in that change.
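
For reference, applying the two fixes might look roughly like this (a sketch
only, not runnable outside a kernel tree; the branch name is made up, and the
first pick will conflict as described above and has to be resolved by hand):

```shell
# Sketch: cherry-pick the two upstream fixes onto a v4.19.286 tree.
git checkout -b ipmi-deadlock-fix v4.19.286

# "ipmi: Make the smi watcher be disabled immediately when not needed"
# Expect a conflict: keep free_user_work() and drop the other function.
git cherry-pick -x e1891cffd4c4

# "ipmi: move message error checking to avoid deadlock"
git cherry-pick -x 383035211c79
```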

Can you try this out?

Thanks,

-corey

> 
> [29397.950589] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950590] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [29397.950591] RSP: 0000:ffff93d07f703de8 EFLAGS: 00000002
> [29397.950591] RAX: 0000000000000101 RBX: ffff93cf90087220 RCX: 0000000000000006
> [29397.950592] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff93cf90087220
> [29397.950592] RBP: ffff93d07f703de8 R08: 0000000000000000 R09: ffff93d07f71fb70
> [29397.950593] R10: ffff93d07f71fb28 R11: ffff93d07f703ee8 R12: 0000000000000002
> [29397.950593] R13: ffff93cf852cde58 R14: ffff93cf852cc000 R15: 0000000000000003
> [29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594]  </NMI>
> [29397.950595]  <IRQ>
> [29397.950595]  _raw_spin_lock_irqsave+0x46/0x50
> [29397.950595]  set_need_watch+0x2d/0x70 [ipmi_si]
> [29397.950596]  ? _raw_spin_lock_irqsave+0x25/0x50
> [29397.950596]  ipmi_timeout+0x2b4/0x530 [ipmi_msghandler]
> [29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950597]  call_timer_fn+0x30/0x130
> [29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950598]  run_timer_softirq+0x1ce/0x3f0
> [29397.950598]  ? ktime_get+0x40/0xa0
> [29397.950598]  ? sched_clock+0x9/0x10
> [29397.950599]  ? sched_clock_cpu+0x11/0xc0
> [29397.950599]  __do_softirq+0x104/0x32d
> [29397.950600]  ? sched_clock_cpu+0x11/0xc0
> [29397.950600]  irq_exit+0x11b/0x120
> [29397.950600]  smp_apic_timer_interrupt+0x79/0x140
> [29397.950601]  apic_timer_interrupt+0xf/0x20
> [29397.950601]  </IRQ>
> 
> 
> And like this:
> 
> [16944.269585] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269586] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [16944.269587] RSP: 0018:ffff9a4e40b03dd0 EFLAGS: 00000002
> [16944.269588] RAX: 0000000000580101 RBX: ffff9a4ac5089e98 RCX: ffff9a2b9574b538
> [16944.269588] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a4ac5089e98
> [16944.269589] RBP: ffff9a4e40b03dd0 R08: ffff9a4ac5089e58 R09: ffff9a4e40b1fb70
> [16944.269589] R10: ffff9a4e40b1fb28 R11: ffff9a4e40b03ee8 R12: 0000000000000002
> [16944.269590] R13: ffff9a4ac5089e98 R14: ffff9a4ac5089e58 R15: ffff9a4ac5089e54
> [16944.269590]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591]  </NMI>
> [16944.269592]  <IRQ>
> [16944.269592]  _raw_spin_lock_irqsave+0x46/0x50
> [16944.269592]  ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [16944.269593]  smi_event_handler+0x26c/0x650 [ipmi_si]
> [16944.269593]  smi_timeout+0x46/0xc0 [ipmi_si]
> [16944.269594]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269594]  call_timer_fn+0x30/0x130
> [16944.269595]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269595]  run_timer_softirq+0x1ce/0x3f0
> [16944.269595]  ? ktime_get+0x40/0xa0
> [16944.269596]  ? sched_clock+0x9/0x10
> [16944.269596]  ? sched_clock_cpu+0x11/0xc0
> [16944.269597]  __do_softirq+0x104/0x32d
> [16944.269597]  ? sched_clock_cpu+0x11/0xc0
> [16944.269597]  irq_exit+0x11b/0x120
> [16944.269598]  smp_apic_timer_interrupt+0x79/0x140
> [16944.269598]  apic_timer_interrupt+0xf/0x20
> [16944.269599]  </IRQ>
> 
> 
> To me these would look like an ordering violation between
> "smi_info->si_lock" and "intf->xmit_msgs_lock", probably
> introduced by this commit:
> 
> commit b4a34aa6dfbca67610e56ad84a3595f537c85af9
> Author: Corey Minyard <cminyard@mvista.com>
> Date:   Tue Oct 23 11:29:02 2018 -0500
> 
>     ipmi: Fix how the lower layers are told to watch for messages
>     
>     [ Upstream commit c65ea996595005be470fbfa16711deba414fd33b ]
>     
> 
> In order to test the theory further I built and booted
> a kernel with lockdep and this happened:
> 
> [  215.679605] kipmi0/1465 is trying to acquire lock:
> [  215.684490] 00000000fc1528d3 (&(&new_smi->si_lock)->rlock){..-.}, at: set_need_watch+0x2d/0x70 [ipmi_si]
> [  215.694073] 
>                but task is already holding lock:
> [  215.699995] 00000000e2eea01c (&(&intf->xmit_msgs_lock)->rlock){..-.}, at: smi_recv_tasklet+0x170/0x260 [ipmi_msghandler]
> [  215.710966] 
>                which lock already depends on the new lock.
> 
> [  215.719243] 
>                the existing dependency chain (in reverse order) is:
> [  215.726824] 
>                -> #1 (&(&intf->xmit_msgs_lock)->rlock){..-.}:
> [  215.733890]        lock_acquire+0xae/0x180
> [  215.738111]        _raw_spin_lock_irqsave+0x4d/0x8a
> [  215.743113]        ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [  215.749766]        smi_event_handler+0x26c/0x660 [ipmi_si]
> [  215.755378]        ipmi_thread+0x5d/0x200 [ipmi_si]
> [  215.760377]        kthread+0x13c/0x160
> [  215.764247]        ret_from_fork+0x24/0x50
> [  215.768457] 
>                -> #0 (&(&new_smi->si_lock)->rlock){..-.}:
> [  215.775220]        __lock_acquire+0xa61/0xfb0
> [  215.779700]        lock_acquire+0xae/0x180
> [  215.783912]        _raw_spin_lock_irqsave+0x4d/0x8a
> [  215.788915]        set_need_watch+0x2d/0x70 [ipmi_si]
> [  215.794091]        smi_tell_to_watch.constprop.39+0x4a/0x50 [ipmi_msghandler]
> [  215.801353]        smi_recv_tasklet+0xe8/0x260 [ipmi_msghandler]
> [  215.807484]        tasklet_action_common.isra.14+0x83/0x1a0
> [  215.813177]        tasklet_action+0x22/0x30
> [  215.817479]        __do_softirq+0xd4/0x3ef
> [  215.821698]        irq_exit+0x120/0x130
> [  215.825649]        smp_irq_work_interrupt+0x8c/0x190
> [  215.830732]        irq_work_interrupt+0xf/0x20
> [  215.835296]        _raw_spin_unlock_irqrestore+0x56/0x60
> [  215.840725]        ipmi_thread+0x13e/0x200 [ipmi_si]
> [  215.845811]        kthread+0x13c/0x160
> [  215.849683]        ret_from_fork+0x24/0x50
> [  215.853894] 
>                other info that might help us debug this:
> 
> [  215.862075]  Possible unsafe locking scenario:
> 
> [  215.868139]        CPU0                    CPU1
> [  215.872788]        ----                    ----
> [  215.877433]   lock(&(&intf->xmit_msgs_lock)->rlock);
> [  215.882512]                                lock(&(&new_smi->si_lock)->rlock);
> [  215.889776]                                lock(&(&intf->xmit_msgs_lock)->rlock);
> [  215.897413]   lock(&(&new_smi->si_lock)->rlock);
> [  215.902151] 
>                 *** DEADLOCK ***
> 
> [  215.908250] 2 locks held by kipmi0/1465:
> 
> 
> This seems to also indicate the same locks as culprits
> although the backtraces look different from the actual
> crashes.
> 


* Re: IPMI related kernel panics since v4.19.286
  2023-06-19 12:23 ` Greg KH
@ 2023-06-19 13:32   ` Corey Minyard
  0 siblings, 0 replies; 6+ messages in thread
From: Corey Minyard @ 2023-06-19 13:32 UTC (permalink / raw)
  To: Greg KH; +Cc: Janne Huttunen (Nokia), stable

On Mon, Jun 19, 2023 at 02:23:47PM +0200, Greg KH wrote:
> On Mon, Jun 19, 2023 at 11:11:16AM +0000, Janne Huttunen (Nokia) wrote:
> > 
> > We recently updated an internal test server from kernel v4.19.273
> > to v4.19.286 and since then it has already multiple times triggered
> > a kernel panic due to a hard lockup. The lockups look e.g. like
> > this:
> 
> Does this also happen on newer 5.4 and 6.4-rc7 releases?

If I am correct, no, it will not happen there.

I believe the change that causes this issue was:

  b4a34aa6d "ipmi: Fix how the lower layers are told to watch for messages"

which was backported into 4.19.  But there was a locking change:

  e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately when not needed"

that came in 5.1.  That was required for b4a34aa6d to work correctly,
and I didn't realize that.  If Janne tests that patch (and the other fix
for it), then kernels 5.1 and later are good as they are.

Thanks,

-corey

> 
> thanks,
> 
> greg k-h


* Re: IPMI related kernel panics since v4.19.286
  2023-06-19 12:52 ` Corey Minyard
@ 2023-06-20  9:56   ` Janne Huttunen (Nokia)
  2023-06-20 11:54     ` Corey Minyard
  0 siblings, 1 reply; 6+ messages in thread
From: Janne Huttunen (Nokia) @ 2023-06-20  9:56 UTC (permalink / raw)
  To: minyard; +Cc: stable, gregkh


> It looks like
> 
>   b4a34aa6d "ipmi: Fix how the lower layers are told to watch for
> messages"
> 
> was backported to fulfill a dependency for another backport, but
> there was another change:
> 
>   e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately
> when not needed"
> 
> That is needed to avoid calling a lower layer function with
> xmit_msgs_lock held.  It doesn't apply completely cleanly because of
> other changes, but you just need to leave in the free_user_work()
> function and delete the other function in the conflict.  In addition
> to that, you will also need:
> 
>   383035211c79 "ipmi: move message error checking to avoid deadlock"
> 
> to fix a bug in that change.
> 
> Can you try this out?

Yes, sorry for the delay; I had some technical problems testing
your proposed patches. In the meantime we found out that over a
dozen of our test servers have hit the same crash, some of them
multiple times since the kernel update.

Anyway, with your proposed patches on top of 4.19.286, I could no
longer trigger the lockdep warning, even on a server that without
the fixes triggers it very reliably right after boot. I also saw
on another very similar server (without the fixes) that it took
almost 17 hours to get even the lockdep warning. Maybe some
specific BMC behavior affects this? Sadly, that rather diminishes
the value of short-duration tests, but at least there have so far
been zero lockdep warnings with the fixes applied. The actual
lockups are far too unpredictable to test reliably in any kind of
short time frame.

Anyway, looking at e1891cffd4c4, it is exactly where the issue
seems to originate, so it makes total sense to me that it fixes
the problem. I was already looking at it when you confirmed it.
Thanks also for pointing out the 383035211c79 patch; it could
easily have been missed.



* Re: IPMI related kernel panics since v4.19.286
  2023-06-20  9:56   ` Janne Huttunen (Nokia)
@ 2023-06-20 11:54     ` Corey Minyard
  0 siblings, 0 replies; 6+ messages in thread
From: Corey Minyard @ 2023-06-20 11:54 UTC (permalink / raw)
  To: Janne Huttunen (Nokia); +Cc: stable, gregkh

On Tue, Jun 20, 2023 at 09:56:50AM +0000, Janne Huttunen (Nokia) wrote:
> 
> > It looks like
> > 
> >   b4a34aa6d "ipmi: Fix how the lower layers are told to watch for
> > messages"
> > 
> > was backported to fulfill a dependency for another backport, but
> > there was another change:
> > 
> >   e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately
> > when not needed"
> > 
> > That is needed to avoid calling a lower layer function with
> > xmit_msgs_lock held.  It doesn't apply completely cleanly because of
> > other changes, but you just need to leave in the free_user_work()
> > function and delete the other function in the conflict.  In addition
> > to that, you will also need:
> > 
> >   383035211c79 "ipmi: move message error checking to avoid deadlock"
> > 
> > to fix a bug in that change.
> > 
> > Can you try this out?
> 
> Yes, sorry for the delay, had a bit of technical problems testing
> your proposed patches. In the meantime we found out that over
> a dozen of our test servers have had the same crash, some of them
> multiple times since the kernel update.

I don't consider this a delay; it was quite speedy.

> 
> Anyways, with your proposed patches on top of 4.19.286, I couldn't
> trigger the lockdep warning anymore even in a server that without
> the fixes triggers it very reliably right after the boot. I also
> saw in another very similar server (without the fixes) that it
> took almost 17 hours to get even the lockdep warning. Maybe some
> specific BMC behavior affects this or something? Sadly, that kind
> of diminishes the value of the short duration tests, but at least
> there has so far been zero lockdep warnings with the fixes applied.
> The actual lockups are then way too unpredictable to test reliably
> in any kind of short time frame.

It does depend on what you are doing to the driver, but it sounds like
you are running the same software everywhere.  I'm not sure; I've seen
timing do strange things before.

> 
> Anyways, looking at e1891cffd4c4, it's right there where the issue
> seems to originate from, so it makes total sense to me that it does
> fix it. I was already kind of looking at it when you confirmed it.
> Thanks for pointing out also the 383035211c79 patch, it might have
> been easily missed.
> 

Ok, thank you for testing.  I'll prepare a stable kernel request.

-corey


