* Panic in sfc module on boot since 5.10
@ 2021-04-15  9:03 Trevor Hemsley
  2021-04-15 12:44 ` Edward Cree
  0 siblings, 1 reply; 2+ messages in thread
From: Trevor Hemsley @ 2021-04-15  9:03 UTC (permalink / raw)
  To: netdev

Hi,

I run Fedora 32 and since kernels in the 5.10 series I have been unable 
to boot without getting a panic in the sfc module. I tried on 5.11.12 
tonight and the crash still occurs. I have tried reporting this via 
Fedora channels but the silence has been deafening and I suspect this is 
an upstream issue anyway.

I'm running an Asus X570-Pro with a 3700X processor, 64GB ECC RAM and 
various NVMe/SATA disks. I have a dual-port Solarflare SFN6122F PCIe 
card installed that shows up in lspci as:

0b:00.0 Ethernet controller [0200]: Solarflare Communications SFC9020 10G Ethernet Controller [1924:0803]
0b:00.1 Ethernet controller [0200]: Solarflare Communications SFC9020 10G Ethernet Controller [1924:0803]

I have attached jpegs of the crash on the Fedora bugzilla entry 
https://bugzilla.redhat.com/show_bug.cgi?id=1924982 but since I figure 
many here won't want to download a 2.5MB attachment from a slow bugzilla 
I'll try to transcribe the relevant bits here:

BUG: kernel NULL pointer dereference, address: 0000000000000104
#PF: supervisor write access in kernel mode
#PF: error_code(0x8002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
CPU: 0 PID: 1067 Comm: rngd Not tainted 5.11.12-100.fc32.x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3405 02/01/2021
RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c
RSP: 0000:ffff9e04c0003e78 EFLAGS: 000010246
RAX: 0000000000000001 RBX: ffff89548a9b5000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8954832c0940
RBP: 000000000000001e R08: ffff9e04c0003f50 R09: ffff89636ea2b140
R10: 0000000000000000 R11: ffffffffb9a060c0 R12: 00000000000000
R13: ffff8954832c0940 R14: ffff8954832c0940 R15: ffff89548a9b5480
FS: 00007ff835b31700(0000) GS:ffff89636ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000000050833
CR2: 0000000000000104 CR3: 000000011c41a000 CR4: 0000000000350ef0
Call Trace:
<IRQ>
? trigger_load_balance+0x5a/0x220
efx_poll+0xcb/0x380 [sfc]
net_rx_action+0x136/0x400
__do_softirq+0xcf/0x20f
asm_call_irq_on_stack+0x12/0x20
</IRQ>
do_softirq_own_stack+0x37/0x40
__irq_exit_rcu+0xbf/0x100
common_interrupt+0x74/0x130
? asm_common_interrupt+0x8/0x40
asm_common_interrupt+0x1e/0x40
RIP: 0033:0x7ff836732b00

I won't guarantee there are no typos in that lot since the picture is a 
bit fuzzy and so are my eyes after all that. You can find the original 
on the referenced bz above.
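
The "Oops: 0002" line and the two "#PF:" lines above say the same thing in different forms. As a minimal sketch, assuming only the architectural meaning of the low three x86 page-fault error-code bits, the value can be decoded like this (the names `pf_info` and `decode_pf` are made up for illustration and are not kernel identifiers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Low three architectural x86 page-fault error-code bits. */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* 0: read access,      1: write access */
#define PF_USER  (1u << 2)  /* 0: supervisor mode,  1: user mode */

/* Hypothetical helper type, for illustration only. */
struct pf_info {
    bool not_present;
    bool write;
    bool supervisor;
};

/* Decode the low bits of an error code such as the 0002 in "Oops: 0002". */
static struct pf_info decode_pf(uint32_t error_code)
{
    struct pf_info info = {
        .not_present = !(error_code & PF_PROT),
        .write       = (error_code & PF_WRITE) != 0,
        .supervisor  = !(error_code & PF_USER),
    };
    return info;
}
```

For 0x0002 this gives a supervisor-mode write to a not-present page, which matches the "#PF:" lines in the transcription.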

No problems on 5.9.16 which is the last pre-5.10 kernel available for 
F32. Everything I've tried since 5.10 goes pop.

In case it helps, this is what sfcboot reports for one of the cards (the 
other is the same):

enp11s0f0np0:
   Boot image                            Option ROM only
     Link speed                          Negotiated automatically
     Link-up delay time                  5 seconds
     Banner delay time                   2 seconds
     Boot skip delay time                5 seconds
     Boot type                           Disabled
   PF MSI-X interrupt limit              512
   SR-IOV                                Disabled
   Virtual Functions on each PF          127
   VF MSI-X interrupt limit              1

and sfupdate:

enp11s0f0np0 - MAC: 00-0x-xx-0x-xx-xx (intentionally obscured)
     Firmware version:   v7.6.9
     Controller type:    Solarflare SFC9000 family
     Controller version: v3.3.2.1000
     Boot ROM version:   v5.2.2.1004

Just prior to the crash I get a pair of messages that don't look 
particularly right but I get these on 5.9.16 too and that survives.

[    9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
[    9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

I'm not subscribed to the list so I'd be grateful for a cc on any 
replies or if I'm on entirely the wrong mailing list, feel free to let 
me know that too! I can supply any more information that would be useful 
to get this fixed.

Thanks,

Trevor


* Re: Panic in sfc module on boot since 5.10
  2021-04-15  9:03 Panic in sfc module on boot since 5.10 Trevor Hemsley
@ 2021-04-15 12:44 ` Edward Cree
  0 siblings, 0 replies; 2+ messages in thread
From: Edward Cree @ 2021-04-15 12:44 UTC (permalink / raw)
  To: Trevor Hemsley; +Cc: Network Development

On 15/04/2021 10:03, Trevor Hemsley wrote:
> Hi,
> 
> I run Fedora 32 and since kernels in the 5.10 series I have been unable to boot without getting a panic in the sfc module. I tried on 5.11.12 tonight and the crash still occurs. I have tried reporting this via Fedora channels but the silence has been deafening
Seems Red Hat couldn't even be bothered to forward it to us :sigh:

> and I suspect this is an upstream issue anyway.
You could try building an upstream kernel and driver, and attempting to
 reproduce the issue there.  That would remove some of the unknowns.

> BUG: kernel NULL pointer dereference, address: 0000000000000104

> RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
> Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c
Hmm, I think this is actually <f0> 0f b1 8a 04 01 00 00 85...
 which decodes as lock cmpxchg %ecx,0x104(%rdx)
With other transcription errors fixed, the key sequence appears to be
    mov $0x1,%eax
    xor %ecx,%ecx
    lock cmpxchg %ecx,0x104(%rdx)
So we're saying "if (rdx[0x104] == 1) rdx[0x104] = 0", only atomically.
I'd *guess* this is the atomic_cmpxchg() in efx_farch_handle_tx_flush_done()
 (though it'd be nice to have your sfc.ko, with debugging symbols, to
 check for certain).
Since the faulting address is exactly 0x104 and RDX is 0 in the register
 dump, that in turn tells us tx_queue is NULL; this is suspicious
 because the relevant commits
    a81dcd85a7c1 ("sfc: assign TXQs without gaps")
    12804793b17c ("sfc: decouple TXQ type from label")
 happened at about the right time to cause this regression.
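
To make the reasoning above concrete, here is a userspace sketch in C11 of the two pieces: a compare-and-exchange that implements "if (*v == 1) *v = 0" atomically, and the way a NULL structure pointer plus a member offset yields a fault address of 0x104. Everything in it is an assumption for illustration: `fake_tx_queue` is not the driver's real structure, its padding exists only to place the counter at offset 0x104, and C11 `atomic_compare_exchange_strong()` stands in for the kernel's atomic_cmpxchg().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's TX queue structure. The real
 * layout differs; the padding only puts the counter at offset 0x104. */
struct fake_tx_queue {
    char pad[0x104];
    atomic_int flush_outstanding;
};

/* Address the CPU would touch if the queue pointer were NULL. */
static size_t null_deref_address(void)
{
    return offsetof(struct fake_tx_queue, flush_outstanding);
}

/* "if (*v == 1) *v = 0", atomically. Returns the value that was seen,
 * which is what cmpxchg leaves in %eax. */
static int clear_if_one(atomic_int *v)
{
    int expected = 1;

    /* On failure, 'expected' is overwritten with the current value. */
    atomic_compare_exchange_strong(v, &expected, 0);
    return expected;
}
```

With a NULL `struct fake_tx_queue *`, taking the address of `flush_outstanding` gives 0x104, the address reported in the BUG line; the write fails there because nothing is mapped at that address.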
So now I have to go off and figure out exactly what the semantics
 of this TX flush done event's 'subdata' field are... looks like it
 probably corresponds to tx_queue->queue from
 efx_farch_flush_tx_queue().
Unfortunately, there is no simple lookup to convert from qid to
 tx_queue, because we just allocate queues as-needed in
 efx_set_channels() and don't store the reverse mapping (everything
 else works by label rather than queue, so doesn't need it).
I think the right fix is probably just to have
 efx_farch_handle_tx_flush_done() (and presumably also
 efx_farch_handle_rx_flush_done()) iterate over all queues and perform
 the handling on any queue whose qid matches.  It might be enough to
 iterate over only the queues on the channel that received the event,
 but it's possible the events are always delivered to channel 0 rather
 than to the channel that owns the queue.
I will follow up with a patch, hopefully some time next week if I can
 find a 6122F to test with.
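
The lookup proposed above, a linear search that matches on qid because there is no reverse mapping, could be sketched roughly as follows. This is not the eventual patch: the structures, the `in_use` flag, the fixed per-channel queue count and the function name are all hypothetical simplifications and are not taken from the sfc driver.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the driver's structures. */
#define MAX_TXQ_PER_CHANNEL 4

struct sketch_tx_queue {
    unsigned int queue;  /* hardware queue id (qid) */
    int in_use;          /* nonzero if this slot was allocated */
};

struct sketch_channel {
    struct sketch_tx_queue txq[MAX_TXQ_PER_CHANNEL];
};

/* Search every queue on every channel for a matching qid.
 * Returns NULL when nothing matches; callers must check for that
 * instead of dereferencing the result unconditionally. */
static struct sketch_tx_queue *
find_txq_by_qid(struct sketch_channel *channels, size_t n_channels,
                unsigned int qid)
{
    for (size_t c = 0; c < n_channels; c++) {
        for (size_t i = 0; i < MAX_TXQ_PER_CHANNEL; i++) {
            struct sketch_tx_queue *txq = &channels[c].txq[i];

            if (txq->in_use && txq->queue == qid)
                return txq;
        }
    }
    return NULL;
}
```

Searching all channels rather than only the one that received the event sidesteps the open question of which channel flush events are delivered to, at the cost of a longer scan on what should be a rare path.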

> Just prior to the crash I get a pair of messages that don't look particularly right but I get these on 5.9.16 too and that survives.
> 
> [    9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
> [    9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

0x2a is MC_CMD_SET_LINK, which gets called in a variety of situations
 like MTU change, link advertising change (e.g. ethtool -s), and SFP+
 module hotplug.  An -EINVAL failure typically means we've asked for
 some combination of link modes that is unsupported or nonsensical; to
 investigate this further you could try with the mcdi_logging_default=1
 module parameter, which will log all MC commands and responses at
 KERN_INFO; these can then be decoded by reference to mcdi_pcol.h.
In any case this seems to be unrelated to the above issue.

-ed
