netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [regression] Kernel panic on resume from sleep
@ 2021-03-01 22:11 Zbynek Michl
  2021-03-03  1:44 ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Zbynek Michl @ 2021-03-01 22:11 UTC (permalink / raw)
  To: netdev

Hello,

Can anybody help me with the following kernel issue?
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983595

Do I understand it correctly that the kernel crashes due to the bug in
the alx driver?

Thank you
Zbynek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-01 22:11 [regression] Kernel panic on resume from sleep Zbynek Michl
@ 2021-03-03  1:44 ` Jakub Kicinski
  2021-03-04 12:50   ` Zbynek Michl
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2021-03-03  1:44 UTC (permalink / raw)
  To: Zbynek Michl; +Cc: netdev

On Mon, 1 Mar 2021 23:11:05 +0100 Zbynek Michl wrote:
> Hello,
> 
> Can anybody help me with the following kernel issue?
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983595
> 
> Do I understand it correctly that the kernel crashes due to the bug in
> the alx driver?

Order of calls on resume looks suspicious, can you give this a try?

diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index 9b7f1af5f574..9e02f8864593 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -1894,13 +1894,16 @@ static int alx_resume(struct device *dev)
 
        if (!netif_running(alx->dev))
                return 0;
-       netif_device_attach(alx->dev);
 
        rtnl_lock();
        err = __alx_open(alx, true);
        rtnl_unlock();
+       if (err)
+               return err;
 
-       return err;
+       netif_device_attach(alx->dev);
+
+       return 0;
 }
 
 static SIMPLE_DEV_PM_OPS(alx_pm_ops, alx_suspend, alx_resume);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-03  1:44 ` Jakub Kicinski
@ 2021-03-04 12:50   ` Zbynek Michl
  2021-03-04 17:51     ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Zbynek Michl @ 2021-03-04 12:50 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev

Looks good so far, but need to wait some more time as the issue was irregular.

Do you have any explanation why the calls disorder caused the panic
just occasionally?

Also, the same (wrong) order I can see in the 3.16 kernel code, but it
has worked fine with this kernel in all cases. So what is different in
5.10?

Thanks
Zbynek

On Wed, Mar 3, 2021 at 2:44 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 1 Mar 2021 23:11:05 +0100 Zbynek Michl wrote:
> > Hello,
> >
> > Can anybody help me with the following kernel issue?
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983595
> >
> > Do I understand it correctly that the kernel crashes due to the bug in
> > the alx driver?
>
> Order of calls on resume looks suspicious, can you give this a try?
>
> diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
> index 9b7f1af5f574..9e02f8864593 100644
> --- a/drivers/net/ethernet/atheros/alx/main.c
> +++ b/drivers/net/ethernet/atheros/alx/main.c
> @@ -1894,13 +1894,16 @@ static int alx_resume(struct device *dev)
>
>         if (!netif_running(alx->dev))
>                 return 0;
> -       netif_device_attach(alx->dev);
>
>         rtnl_lock();
>         err = __alx_open(alx, true);
>         rtnl_unlock();
> +       if (err)
> +               return err;
>
> -       return err;
> +       netif_device_attach(alx->dev);
> +
> +       return 0;
>  }
>
>  static SIMPLE_DEV_PM_OPS(alx_pm_ops, alx_suspend, alx_resume);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-04 12:50   ` Zbynek Michl
@ 2021-03-04 17:51     ` Jakub Kicinski
  2021-03-05 12:50       ` Zbynek Michl
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2021-03-04 17:51 UTC (permalink / raw)
  To: Zbynek Michl; +Cc: netdev

On Thu, 4 Mar 2021 13:50:01 +0100 Zbynek Michl wrote:
> Looks good so far, but need to wait some more time as the issue was irregular.
> 
> Do you have any explanation why the calls disorder caused the panic
> just occasionally?

Depends if kernel attempts to try to send a packet before __alx_open()
finishes. You can probably make it more likely by running trafgen, iperf
or such while suspending and resuming?

> Also, the same (wrong) order I can see in the 3.16 kernel code, but it
> has worked fine with this kernel in all cases. So what is different in
> 5.10?

At some point in between those versions the driver got modified to
allocate and free the NAPI structures dynamically.

I didn't look too closely to find out if things indeed worked 100%
correctly before, but now they will reliably crash on a NULL pointer
dereference if transmission comes before open is done.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-04 17:51     ` Jakub Kicinski
@ 2021-03-05 12:50       ` Zbynek Michl
  2021-03-05 22:18         ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Zbynek Michl @ 2021-03-05 12:50 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev

On Thu, Mar 4, 2021 at 6:51 PM Jakub Kicinski <kuba@kernel.org> wrote:

> Depends if kernel attempts to try to send a packet before __alx_open()
> finishes. You can probably make it more likely by running trafgen, iperf
> or such while suspending and resuming?

I've tried "ping -f <GW>" first, but there was no effect - PC woke up
successfully.

Then I've tried TCP connection, like "wget -O /dev/null
https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.11.3.tar.xz" and
this kills the kernel on resume reliably.

So perphaps ICMP and TCP behave differently?

Anyway, I'm happy to be able to finally enforce the kernel crash this way :)

> I didn't look too closely to find out if things indeed worked 100%
> correctly before, but now they will reliably crash on a NULL pointer
> dereference if transmission comes before open is done.

I can confirm that the fix really works and I cannot enforce the
kernel crash anymore.

So can we merge the patch to the mainline, and ideally also to the 5.10?

Thanks
Zbynek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-05 12:50       ` Zbynek Michl
@ 2021-03-05 22:18         ` Jakub Kicinski
  2021-03-06 18:36           ` Zbynek Michl
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2021-03-05 22:18 UTC (permalink / raw)
  To: Zbynek Michl; +Cc: netdev

On Fri, 5 Mar 2021 13:50:28 +0100 Zbynek Michl wrote:
> On Thu, Mar 4, 2021 at 6:51 PM Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > Depends if kernel attempts to try to send a packet before __alx_open()
> > finishes. You can probably make it more likely by running trafgen, iperf
> > or such while suspending and resuming?  
> 
> I've tried "ping -f <GW>" first, but there was no effect - PC woke up
> successfully.
> 
> Then I've tried TCP connection, like "wget -O /dev/null
> https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.11.3.tar.xz" and
> this kills the kernel on resume reliably.
> 
> So perphaps ICMP and TCP behave differently?
> 
> Anyway, I'm happy to be able to finally enforce the kernel crash this way :)
> 
> > I didn't look too closely to find out if things indeed worked 100%
> > correctly before, but now they will reliably crash on a NULL pointer
> > dereference if transmission comes before open is done.  
> 
> I can confirm that the fix really works and I cannot enforce the
> kernel crash anymore.
> 
> So can we merge the patch to the mainline, and ideally also to the 5.10?

Great, I submitted the patch and marked it for stable inclusion!

It should percolate through the trees in a couple of weeks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [regression] Kernel panic on resume from sleep
  2021-03-05 22:18         ` Jakub Kicinski
@ 2021-03-06 18:36           ` Zbynek Michl
  0 siblings, 0 replies; 7+ messages in thread
From: Zbynek Michl @ 2021-03-06 18:36 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev

> Great, I submitted the patch and marked it for stable inclusion!
>
> It should percolate through the trees in a couple of weeks.

Cool. Jakub, thank you very much for your awesome assistance!

Zbynek

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-03-06 18:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-01 22:11 [regression] Kernel panic on resume from sleep Zbynek Michl
2021-03-03  1:44 ` Jakub Kicinski
2021-03-04 12:50   ` Zbynek Michl
2021-03-04 17:51     ` Jakub Kicinski
2021-03-05 12:50       ` Zbynek Michl
2021-03-05 22:18         ` Jakub Kicinski
2021-03-06 18:36           ` Zbynek Michl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).