linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
@ 2022-08-26  0:25 Kai-Heng Feng
  2022-08-26  8:13 ` Michael Chan
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Kai-Heng Feng @ 2022-08-26  0:25 UTC
  To: siva.kallam, prashant, mchan
  Cc: Kai-Heng Feng, Josef Bacik, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
reboot hang on one Dell server, so the commit was reverted.

Someone managed to collect the AER log and it's caused by MSI:
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
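
A note for readers of the log above: the aer_uncor_status, aer_uncor_mask
and aer_uncor_severity values can be decoded by hand. The sketch below is a
minimal userspace illustration in C, not kernel code. The table and function
names are made up for this note, and the bit positions are assumed to follow
the PCIe AER Uncorrectable Error Status register layout (the same positions
as the kernel's PCI_ERR_UNC_* constants).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of bit positions in the AER Uncorrectable Error Status register.
 * Illustrative table only; names are hypothetical, not kernel symbols. */
struct aer_uncor_bit {
	unsigned int bit;
	const char *name;
};

static const struct aer_uncor_bit aer_uncor_bits[] = {
	{ 4,  "Data Link Protocol Error" },
	{ 5,  "Surprise Down" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
	{ 21, "ACS Violation" },
};

/* Return the name for one bit position, or NULL if it is not in the table. */
static const char *aer_uncor_bit_name(unsigned int bit)
{
	size_t i;

	assert(bit < 32);
	for (i = 0; i < sizeof(aer_uncor_bits) / sizeof(aer_uncor_bits[0]); i++)
		if (aer_uncor_bits[i].bit == bit)
			return aer_uncor_bits[i].name;
	return NULL;
}

/* Errors that are set in the status register and not suppressed by the mask. */
static uint32_t aer_uncor_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/* Non-zero if the severity register classifies this bit as fatal. */
static int aer_uncor_is_fatal(uint32_t severity, unsigned int bit)
{
	assert(bit < 32);
	return (severity >> bit) & 1u;
}
```

With the values from the log, aer_uncor_unmasked(0x00100000, 0x00010000)
leaves only bit 20, which the table names "Unsupported Request", consistent
with the "[20] UnsupReq" line. Bit 20 is clear in aer_uncor_severity
(0x000ef030), so aer_uncor_is_fatal() returns 0 for it.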

The MSI is probably raised by incoming packets, so power down the device
and disable bus mastering to stop the traffic; a user confirmed that this
approach works.

In addition, be extra safe and cancel the reset task if it's running.

Cc: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
BugLink: https://bugs.launchpad.net/bugs/1917471
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v2:
 - Move tg3_reset_task_cancel() outside of rtnl_lock() to prevent
   deadlock.

 drivers/net/ethernet/broadcom/tg3.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index db1e9d810b416..89889d8150da1 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -18076,16 +18076,20 @@ static void tg3_shutdown(struct pci_dev *pdev)
 	struct net_device *dev = pci_get_drvdata(pdev);
 	struct tg3 *tp = netdev_priv(dev);
 
+	tg3_reset_task_cancel(tp);
+
 	rtnl_lock();
+
 	netif_device_detach(dev);
 
 	if (netif_running(dev))
 		dev_close(dev);
 
-	if (system_state == SYSTEM_POWER_OFF)
-		tg3_power_down(tp);
+	tg3_power_down(tp);
 
 	rtnl_unlock();
+
+	pci_disable_device(pdev);
 }
 
 /**
-- 
2.36.1



* Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-26  0:25 [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER Kai-Heng Feng
@ 2022-08-26  8:13 ` Michael Chan
  2022-08-26 16:18 ` Eric Dumazet
  2022-08-27  2:00 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2022-08-26  8:13 UTC
  To: Kai-Heng Feng
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan,
	Josef Bacik, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Netdev, open list

On Thu, Aug 25, 2022 at 5:25 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:

> The MSI is probably raised by incoming packets, so power down the device
> and disable bus mastering to stop the traffic, as user confirmed this
> approach works.
>
> In addition to that, be extra safe and cancel reset task if it's running.
>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
> BugLink: https://bugs.launchpad.net/bugs/1917471
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v2:
>  - Move tg3_reset_task_cancel() outside of rtnl_lock() to prevent
>    deadlock.

Looks ok to me.  Thanks.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>


* Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-26  0:25 [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER Kai-Heng Feng
  2022-08-26  8:13 ` Michael Chan
@ 2022-08-26 16:18 ` Eric Dumazet
  2022-08-26 16:38   ` Michael Chan
  2022-08-27  2:00 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2022-08-26 16:18 UTC
  To: Kai-Heng Feng
  Cc: siva.kallam, prashant, mchan, Josef Bacik, David S. Miller,
	Jakub Kicinski, Paolo Abeni, netdev, LKML

On Thu, Aug 25, 2022 at 5:25 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
> reboot hang on one Dell servers so the commit was reverted.
>
> Someone managed to collect the AER log and it's caused by MSI:
> [ 148.762067] ACPI: Preparing to enter system sleep state S5
> [ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> [ 148.803731] {1}[Hardware Error]: event severity: recoverable
> [ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
> [ 148.816088] {1}[Hardware Error]: section_type: PCIe error
> [ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 148.829026] {1}[Hardware Error]: version: 3.0
> [ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
> [ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
> [ 148.847309] {1}[Hardware Error]: slot: 0
> [ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
> [ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> [ 148.865145] {1}[Hardware Error]: class_code: 020000
> [ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
> [ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> [ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
> [ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
> [ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
> [ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
> [ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
> [ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
>
> The MSI is probably raised by incoming packets, so power down the device
> and disable bus mastering to stop the traffic, as user confirmed this
> approach works.
>
> In addition to that, be extra safe and cancel reset task if it's running.
>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
> BugLink: https://bugs.launchpad.net/bugs/1917471
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v2:
>  - Move tg3_reset_task_cancel() outside of rtnl_lock() to prevent
>    deadlock.
>

It seems tg3_reset_task_cancel() is already called while rtnl is held/owned.
Should we worry about that?

>  drivers/net/ethernet/broadcom/tg3.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index db1e9d810b416..89889d8150da1 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -18076,16 +18076,20 @@ static void tg3_shutdown(struct pci_dev *pdev)
>         struct net_device *dev = pci_get_drvdata(pdev);
>         struct tg3 *tp = netdev_priv(dev);
>
> +       tg3_reset_task_cancel(tp);
> +
>         rtnl_lock();
> +
>         netif_device_detach(dev);
>
>         if (netif_running(dev))
>                 dev_close(dev);
>
> -       if (system_state == SYSTEM_POWER_OFF)
> -               tg3_power_down(tp);
> +       tg3_power_down(tp);
>
>         rtnl_unlock();
> +
> +       pci_disable_device(pdev);
>  }
>
>  /**
> --
> 2.36.1
>


* Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-26 16:18 ` Eric Dumazet
@ 2022-08-26 16:38   ` Michael Chan
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2022-08-26 16:38 UTC
  To: Eric Dumazet
  Cc: Kai-Heng Feng, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, Josef Bacik, David S. Miller, Jakub Kicinski,
	Paolo Abeni, netdev, LKML

On Fri, Aug 26, 2022 at 9:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Aug 25, 2022 at 5:25 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> > Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
> > reboot hang on one Dell servers so the commit was reverted.
> >
> > Someone managed to collect the AER log and it's caused by MSI:
> > [ 148.762067] ACPI: Preparing to enter system sleep state S5
> > [ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > [ 148.803731] {1}[Hardware Error]: event severity: recoverable
> > [ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
> > [ 148.816088] {1}[Hardware Error]: section_type: PCIe error
> > [ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
> > [ 148.829026] {1}[Hardware Error]: version: 3.0
> > [ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
> > [ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
> > [ 148.847309] {1}[Hardware Error]: slot: 0
> > [ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
> > [ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> > [ 148.865145] {1}[Hardware Error]: class_code: 020000
> > [ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
> > [ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> > [ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
> > [ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
> > [ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
> > [ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
> > [ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
> >
> > The MSI is probably raised by incoming packets, so power down the device
> > and disable bus mastering to stop the traffic, as user confirmed this
> > approach works.
> >
> > In addition to that, be extra safe and cancel reset task if it's running.
> >
> > Cc: Josef Bacik <josef@toxicpanda.com>
> > Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
> > BugLink: https://bugs.launchpad.net/bugs/1917471
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> > v2:
> >  - Move tg3_reset_task_cancel() outside of rtnl_lock() to prevent
> >    deadlock.
> >
>
> It seems tg3_reset_task_cancel() is already called while rtnl is held/owned.
> Should we worry about that ?

In this shutdown code path, if we cancel it before rtnl_lock(), the
TG3_FLAG_RESET_TASK_PENDING flag will be cleared and we will not try
to cancel it again later when rtnl_lock() is held.

But I agree that calling tg3_reset_task_cancel() under rtnl_lock can
potentially deadlock if the reset_task is scheduled and we wait for it
to finish.  That logic should be fixed separately.
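
To illustrate the first point, here is a minimal userspace sketch in C of a
flag-guarded cancel. It is not the tg3 code: struct fake_dev and the fake_*
functions are hypothetical stand-ins, and the sketch assumes only what is
described above, namely that the cancel helper clears the pending flag and
waits for the work only if the flag was set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; this is not struct tg3. */
struct fake_dev {
	atomic_bool reset_task_pending;	/* models TG3_FLAG_RESET_TASK_PENDING */
	int sync_waits;			/* times we would have waited for the work */
};

/* Models scheduling the reset work: mark it as pending. */
static void fake_schedule_reset(struct fake_dev *d)
{
	atomic_store(&d->reset_task_pending, true);
}

/*
 * Models a flag-guarded cancel: atomically read and clear the pending flag,
 * and "wait" for the work only if the flag was set. Returns true if this
 * call was the one that did the waiting.
 */
static bool fake_reset_task_cancel(struct fake_dev *d)
{
	if (atomic_exchange(&d->reset_task_pending, false)) {
		d->sync_waits++;
		return true;
	}
	return false;
}
```

In this model, a first fake_reset_task_cancel() made before taking the lock
does the waiting, and a second call made later under the lock finds the flag
already clear and returns without waiting. The second concern, a cancel that
waits under rtnl_lock for a task that itself needs rtnl_lock, is not modelled
here.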

>
> >  drivers/net/ethernet/broadcom/tg3.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> > index db1e9d810b416..89889d8150da1 100644
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -18076,16 +18076,20 @@ static void tg3_shutdown(struct pci_dev *pdev)
> >         struct net_device *dev = pci_get_drvdata(pdev);
> >         struct tg3 *tp = netdev_priv(dev);
> >
> > +       tg3_reset_task_cancel(tp);
> > +
> >         rtnl_lock();
> > +
> >         netif_device_detach(dev);
> >
> >         if (netif_running(dev))
> >                 dev_close(dev);
> >
> > -       if (system_state == SYSTEM_POWER_OFF)
> > -               tg3_power_down(tp);
> > +       tg3_power_down(tp);
> >
> >         rtnl_unlock();
> > +
> > +       pci_disable_device(pdev);
> >  }
> >
> >  /**
> > --
> > 2.36.1
> >


* Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-26  0:25 [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER Kai-Heng Feng
  2022-08-26  8:13 ` Michael Chan
  2022-08-26 16:18 ` Eric Dumazet
@ 2022-08-27  2:00 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-08-27  2:00 UTC
  To: Kai-Heng Feng
  Cc: siva.kallam, prashant, mchan, josef, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel

Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 26 Aug 2022 08:25:30 +0800 you wrote:
> Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
> reboot hang on one Dell servers so the commit was reverted.
> 
> Someone managed to collect the AER log and it's caused by MSI:
> [ 148.762067] ACPI: Preparing to enter system sleep state S5
> [ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> [ 148.803731] {1}[Hardware Error]: event severity: recoverable
> [ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
> [ 148.816088] {1}[Hardware Error]: section_type: PCIe error
> [ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 148.829026] {1}[Hardware Error]: version: 3.0
> [ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
> [ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
> [ 148.847309] {1}[Hardware Error]: slot: 0
> [ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
> [ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> [ 148.865145] {1}[Hardware Error]: class_code: 020000
> [ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
> [ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> [ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
> [ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
> [ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
> [ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
> [ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
> [ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
> 
> [...]

Here is the summary with links:
  - [v2] tg3: Disable tg3 device on system reboot to avoid triggering AER
    https://git.kernel.org/netdev/net/c/2ca1c94ce0b6

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


