linux-kernel.vger.kernel.org archive mirror
* [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
@ 2022-08-25  6:38 Kai-Heng Feng
  2022-08-25  6:43 ` Kai-Heng Feng
  2022-08-25 17:44 ` Jakub Kicinski
  0 siblings, 2 replies; 4+ messages in thread
From: Kai-Heng Feng @ 2022-08-25  6:38 UTC (permalink / raw)
  To: siva.kallam, prashant, mchan
  Cc: Kai-Heng Feng, Josef Bacik, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
reboot hang on a Dell server, so the commit was reverted.

Someone managed to collect the AER log, which indicates that the error is
caused by an MSI:
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
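
As a reading aid for the log above, here is a minimal sketch of how the
status word and the first TLP header DWORD can be decoded. It is plain
user-space C, not kernel code; the helper names (aer_first_unmasked,
tlp_decode_dw0, tlp_is_mem_write32) are made up for illustration. The bit
numbers follow the PCIe AER Uncorrectable Error Status register layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Names for bits of the PCIe AER Uncorrectable Error Status register. */
struct aer_bit_name {
	unsigned int bit;
	const char *name;
};

static const struct aer_bit_name aer_uncor_names[] = {
	{  4, "Data Link Protocol Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/*
 * Hypothetical helper: name of the lowest status bit that is set and not
 * masked, or NULL if no such bit is in the table above.
 */
static const char *aer_first_unmasked(uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask;
	size_t i;

	for (i = 0; i < sizeof(aer_uncor_names) / sizeof(aer_uncor_names[0]); i++)
		if (active & (UINT32_C(1) << aer_uncor_names[i].bit))
			return aer_uncor_names[i].name;
	return NULL;
}

/* The fields of the first TLP header DWORD that are used here. */
struct tlp_dw0 {
	unsigned int fmt;	/* bits 31:29 */
	unsigned int type;	/* bits 28:24 */
	unsigned int length;	/* bits 9:0, payload length in DWORDs */
};

/* Hypothetical helper: split the first header DWORD into its fields. */
static struct tlp_dw0 tlp_decode_dw0(uint32_t dw0)
{
	struct tlp_dw0 d;

	d.fmt = (dw0 >> 29) & 0x7;
	d.type = (dw0 >> 24) & 0x1f;
	d.length = dw0 & 0x3ff;
	return d;
}

/* Fmt 010b (3DW header, with data) plus Type 00000b is a 32-bit memory write. */
static int tlp_is_mem_write32(uint32_t dw0)
{
	struct tlp_dw0 d = tlp_decode_dw0(dw0);

	return d.fmt == 0x2 && d.type == 0x0;
}
```

With the values from the log, aer_first_unmasked(0x00100000, 0x00010000)
selects bit 20, which matches the "[20] UnsupReq" line, and 0x40000001
decodes as a 32-bit memory write of one DWORD. That is consistent with an
MSI, which is delivered as a memory write, but it does not prove one.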

The MSI is probably raised by incoming packets, so power down the device
and disable bus mastering to stop the traffic. A user confirmed that this
approach works.

In addition, to be extra safe, cancel the reset task if it is running.

Cc: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
BugLink: https://bugs.launchpad.net/bugs/1917471
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index db1e9d810b416..4fe9e2539eac1 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -18077,15 +18077,18 @@ static void tg3_shutdown(struct pci_dev *pdev)
 	struct tg3 *tp = netdev_priv(dev);
 
 	rtnl_lock();
+
+	tg3_reset_task_cancel(tp);
 	netif_device_detach(dev);
 
 	if (netif_running(dev))
 		dev_close(dev);
 
-	if (system_state == SYSTEM_POWER_OFF)
-		tg3_power_down(tp);
+	tg3_power_down(tp);
 
 	rtnl_unlock();
+
+	pci_disable_device(pdev);
 }
 
 /**
-- 
2.36.1



* Re: [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-25  6:38 [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER Kai-Heng Feng
@ 2022-08-25  6:43 ` Kai-Heng Feng
  2022-08-25 17:44 ` Jakub Kicinski
  1 sibling, 0 replies; 4+ messages in thread
From: Kai-Heng Feng @ 2022-08-25  6:43 UTC (permalink / raw)
  To: Josef Bacik
  Cc: siva.kallam, prashant, mchan, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

Hi Josef,

On Thu, Aug 25, 2022 at 2:39 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
> reboot hang on a Dell server, so the commit was reverted.

Can you please re-apply commit d60cd06331a3 and give this patch a try? Thanks!

Kai-Heng

>
> Someone managed to collect the AER log, which indicates that the error is
> caused by an MSI:
> [ 148.762067] ACPI: Preparing to enter system sleep state S5
> [ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> [ 148.803731] {1}[Hardware Error]: event severity: recoverable
> [ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
> [ 148.816088] {1}[Hardware Error]: section_type: PCIe error
> [ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 148.829026] {1}[Hardware Error]: version: 3.0
> [ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
> [ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
> [ 148.847309] {1}[Hardware Error]: slot: 0
> [ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
> [ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> [ 148.865145] {1}[Hardware Error]: class_code: 020000
> [ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
> [ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> [ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
> [ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
> [ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
> [ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
> [ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
> [ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
>
> The MSI is probably raised by incoming packets, so power down the device
> and disable bus mastering to stop the traffic. A user confirmed that this
> approach works.
>
> In addition, to be extra safe, cancel the reset task if it is running.
>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
> BugLink: https://bugs.launchpad.net/bugs/1917471
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index db1e9d810b416..4fe9e2539eac1 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -18077,15 +18077,18 @@ static void tg3_shutdown(struct pci_dev *pdev)
>         struct tg3 *tp = netdev_priv(dev);
>
>         rtnl_lock();
> +
> +       tg3_reset_task_cancel(tp);
>         netif_device_detach(dev);
>
>         if (netif_running(dev))
>                 dev_close(dev);
>
> -       if (system_state == SYSTEM_POWER_OFF)
> -               tg3_power_down(tp);
> +       tg3_power_down(tp);
>
>         rtnl_unlock();
> +
> +       pci_disable_device(pdev);
>  }
>
>  /**
> --
> 2.36.1
>


* Re: [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-25  6:38 [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER Kai-Heng Feng
  2022-08-25  6:43 ` Kai-Heng Feng
@ 2022-08-25 17:44 ` Jakub Kicinski
  2022-08-26  2:33   ` Kai-Heng Feng
  1 sibling, 1 reply; 4+ messages in thread
From: Jakub Kicinski @ 2022-08-25 17:44 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: siva.kallam, prashant, mchan, Josef Bacik, David S. Miller,
	Eric Dumazet, Paolo Abeni, netdev, linux-kernel

On Thu, 25 Aug 2022 14:38:52 +0800 Kai-Heng Feng wrote:
>  	rtnl_lock();
> +
> +	tg3_reset_task_cancel(tp);

Doesn't this "task" take rtnl_lock()? Looks deadlocky.
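
To make the concern concrete, here is a toy single-threaded model of the
two orderings. It is not kernel code and not the driver's implementation.
It assumes, as the question above suggests, that the reset task needs
rtnl_lock to finish and that the cancel waits for the task to complete.
All names (lock_model, cancel_work_sync_model, and so on) are hypothetical.

```c
#include <stdbool.h>

/* Toy model: one lock and one queued work item. */
struct lock_model {
	bool lock_held;		/* stands in for rtnl_lock held by the caller */
	bool work_pending;	/* stands in for a queued reset task */
};

/* In this model the work item must take the lock before it can finish. */
static bool work_can_finish(const struct lock_model *m)
{
	return !m->lock_held;
}

/*
 * Models a synchronous cancel. Returns false in the situation where real
 * code would block forever: the caller holds the lock the work item needs.
 */
static bool cancel_work_sync_model(struct lock_model *m)
{
	if (!m->work_pending)
		return true;
	if (!work_can_finish(m))
		return false;	/* would deadlock */
	m->work_pending = false;
	return true;
}

/* Ordering in the patch as posted: take the lock, then cancel. */
static bool shutdown_lock_then_cancel(struct lock_model *m)
{
	bool ok;

	m->lock_held = true;
	ok = cancel_work_sync_model(m);
	m->lock_held = false;
	return ok;
}

/* One possible alternative ordering: cancel first, then take the lock. */
static bool shutdown_cancel_then_lock(struct lock_model *m)
{
	bool ok = cancel_work_sync_model(m);

	m->lock_held = true;
	/* detach, close and power down would go here */
	m->lock_held = false;
	return ok;
}
```

In this model, shutdown_lock_then_cancel() reports failure whenever a work
item is pending, while shutdown_cancel_then_lock() completes. Whether the
real driver needs more than a reordering is outside what the model shows.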


* Re: [PATCH] tg3: Disable tg3 device on system reboot to avoid triggering AER
  2022-08-25 17:44 ` Jakub Kicinski
@ 2022-08-26  2:33   ` Kai-Heng Feng
  0 siblings, 0 replies; 4+ messages in thread
From: Kai-Heng Feng @ 2022-08-26  2:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: siva.kallam, prashant, mchan, Josef Bacik, David S. Miller,
	Eric Dumazet, Paolo Abeni, netdev, linux-kernel

On Fri, Aug 26, 2022 at 1:44 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 25 Aug 2022 14:38:52 +0800 Kai-Heng Feng wrote:
> >       rtnl_lock();
> > +
> > +     tg3_reset_task_cancel(tp);
>
> Doesn't this "task" take rtnl_lock()? Looks deadlocky.

Thanks for the review. I sent v2 to address it.

Kai-Heng
