* [PATCH] tg3: add new module param to force device power down on reboot
@ 2024-01-09 19:45 Andrea Fois
2024-01-09 20:31 ` Heiner Kallweit
0 siblings, 1 reply; 7+ messages in thread
From: Andrea Fois @ 2024-01-09 19:45 UTC (permalink / raw)
Cc: andrea.fois, Pavan Chebbi, Michael Chan, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, George Shuklin,
netdev, linux-kernel
The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
device on system reboot to avoid triggering AER") but was reintroduced
by commit 9fc3bc764334 ("tg3: power down device only on
SYSTEM_POWER_OFF").
The problem described in #1917471 is still consistently replicable on
reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
(i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
was committed.
The problem is detected also by the Lifecycle controller and logged as
a PCI Bus Error for the device.
As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
opposite strategies, a new module param "force_pwr_down_on_reboot"
<bool> is introduced to fix both scenarios:
force_pwr_down_on_reboot = 0/N/n = disable, keep the current
behavior, don't force dev
power down on reboot
force_pwr_down_on_reboot = 1/Y/y = enable, revert to the
behavior of 2ca1c94ce0b6,
force dev power down on reboot
Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
Signed-off-by: Andrea Fois <andrea.fois@eventsense.it>
---
drivers/net/ethernet/broadcom/tg3.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index f52830dfb26a..287786357c9b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -233,6 +233,12 @@ static int tg3_debug = -1; /* -1 == use TG3_DEF_MSG_ENABLE as value */
module_param(tg3_debug, int, 0);
MODULE_PARM_DESC(tg3_debug, "Tigon3 bitmapped debugging message enable value");
+static bool force_pwr_down_on_reboot; /* false == Don't force the power down of
+ * the device during reboot, only on SYSTEM_POWER_OFF
+ */
+module_param(force_pwr_down_on_reboot, bool, 0x644);
+MODULE_PARM_DESC(force_pwr_down_on_reboot, "Tigon3 force power down of the device on reboot enable value");
+
#define TG3_DRV_DATA_FLAG_10_100_ONLY 0x0001
#define TG3_DRV_DATA_FLAG_5705_10_100 0x0002
@@ -18197,7 +18203,7 @@ static void tg3_shutdown(struct pci_dev *pdev)
if (netif_running(dev))
dev_close(dev);
- if (system_state == SYSTEM_POWER_OFF)
+ if (system_state == SYSTEM_POWER_OFF || force_pwr_down_on_reboot)
tg3_power_down(tp);
rtnl_unlock();
--
2.40.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-09 19:45 [PATCH] tg3: add new module param to force device power down on reboot Andrea Fois
@ 2024-01-09 20:31 ` Heiner Kallweit
2024-01-10 4:12 ` Pavan Chebbi
0 siblings, 1 reply; 7+ messages in thread
From: Heiner Kallweit @ 2024-01-09 20:31 UTC (permalink / raw)
To: Andrea Fois
Cc: Pavan Chebbi, Michael Chan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, George Shuklin, netdev,
linux-kernel
On 09.01.2024 20:45, Andrea Fois wrote:
> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
> device on system reboot to avoid triggering AER") but was reintroduced
> by commit 9fc3bc764334 ("tg3: power down device only on
> SYSTEM_POWER_OFF").
>
> The problem described in #1917471 is still consistently replicable on
> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
> was committed.
>
> The problem is detected also by the Lifecycle controller and logged as
> a PCI Bus Error for the device.
>
> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
> opposite strategies, a new module param "force_pwr_down_on_reboot"
> <bool> is introduced to fix both scenarios:
>
Adding module parameters is discouraged. What I see could try:
- limit 9fc3bc764334 to the specific machine type mentioned in the
commit message (based DMI info)
- 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
Based on the commit description disabling the device might be sufficient.
> force_pwr_down_on_reboot = 0/N/n = disable, keep the current
> behavior, don't force dev
> power down on reboot
>
> force_pwr_down_on_reboot = 1/Y/y = enable, revert to the
> behavior of 2ca1c94ce0b6,
> force dev power down on reboot
>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Andrea Fois <andrea.fois@eventsense.it>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index f52830dfb26a..287786357c9b 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -233,6 +233,12 @@ static int tg3_debug = -1; /* -1 == use TG3_DEF_MSG_ENABLE as value */
> module_param(tg3_debug, int, 0);
> MODULE_PARM_DESC(tg3_debug, "Tigon3 bitmapped debugging message enable value");
>
> +static bool force_pwr_down_on_reboot; /* false == Don't force the power down of
> + * the device during reboot, only on SYSTEM_POWER_OFF
> + */
> +module_param(force_pwr_down_on_reboot, bool, 0x644);
> +MODULE_PARM_DESC(force_pwr_down_on_reboot, "Tigon3 force power down of the device on reboot enable value");
> +
> #define TG3_DRV_DATA_FLAG_10_100_ONLY 0x0001
> #define TG3_DRV_DATA_FLAG_5705_10_100 0x0002
>
> @@ -18197,7 +18203,7 @@ static void tg3_shutdown(struct pci_dev *pdev)
> if (netif_running(dev))
> dev_close(dev);
>
> - if (system_state == SYSTEM_POWER_OFF)
> + if (system_state == SYSTEM_POWER_OFF || force_pwr_down_on_reboot)
> tg3_power_down(tp);
>
> rtnl_unlock();
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-09 20:31 ` Heiner Kallweit
@ 2024-01-10 4:12 ` Pavan Chebbi
2024-01-10 7:09 ` Heiner Kallweit
0 siblings, 1 reply; 7+ messages in thread
From: Pavan Chebbi @ 2024-01-10 4:12 UTC (permalink / raw)
To: Heiner Kallweit
Cc: Andrea Fois, Michael Chan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, George Shuklin, netdev,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1753 bytes --]
On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 09.01.2024 20:45, Andrea Fois wrote:
> > The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
> > device on system reboot to avoid triggering AER") but was reintroduced
> > by commit 9fc3bc764334 ("tg3: power down device only on
> > SYSTEM_POWER_OFF").
> >
> > The problem described in #1917471 is still consistently replicable on
> > reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
> > (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
> > was committed.
> >
> > The problem is detected also by the Lifecycle controller and logged as
> > a PCI Bus Error for the device.
> >
> > As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
> > opposite strategies, a new module param "force_pwr_down_on_reboot"
> > <bool> is introduced to fix both scenarios:
> >
> Adding module parameters is discouraged. What I see could try:
Ack.
>
> - limit 9fc3bc764334 to the specific machine type mentioned in the
> commit message (based DMI info)
> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
> Based on the commit description disabling the device might be sufficient.
I think the second suggestion could be a better solution. Helps to
solve the issue 9fc3bc764334 is trying to fix.
But I am not sure how easy it is to test. As I recall, Goerge was
unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
test his patch for regression.
We did discuss the risk of this regression.
https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@gmail.com/
Unfortunately, looks like it has come true :(
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-10 4:12 ` Pavan Chebbi
@ 2024-01-10 7:09 ` Heiner Kallweit
2024-01-10 7:17 ` Michael Chan
0 siblings, 1 reply; 7+ messages in thread
From: Heiner Kallweit @ 2024-01-10 7:09 UTC (permalink / raw)
To: Pavan Chebbi
Cc: Andrea Fois, Michael Chan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, George Shuklin, netdev,
linux-kernel
On 10.01.2024 05:12, Pavan Chebbi wrote:
> On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 09.01.2024 20:45, Andrea Fois wrote:
>>> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
>>> device on system reboot to avoid triggering AER") but was reintroduced
>>> by commit 9fc3bc764334 ("tg3: power down device only on
>>> SYSTEM_POWER_OFF").
>>>
>>> The problem described in #1917471 is still consistently replicable on
>>> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
>>> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
>>> was committed.
>>>
>>> The problem is detected also by the Lifecycle controller and logged as
>>> a PCI Bus Error for the device.
>>>
>>> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
>>> opposite strategies, a new module param "force_pwr_down_on_reboot"
>>> <bool> is introduced to fix both scenarios:
>>>
>> Adding module parameters is discouraged. What I see could try:
> Ack.
>>
>> - limit 9fc3bc764334 to the specific machine type mentioned in the
>> commit message (based DMI info)
>> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
>> Based on the commit description disabling the device might be sufficient.
>
> I think the second suggestion could be a better solution. Helps to
> solve the issue 9fc3bc764334 is trying to fix.
> But I am not sure how easy it is to test. As I recall, Goerge was
> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> test his patch for regression.
> We did discuss the risk of this regression.
> https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@gmail.com/
> Unfortunately, looks like it has come true :(
If the culprit is an unexpected MSI interrupt, then you may also address
this directly by disabling interrupts in the device. tg3_stop() may be
a candidate here.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-10 7:09 ` Heiner Kallweit
@ 2024-01-10 7:17 ` Michael Chan
2024-01-10 7:34 ` Heiner Kallweit
0 siblings, 1 reply; 7+ messages in thread
From: Michael Chan @ 2024-01-10 7:17 UTC (permalink / raw)
To: Heiner Kallweit
Cc: Pavan Chebbi, Andrea Fois, Michael Chan, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, George Shuklin,
netdev, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 909 bytes --]
On Tue, Jan 9, 2024 at 11:09 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 10.01.2024 05:12, Pavan Chebbi wrote:
> > I think the second suggestion could be a better solution. Helps to
> > solve the issue 9fc3bc764334 is trying to fix.
> > But I am not sure how easy it is to test. As I recall, Goerge was
> > unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> > test his patch for regression.
> > We did discuss the risk of this regression.
> > https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@gmail.com/
> > Unfortunately, looks like it has come true :(
>
> If the culprit is an unexpected MSI interrupt, then you may also address
> this directly by disabling interrupts in the device. tg3_stop() may be
> a candidate here.
>
We already call dev_close() which will call tg3_close() -> tg3_stop()
a few lines above.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-10 7:17 ` Michael Chan
@ 2024-01-10 7:34 ` Heiner Kallweit
2024-01-10 17:01 ` Michael Chan
0 siblings, 1 reply; 7+ messages in thread
From: Heiner Kallweit @ 2024-01-10 7:34 UTC (permalink / raw)
To: Michael Chan
Cc: Pavan Chebbi, Andrea Fois, Michael Chan, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, George Shuklin,
netdev, linux-kernel
On 10.01.2024 08:17, Michael Chan wrote:
> On Tue, Jan 9, 2024 at 11:09 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 10.01.2024 05:12, Pavan Chebbi wrote:
>>> I think the second suggestion could be a better solution. Helps to
>>> solve the issue 9fc3bc764334 is trying to fix.
>>> But I am not sure how easy it is to test. As I recall, Goerge was
>>> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
>>> test his patch for regression.
>>> We did discuss the risk of this regression.
>>> https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@gmail.com/
>>> Unfortunately, looks like it has come true :(
>>
>> If the culprit is an unexpected MSI interrupt, then you may also address
>> this directly by disabling interrupts in the device. tg3_stop() may be
>> a candidate here.
>>
>
> We already call dev_close() which will call tg3_close() -> tg3_stop()
> a few lines above.
tg3_stop() calls tg3_disable_ints(), so I wonder how a MSI interrupt can
occur after that. Does tg3_disable_ints() disable interrupts synchronously?
Or maybe some kind of commit is needed?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] tg3: add new module param to force device power down on reboot
2024-01-10 7:34 ` Heiner Kallweit
@ 2024-01-10 17:01 ` Michael Chan
0 siblings, 0 replies; 7+ messages in thread
From: Michael Chan @ 2024-01-10 17:01 UTC (permalink / raw)
To: Heiner Kallweit
Cc: Pavan Chebbi, Andrea Fois, Michael Chan, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, George Shuklin,
netdev, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 533 bytes --]
On Tue, Jan 9, 2024 at 11:34 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 10.01.2024 08:17, Michael Chan wrote:
> > We already call dev_close() which will call tg3_close() -> tg3_stop()
> > a few lines above.
>
> tg3_stop() calls tg3_disable_ints(), so I wonder how a MSI interrupt can
> occur after that. Does tg3_disable_ints() disable interrupts synchronously?
> Or maybe some kind of commit is needed?
>
Yes, it is synchronous. The tg3_full_lock() call before
tg3_disable_ints() makes it synchronous.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-01-10 17:01 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-09 19:45 [PATCH] tg3: add new module param to force device power down on reboot Andrea Fois
2024-01-09 20:31 ` Heiner Kallweit
2024-01-10 4:12 ` Pavan Chebbi
2024-01-10 7:09 ` Heiner Kallweit
2024-01-10 7:17 ` Michael Chan
2024-01-10 7:34 ` Heiner Kallweit
2024-01-10 17:01 ` Michael Chan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).