All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/3] bnxt_en: AER fixes
@ 2024-04-19 18:34 Michael Chan
  2024-04-19 18:34 ` [PATCH net 1/3] bnxt_en: refactor reset close code Michael Chan
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Michael Chan @ 2024-04-19 18:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, kuba, pabeni, andrew.gospodarek

[-- Attachment #1: Type: text/plain, Size: 601 bytes --]

This patchset fixes issues in the AER recovery logic.  The first patch
refactors the code to make a shutdown function available for AER fatal
errors.  The second patch fixes the AER fatal recovery logic.  The
third patch fixes the health register logic to fix AER recovery failure
for the new P7 chips.

Michael Chan (1):
  bnxt_en: Fix error recovery for 5760X (P7) chips

Vikas Gupta (2):
  bnxt_en: refactor reset close code
  bnxt_en: Fix the PCI-AER routines

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 38 +++++++++++++++++------
 1 file changed, 28 insertions(+), 10 deletions(-)

-- 
2.30.1


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH net 1/3] bnxt_en: refactor reset close code
  2024-04-19 18:34 [PATCH net 0/3] bnxt_en: AER fixes Michael Chan
@ 2024-04-19 18:34 ` Michael Chan
  2024-04-19 18:34 ` [PATCH net 2/3] bnxt_en: Fix the PCI-AER routines Michael Chan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2024-04-19 18:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, kuba, pabeni, andrew.gospodarek, Vikas Gupta

[-- Attachment #1: Type: text/plain, Size: 1487 bytes --]

From: Vikas Gupta <vikas.gupta@broadcom.com>

Introduce bnxt_fw_fatal_close() API which can be used
to stop data path and disable device when firmware
is in fatal state.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 57e61f963167..c852f87c842f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -13037,6 +13037,16 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 	bnxt_rtnl_unlock_sp(bp);
 }
 
+static void bnxt_fw_fatal_close(struct bnxt *bp)
+{
+	bnxt_tx_disable(bp);
+	bnxt_disable_napi(bp);
+	bnxt_disable_int_sync(bp);
+	bnxt_free_irq(bp);
+	bnxt_clear_int_mode(bp);
+	pci_disable_device(bp->pdev);
+}
+
 static void bnxt_fw_reset_close(struct bnxt *bp)
 {
 	bnxt_ulp_stop(bp);
@@ -13050,12 +13060,7 @@ static void bnxt_fw_reset_close(struct bnxt *bp)
 		pci_read_config_word(bp->pdev, PCI_SUBSYSTEM_ID, &val);
 		if (val == 0xffff)
 			bp->fw_reset_min_dsecs = 0;
-		bnxt_tx_disable(bp);
-		bnxt_disable_napi(bp);
-		bnxt_disable_int_sync(bp);
-		bnxt_free_irq(bp);
-		bnxt_clear_int_mode(bp);
-		pci_disable_device(bp->pdev);
+		bnxt_fw_fatal_close(bp);
 	}
 	__bnxt_close_nic(bp, true, false);
 	bnxt_vf_reps_free(bp);
-- 
2.30.1


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH net 2/3] bnxt_en: Fix the PCI-AER routines
  2024-04-19 18:34 [PATCH net 0/3] bnxt_en: AER fixes Michael Chan
  2024-04-19 18:34 ` [PATCH net 1/3] bnxt_en: refactor reset close code Michael Chan
@ 2024-04-19 18:34 ` Michael Chan
  2024-04-19 18:34 ` [PATCH net 3/3] bnxt_en: Fix error recovery for 5760X (P7) chips Michael Chan
  2024-04-22 13:20 ` [PATCH net 0/3] bnxt_en: AER fixes patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2024-04-19 18:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, kuba, pabeni, andrew.gospodarek, Vikas Gupta

[-- Attachment #1: Type: text/plain, Size: 2365 bytes --]

From: Vikas Gupta <vikas.gupta@broadcom.com>

We do not support two simultaneous recoveries so check for reset
flag, BNXT_STATE_IN_FW_RESET, and do not proceed with AER further.
When the pci channel state is pci_channel_io_frozen, the PCIe link
can not be trusted so we disable the traffic immediately and stop
BAR access by calling bnxt_fw_fatal_close().  BAR access after
AER fatal error can cause an NMI.

Fixes: f75d9a0aa967 ("bnxt_en: Re-write PCI BARs after PCI fatal error.")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c852f87c842f..86c1c30c70d5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15378,6 +15378,7 @@ static pci_ers_result_t bnxt_io_error_detected(struct pci_dev *pdev,
 {
 	struct net_device *netdev = pci_get_drvdata(pdev);
 	struct bnxt *bp = netdev_priv(netdev);
+	bool abort = false;
 
 	netdev_info(netdev, "PCI I/O error detected\n");
 
@@ -15386,16 +15387,27 @@ static pci_ers_result_t bnxt_io_error_detected(struct pci_dev *pdev,
 
 	bnxt_ulp_stop(bp);
 
-	if (state == pci_channel_io_perm_failure) {
+	if (test_and_set_bit(BNXT_STATE_IN_FW_RESET, &bp->state)) {
+		netdev_err(bp->dev, "Firmware reset already in progress\n");
+		abort = true;
+	}
+
+	if (abort || state == pci_channel_io_perm_failure) {
 		rtnl_unlock();
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
-	if (state == pci_channel_io_frozen)
+	/* Link is not reliable anymore if state is pci_channel_io_frozen
+	 * so we disable bus master to prevent any potential bad DMAs before
+	 * freeing kernel memory.
+	 */
+	if (state == pci_channel_io_frozen) {
 		set_bit(BNXT_STATE_PCI_CHANNEL_IO_FROZEN, &bp->state);
+		bnxt_fw_fatal_close(bp);
+	}
 
 	if (netif_running(netdev))
-		bnxt_close(netdev);
+		__bnxt_close_nic(bp, true, true);
 
 	if (pci_is_enabled(pdev))
 		pci_disable_device(pdev);
@@ -15479,6 +15491,7 @@ static pci_ers_result_t bnxt_io_slot_reset(struct pci_dev *pdev)
 	}
 
 reset_exit:
+	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
 	bnxt_clear_reservations(bp, true);
 	rtnl_unlock();
 
-- 
2.30.1


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH net 3/3] bnxt_en: Fix error recovery for 5760X (P7) chips
  2024-04-19 18:34 [PATCH net 0/3] bnxt_en: AER fixes Michael Chan
  2024-04-19 18:34 ` [PATCH net 1/3] bnxt_en: refactor reset close code Michael Chan
  2024-04-19 18:34 ` [PATCH net 2/3] bnxt_en: Fix the PCI-AER routines Michael Chan
@ 2024-04-19 18:34 ` Michael Chan
  2024-04-22 13:20 ` [PATCH net 0/3] bnxt_en: AER fixes patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2024-04-19 18:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, kuba, pabeni, andrew.gospodarek

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

During error recovery, such as AER fatal error slot reset, we call
bnxt_try_map_fw_health_reg() to try to get access to the health
register to determine the firmware state.  Fix
bnxt_try_map_fw_health_reg() to recognize the P7 chip correctly
and set up the health register.

This fixes this type of AER slot reset failure:

bnxt_en 0000:04:00.0: AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
bnxt_en 0000:04:00.0 enp4s0f0np0: PCI I/O error detected
bnxt_en 0000:04:00.0 bnxt_re0: Handle device suspend call
bnxt_en 0000:04:00.1 enp4s0f1np1: PCI I/O error detected
bnxt_en 0000:04:00.1 bnxt_re1: Handle device suspend call
pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
bnxt_en 0000:04:00.0 enp4s0f0np0: PCI Slot Reset
bnxt_en 0000:04:00.0: enabling device (0000 -> 0002)
bnxt_en 0000:04:00.0: Firmware not ready
bnxt_en 0000:04:00.1 enp4s0f1np1: PCI Slot Reset
bnxt_en 0000:04:00.1: enabling device (0000 -> 0002)
bnxt_en 0000:04:00.1: Firmware not ready
pcieport 0000:00:02.0: AER: device recovery failed

Fixes: a432a45bdba4 ("bnxt_en: Define basic P7 macros")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 86c1c30c70d5..ed04a90a4fdd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -9089,7 +9089,7 @@ static void bnxt_try_map_fw_health_reg(struct bnxt *bp)
 					     BNXT_FW_HEALTH_WIN_BASE +
 					     BNXT_GRC_REG_CHIP_NUM);
 		}
-		if (!BNXT_CHIP_P5(bp))
+		if (!BNXT_CHIP_P5_PLUS(bp))
 			return;
 
 		status_loc = BNXT_GRC_REG_STATUS_P5 |
-- 
2.30.1


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net 0/3] bnxt_en: AER fixes
  2024-04-19 18:34 [PATCH net 0/3] bnxt_en: AER fixes Michael Chan
                   ` (2 preceding siblings ...)
  2024-04-19 18:34 ` [PATCH net 3/3] bnxt_en: Fix error recovery for 5760X (P7) chips Michael Chan
@ 2024-04-22 13:20 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-04-22 13:20 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, edumazet, kuba, pabeni, andrew.gospodarek

Hello:

This series was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri, 19 Apr 2024 11:34:46 -0700 you wrote:
> This patchset fixes issues in the AER recovery logic.  The first patch
> refactors the code to make a shutdown function available for AER fatal
> errors.  The second patch fixes the AER fatal recovery logic.  The
> third patch fixes the health register logic to fix AER recovery failure
> for the new P7 chips.
> 
> Michael Chan (1):
>   bnxt_en: Fix error recovery for 5760X (P7) chips
> 
> [...]

Here is the summary with links:
  - [net,1/3] bnxt_en: refactor reset close code
    https://git.kernel.org/netdev/net/c/7474b1c82be3
  - [net,2/3] bnxt_en: Fix the PCI-AER routines
    https://git.kernel.org/netdev/net/c/a1acdc226bae
  - [net,3/3] bnxt_en: Fix error recovery for 5760X (P7) chips
    https://git.kernel.org/netdev/net/c/41e54045b741

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-04-22 13:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-19 18:34 [PATCH net 0/3] bnxt_en: AER fixes Michael Chan
2024-04-19 18:34 ` [PATCH net 1/3] bnxt_en: refactor reset close code Michael Chan
2024-04-19 18:34 ` [PATCH net 2/3] bnxt_en: Fix the PCI-AER routines Michael Chan
2024-04-19 18:34 ` [PATCH net 3/3] bnxt_en: Fix error recovery for 5760X (P7) chips Michael Chan
2024-04-22 13:20 ` [PATCH net 0/3] bnxt_en: AER fixes patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.