All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline
@ 2016-08-31 15:11 Guilherme G. Piccoli
  2016-09-01  7:52 ` Yuval Mintz
  2016-09-02  5:50 ` David Miller
  0 siblings, 2 replies; 3+ messages in thread
From: Guilherme G. Piccoli @ 2016-08-31 15:11 UTC (permalink / raw)
  To: Yuval.Mintz, ariel.elior; +Cc: netdev, gpiccoli

When PCI error is detected, in some architectures (like PowerPC) a slot
reset is performed - the driver's error handlers are in charge of "disable"
device before the reset, and re-enable it after a successful slot reset.

There are two cases though that another path is taken on the code: if the
slot reset is not successful or if too many errors already happened in the
specific adapter (meaning that possibly the device is experiencing a HW
failure that slot reset is not able to solve), the core PCI error mechanism
(called EEH in PowerPC) will remove the adapter from the system, since it
will consider this as a permanent failure on device. In this case, a path
is taken that leads to bnx2x_chip_cleanup() calling bnx2x_reset_hw(), which
then tries to perform a HW reset on chip. This reset won't succeed since
the HW is in a fault state, which can be seen by multiple messages on
kernel log like below:

	bnx2x: [bnx2x_issue_dmae_with_comp:552(eth1)]DMAE timeout!
	bnx2x: [bnx2x_write_dmae:600(eth1)]DMAE returned failure -1

After some time, the PCI error mechanism gives up on waiting the driver's
correct removal procedure and forcibly remove the adapter from the system.
We can see soft lockup while core PCI error mechanism is waiting for driver
to accomplish the right removal process.

This patch adds a verification to avoid a chip reset whenever the function
is in PCI error state - since this case is only reached when we have a
device being removed because of a permanent failure, the HW chip reset is
not expected to work fine neither is necessary.

Also, as a minor improvement in error path, we avoid the MCP information dump
in case of non-recoverable PCI error (when adapter is about to be removed),
since it will certainly fail.

Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
---
v2: Removed the unlikely attribute on bnx2x_fw_dump_lvl() if block.

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 97e8925..fa3386b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -772,6 +772,11 @@ void bnx2x_fw_dump_lvl(struct bnx2x *bp, const char *lvl)
 		(bp->common.bc_ver & 0xff00) >> 8,
 		(bp->common.bc_ver & 0xff));
 
+	if (pci_channel_offline(bp->pdev)) {
+		BNX2X_ERR("Cannot dump MCP info while in PCI error\n");
+		return;
+	}
+
 	val = REG_RD(bp, MCP_REG_MCPR_CPU_PROGRAM_COUNTER);
 	if (val == REG_RD(bp, MCP_REG_MCPR_CPU_PROGRAM_COUNTER))
 		BNX2X_ERR("%s" "MCP PC at 0x%x\n", lvl, val);
@@ -9415,10 +9420,16 @@ unload_error:
 	/* Release IRQs */
 	bnx2x_free_irq(bp);
 
-	/* Reset the chip */
-	rc = bnx2x_reset_hw(bp, reset_code);
-	if (rc)
-		BNX2X_ERR("HW_RESET failed\n");
+	/* Reset the chip, unless PCI function is offline. If we reach this
+	 * point following a PCI error handling, it means device is really
+	 * in a bad state and we're about to remove it, so reset the chip
+	 * is not a good idea.
+	 */
+	if (!pci_channel_offline(bp->pdev)) {
+		rc = bnx2x_reset_hw(bp, reset_code);
+		if (rc)
+			BNX2X_ERR("HW_RESET failed\n");
+	}
 
 	/* Report UNLOAD_DONE to MCP */
 	bnx2x_send_unload_done(bp, keep_link);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* RE: [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline
  2016-08-31 15:11 [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline Guilherme G. Piccoli
@ 2016-09-01  7:52 ` Yuval Mintz
  2016-09-02  5:50 ` David Miller
  1 sibling, 0 replies; 3+ messages in thread
From: Yuval Mintz @ 2016-09-01  7:52 UTC (permalink / raw)
  To: Guilherme G. Piccoli; +Cc: netdev, Ariel Elior

> When PCI error is detected, in some architectures (like PowerPC) a slot reset is
> performed - the driver's error handlers are in charge of "disable"
> device before the reset, and re-enable it after a successful slot reset.
> 
> There are two cases though that another path is taken on the code: if the slot
> reset is not successful or if too many errors already happened in the specific
> adapter (meaning that possibly the device is experiencing a HW failure that slot
> reset is not able to solve), the core PCI error mechanism (called EEH in PowerPC)
> will remove the adapter from the system, since it will consider this as a
> permanent failure on device. In this case, a path is taken that leads to
> bnx2x_chip_cleanup() calling bnx2x_reset_hw(), which then tries to perform a
> HW reset on chip. This reset won't succeed since the HW is in a fault state,
> which can be seen by multiple messages on kernel log like below:
> 
> 	bnx2x: [bnx2x_issue_dmae_with_comp:552(eth1)]DMAE timeout!
> 	bnx2x: [bnx2x_write_dmae:600(eth1)]DMAE returned failure -1
> 
> After some time, the PCI error mechanism gives up on waiting the driver's
> correct removal procedure and forcibly remove the adapter from the system.
> We can see soft lockup while core PCI error mechanism is waiting for driver to
> accomplish the right removal process.
> 
> This patch adds a verification to avoid a chip reset whenever the function is in
> PCI error state - since this case is only reached when we have a device being
> removed because of a permanent failure, the HW chip reset is not expected to
> work fine neither is necessary.
> 
> Also, as a minor improvement in error path, we avoid the MCP information
> dump in case of non-recoverable PCI error (when adapter is about to be
> removed), since it will certainly fail.
> 
> Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>

Thanks.

Acked-By: Yuval Mintz <Yuval.Mintz@qlogic.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline
  2016-08-31 15:11 [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline Guilherme G. Piccoli
  2016-09-01  7:52 ` Yuval Mintz
@ 2016-09-02  5:50 ` David Miller
  1 sibling, 0 replies; 3+ messages in thread
From: David Miller @ 2016-09-02  5:50 UTC (permalink / raw)
  To: gpiccoli; +Cc: Yuval.Mintz, ariel.elior, netdev

From: "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com>
Date: Wed, 31 Aug 2016 12:11:57 -0300

> When PCI error is detected, in some architectures (like PowerPC) a slot
> reset is performed - the driver's error handlers are in charge of "disable"
> device before the reset, and re-enable it after a successful slot reset.
> 
> There are two cases though that another path is taken on the code: if the
> slot reset is not successful or if too many errors already happened in the
> specific adapter (meaning that possibly the device is experiencing a HW
> failure that slot reset is not able to solve), the core PCI error mechanism
> (called EEH in PowerPC) will remove the adapter from the system, since it
> will consider this as a permanent failure on device. In this case, a path
> is taken that leads to bnx2x_chip_cleanup() calling bnx2x_reset_hw(), which
> then tries to perform a HW reset on chip. This reset won't succeed since
> the HW is in a fault state, which can be seen by multiple messages on
> kernel log like below:
> 
> 	bnx2x: [bnx2x_issue_dmae_with_comp:552(eth1)]DMAE timeout!
> 	bnx2x: [bnx2x_write_dmae:600(eth1)]DMAE returned failure -1
> 
> After some time, the PCI error mechanism gives up on waiting the driver's
> correct removal procedure and forcibly remove the adapter from the system.
> We can see soft lockup while core PCI error mechanism is waiting for driver
> to accomplish the right removal process.
> 
> This patch adds a verification to avoid a chip reset whenever the function
> is in PCI error state - since this case is only reached when we have a
> device being removed because of a permanent failure, the HW chip reset is
> not expected to work fine neither is necessary.
> 
> Also, as a minor improvement in error path, we avoid the MCP information dump
> in case of non-recoverable PCI error (when adapter is about to be removed),
> since it will certainly fail.
> 
> Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
> ---
> v2: Removed the unlikely attribute on bnx2x_fw_dump_lvl() if block.

Applied, thanks.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-09-02  5:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-31 15:11 [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline Guilherme G. Piccoli
2016-09-01  7:52 ` Yuval Mintz
2016-09-02  5:50 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.