* [PATCH net-next] tg3: Prevent system hang during repeated EEH errors.
@ 2013-06-14 21:15 Nithin Nayak Sujir
  2013-06-17 18:28 ` Benjamin Poirier
  0 siblings, 1 reply; 5+ messages in thread
From: Nithin Nayak Sujir @ 2013-06-14 21:15 UTC (permalink / raw)
  To: davem; +Cc: netdev, Michael Chan, Nithin Nayak Sujir

From: Michael Chan <mchan@broadcom.com>

The current tg3 code assumes that the pci_error_handlers are always
called in sequence.  In particular, during ->error_detected(), NAPI is
disabled and the device is shut down.  The device is later reset and
NAPI is re-enabled in ->slot_reset() and ->resume().

In EEH, if more than 6 errors are detected in an hour, only
->error_detected() will be called.  This leaves the driver in an
inconsistent state: NAPI is disabled but the netif_running state is
still true.  When the device is later closed, we'll try to disable NAPI
again and it will loop forever.

We fix this by closing the device if we encounter any error conditions
during the normal sequence of the pci_error_handlers.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 28a645f..bfe1831 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -17747,10 +17747,13 @@ static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
 	tg3_full_unlock(tp);
 
 done:
-	if (state == pci_channel_io_perm_failure)
+	if (state == pci_channel_io_perm_failure) {
+		tg3_napi_enable(tp);
+		dev_close(netdev);
 		err = PCI_ERS_RESULT_DISCONNECT;
-	else
+	} else {
 		pci_disable_device(pdev);
+	}
 
 	rtnl_unlock();
 
@@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
 	rc = PCI_ERS_RESULT_RECOVERED;
 
 done:
+	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
+		tg3_napi_enable(tp);
+		dev_close(netdev);
+	}
 	rtnl_unlock();
 
 	return rc;
@@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
 	if (err) {
 		tg3_full_unlock(tp);
 		netdev_err(netdev, "Cannot restart hardware after reset.\n");
+		tg3_napi_enable(tp);
+		dev_close(netdev);
 		goto done;
 	}
 
-- 
1.8.1.4


* Re: [PATCH net-next] tg3: Prevent system hang during repeated EEH errors.
  2013-06-14 21:15 [PATCH net-next] tg3: Prevent system hang during repeated EEH errors Nithin Nayak Sujir
@ 2013-06-17 18:28 ` Benjamin Poirier
  2013-06-17 18:56   ` Benjamin Poirier
  2013-06-17 18:59   ` Michael Chan
  0 siblings, 2 replies; 5+ messages in thread
From: Benjamin Poirier @ 2013-06-17 18:28 UTC (permalink / raw)
  To: Nithin Nayak Sujir; +Cc: davem, netdev, Michael Chan

On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
[...]
> @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
>  	rc = PCI_ERS_RESULT_RECOVERED;
>  
>  done:
> +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> +		tg3_napi_enable(tp);
> +		dev_close(netdev);
> +	}
>  	rtnl_unlock();
>  
>  	return rc;
> @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
>  	if (err) {
>  		tg3_full_unlock(tp);
>  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> +		tg3_napi_enable(tp);
> +		dev_close(netdev);
>  		goto done;
>  	}

Are these two hunks needed?
1) These functions do not call tg3_netif_stop() or tg3_napi_disable()
2) an error in tg3_io_resume() does not trigger device removal in
handle_eeh_events(). In fact the ->resume callback has no return value.



* Re: [PATCH net-next] tg3: Prevent system hang during repeated EEH errors.
  2013-06-17 18:28 ` Benjamin Poirier
@ 2013-06-17 18:56   ` Benjamin Poirier
  2013-06-17 19:11     ` Michael Chan
  2013-06-17 18:59   ` Michael Chan
  1 sibling, 1 reply; 5+ messages in thread
From: Benjamin Poirier @ 2013-06-17 18:56 UTC (permalink / raw)
  To: Nithin Nayak Sujir; +Cc: davem, netdev, Michael Chan

On 2013/06/17 14:28, Benjamin Poirier wrote:
> On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
> [...]
> > @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
> >  	rc = PCI_ERS_RESULT_RECOVERED;
> >  
> >  done:
> > +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> > +	}
> >  	rtnl_unlock();
> >  
> >  	return rc;
> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}
> 
> Are these two hunks needed?
> 1) These functions do not call tg3_netif_stop() or tg3_napi_disable()

Ok, I see why this is relevant, since the slot_reset and resume
callbacks are always called after the error_detected callback.

> 2) an error in tg3_io_resume() does not trigger device removal in
> handle_eeh_events(). In fact the ->resume callback has no return value.

Nevertheless, this hunk

> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}

duplicates the error handling code already in tg3_restart_hw().



* Re: [PATCH net-next] tg3: Prevent system hang during repeated EEH errors.
  2013-06-17 18:28 ` Benjamin Poirier
  2013-06-17 18:56   ` Benjamin Poirier
@ 2013-06-17 18:59   ` Michael Chan
  1 sibling, 0 replies; 5+ messages in thread
From: Michael Chan @ 2013-06-17 18:59 UTC (permalink / raw)
  To: Benjamin Poirier; +Cc: Nithin Nayak Sujir, davem, netdev

On Mon, 2013-06-17 at 14:28 -0400, Benjamin Poirier wrote: 
> On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
> [...]
> > @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
> >  	rc = PCI_ERS_RESULT_RECOVERED;
> >  
> >  done:
> > +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> > +	}
> >  	rtnl_unlock();
> >  
> >  	return rc;
> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}
> 
> Are these two hunks needed?
> 1) These functions do not call tg3_netif_stop() or tg3_napi_disable()
> 2) an error in tg3_io_resume() does not trigger device removal in
> handle_eeh_events(). In fact the ->resume callback has no return value.
> 

The normal sequence is:

error_detected(), slot_reset(), resume()

In error_detected(), the chip will be shut down and NAPI will be
disabled if the netif_running state is true.  When everything works
correctly, the chip will be re-enabled in resume() and NAPI re-enabled.
If we run into any error in this sequence, the sequence will not
complete normally.  In this case, if the netif_running state is true, we
know that NAPI was disabled earlier in error_detected(), and we need to
properly close the device.
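
The sequence described above can be sketched as a toy state model.  This
is not tg3 code: struct toy_dev and every toy_* function are
hypothetical stand-ins, and the model assumes that a second NAPI disable
without an intervening enable is the call that would never return.  To
stay runnable, the model reports that case as `false` instead of
spinning.

```c
#include <stdbool.h>

/* Toy model only; all names below are hypothetical, not driver code. */
struct toy_dev {
	bool running;		/* stands in for netif_running() */
	bool napi_disabled;	/* stands in for the NAPI disabled state */
};

/* Returns false where a real second disable is assumed to wait forever. */
static bool toy_napi_disable(struct toy_dev *d)
{
	if (d->napi_disabled)
		return false;
	d->napi_disabled = true;
	return true;
}

static void toy_napi_enable(struct toy_dev *d)
{
	d->napi_disabled = false;
}

/* Models closing the device: disable NAPI, then clear the running state. */
static bool toy_close(struct toy_dev *d)
{
	if (!d->running)
		return true;	/* already closed, nothing to do */
	if (!toy_napi_disable(d))
		return false;	/* the would-hang path */
	d->running = false;
	return true;
}

/*
 * Models ->error_detected().  With with_fix set, a permanent failure
 * re-enables NAPI and closes the device, mirroring the approach in
 * the patch.
 */
static void toy_error_detected(struct toy_dev *d, bool perm_failure,
			       bool with_fix)
{
	if (!d->running)
		return;
	toy_napi_disable(d);
	if (perm_failure && with_fix) {
		toy_napi_enable(d);
		toy_close(d);
	}
}
```

Without the fix, a later toy_close() finds the device still marked
running with NAPI already disabled and takes the would-hang path.  With
the fix, the device is already closed by the time the error handler
returns, so a later close is a no-op.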


* Re: [PATCH net-next] tg3: Prevent system hang during repeated EEH errors.
  2013-06-17 18:56   ` Benjamin Poirier
@ 2013-06-17 19:11     ` Michael Chan
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Chan @ 2013-06-17 19:11 UTC (permalink / raw)
  To: Benjamin Poirier; +Cc: Nithin Nayak Sujir, davem, netdev

On Mon, 2013-06-17 at 14:56 -0400, Benjamin Poirier wrote:
> Nevertheless, this hunk
> 
> > > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> > >     if (err) {
> > >             tg3_full_unlock(tp);
> > >             netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > > +           tg3_napi_enable(tp);
> > > +           dev_close(netdev);
> > >             goto done;
> > >     }
> 
> duplicates the error handling code already in tg3_restart_hw(). 

Very good point.  We'll modify the patch and re-send.  Thanks a lot
Benjamin.

