netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 0/2] bnxt_en: Two bug fixes.
@ 2020-09-06  2:55 Michael Chan
  2020-09-06  2:55 ` [PATCH net 1/2] bnxt_en: Avoid sending firmware messages when AER error is detected Michael Chan
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Michael Chan @ 2020-09-06  2:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, kuba

The first patch fixes AER recovery by reducing the time from several
minutes to a more reasonable 20 - 30 seconds.  The second patch fixes
a possible NULL pointer crash during firmware reset.

Please queue for -stable also.  Thanks.

Vasundhara Volam (2):
  bnxt_en: Avoid sending firmware messages when AER error is detected.
  bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 13 +++++++------
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  4 ++++
 2 files changed, 11 insertions(+), 6 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net 1/2] bnxt_en: Avoid sending firmware messages when AER error is detected.
  2020-09-06  2:55 [PATCH net 0/2] bnxt_en: Two bug fixes Michael Chan
@ 2020-09-06  2:55 ` Michael Chan
  2020-09-06  2:55 ` [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task() Michael Chan
  2020-09-07 17:16 ` [PATCH net 0/2] bnxt_en: Two bug fixes Jakub Kicinski
  2 siblings, 0 replies; 9+ messages in thread
From: Michael Chan @ 2020-09-06  2:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, kuba, Vasundhara Volam

From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>

When the driver goes through PCIe AER reset in error state, all
firmware messages will timeout because the PCIe bus is no longer
accessible.  This can lead to AER reset taking many minutes to
complete as each firmware command takes time to timeout.

Define a new macro BNXT_NO_FW_ACCESS() to skip these firmware messages
when either firmware is in fatal error state or when
pci_channel_offline() is true.  It now takes a more reasonable 20 to
30 seconds to complete AER recovery.

Fixes: b4fff2079d10 ("bnxt_en: Do not send firmware messages if firmware is in error state.")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 4 ++++
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b167066..619eb55 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4305,7 +4305,7 @@ static int bnxt_hwrm_do_send_msg(struct bnxt *bp, void *msg, u32 msg_len,
 	u32 bar_offset = BNXT_GRCPF_REG_CHIMP_COMM;
 	u16 dst = BNXT_HWRM_CHNL_CHIMP;
 
-	if (test_bit(BNXT_STATE_FW_FATAL_COND, &bp->state))
+	if (BNXT_NO_FW_ACCESS(bp))
 		return -EBUSY;
 
 	if (msg_len > BNXT_HWRM_MAX_REQ_LEN) {
@@ -5723,7 +5723,7 @@ static int hwrm_ring_free_send_msg(struct bnxt *bp,
 	struct hwrm_ring_free_output *resp = bp->hwrm_cmd_resp_addr;
 	u16 error_code;
 
-	if (test_bit(BNXT_STATE_FW_FATAL_COND, &bp->state))
+	if (BNXT_NO_FW_ACCESS(bp))
 		return 0;
 
 	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_RING_FREE, cmpl_ring_id, -1);
@@ -7817,7 +7817,7 @@ static int bnxt_set_tpa(struct bnxt *bp, bool set_tpa)
 
 	if (set_tpa)
 		tpa_flags = bp->flags & BNXT_FLAG_TPA;
-	else if (test_bit(BNXT_STATE_FW_FATAL_COND, &bp->state))
+	else if (BNXT_NO_FW_ACCESS(bp))
 		return 0;
 	for (i = 0; i < bp->nr_vnics; i++) {
 		rc = bnxt_hwrm_vnic_set_tpa(bp, i, tpa_flags);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 5a13eb6..0ef89da 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1737,6 +1737,10 @@ struct bnxt {
 #define BNXT_STATE_FW_FATAL_COND	6
 #define BNXT_STATE_DRV_REGISTERED	7
 
+#define BNXT_NO_FW_ACCESS(bp)					\
+	(test_bit(BNXT_STATE_FW_FATAL_COND, &(bp)->state) ||	\
+	 pci_channel_offline((bp)->pdev))
+
 	struct bnxt_irq	*irq_tbl;
 	int			total_irqs;
 	u8			mac_addr[ETH_ALEN];
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-06  2:55 [PATCH net 0/2] bnxt_en: Two bug fixes Michael Chan
  2020-09-06  2:55 ` [PATCH net 1/2] bnxt_en: Avoid sending firmware messages when AER error is detected Michael Chan
@ 2020-09-06  2:55 ` Michael Chan
  2020-09-06 19:25   ` Jakub Kicinski
  2020-09-07 17:16 ` [PATCH net 0/2] bnxt_en: Two bug fixes Jakub Kicinski
  2 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2020-09-06  2:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, kuba, Vasundhara Volam

From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>

bnxt_fw_reset_task() which runs from a workqueue can race with
bnxt_remove_one().  For example, if firmware reset and VF FLR are
happening at about the same time.

bnxt_remove_one() already cancels the workqueue and waits for it
to finish, but we need to do this earlier before the devlink
reporters are destroyed.  This will guarantee that
the devlink reporters will always be valid when bnxt_fw_reset_task()
is still running.

Fixes: b148bb238c02 ("bnxt_en: Fix possible crash in bnxt_fw_reset_task().")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 619eb55..8eb73fe 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11779,6 +11779,10 @@ static void bnxt_remove_one(struct pci_dev *pdev)
 	if (BNXT_PF(bp))
 		bnxt_sriov_disable(bp);
 
+	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
+	bnxt_cancel_sp_work(bp);
+	bp->sp_event = 0;
+
 	bnxt_dl_fw_reporters_destroy(bp, true);
 	if (BNXT_PF(bp))
 		devlink_port_type_clear(&bp->dl_port);
@@ -11786,9 +11790,6 @@ static void bnxt_remove_one(struct pci_dev *pdev)
 	unregister_netdev(dev);
 	bnxt_dl_unregister(bp);
 	bnxt_shutdown_tc(bp);
-	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
-	bnxt_cancel_sp_work(bp);
-	bp->sp_event = 0;
 
 	bnxt_clear_int_mode(bp);
 	bnxt_hwrm_func_drv_unrgtr(bp);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-06  2:55 ` [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task() Michael Chan
@ 2020-09-06 19:25   ` Jakub Kicinski
  2020-09-06 22:07     ` Michael Chan
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2020-09-06 19:25 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, Vasundhara Volam

On Sat,  5 Sep 2020 22:55:37 -0400 Michael Chan wrote:
> From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> 
> bnxt_fw_reset_task() which runs from a workqueue can race with
> bnxt_remove_one().  For example, if firmware reset and VF FLR are
> happening at about the same time.
> 
> bnxt_remove_one() already cancels the workqueue and waits for it
> to finish, but we need to do this earlier before the devlink
> reporters are destroyed.  This will guarantee that
> the devlink reporters will always be valid when bnxt_fw_reset_task()
> is still running.
> 
> Fixes: b148bb238c02 ("bnxt_en: Fix possible crash in bnxt_fw_reset_task().")
> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 619eb55..8eb73fe 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -11779,6 +11779,10 @@ static void bnxt_remove_one(struct pci_dev *pdev)
>  	if (BNXT_PF(bp))
>  		bnxt_sriov_disable(bp);
>  
> +	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
> +	bnxt_cancel_sp_work(bp);
> +	bp->sp_event = 0;
> +
>  	bnxt_dl_fw_reporters_destroy(bp, true);
>  	if (BNXT_PF(bp))
>  		devlink_port_type_clear(&bp->dl_port);
> @@ -11786,9 +11790,6 @@ static void bnxt_remove_one(struct pci_dev *pdev)
>  	unregister_netdev(dev);
>  	bnxt_dl_unregister(bp);
>  	bnxt_shutdown_tc(bp);
> -	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
> -	bnxt_cancel_sp_work(bp);
> -	bp->sp_event = 0;
>  
>  	bnxt_clear_int_mode(bp);
>  	bnxt_hwrm_func_drv_unrgtr(bp);

devlink can itself scheduler a recovery via:

  bnxt_fw_fatal_recover() -> bnxt_fw_reset()

no? Maybe don't make the devlink recovery path need to go via a
workqueue?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-06 19:25   ` Jakub Kicinski
@ 2020-09-06 22:07     ` Michael Chan
  2020-09-07  3:13       ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2020-09-06 22:07 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David Miller, Netdev, Vasundhara Volam

On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> devlink can itself scheduler a recovery via:
>
>   bnxt_fw_fatal_recover() -> bnxt_fw_reset()
>

Yes, this is how it is initiated when we call devlink_health_report()
to report the error condition.  From bnxt_fw_reset(), we use a
workqueue because we have to go through many states, requiring
sleeping/polling to transition through the states.

> no? Maybe don't make the devlink recovery path need to go via a
> workqueue?

Current implementation is going through a work queue.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-06 22:07     ` Michael Chan
@ 2020-09-07  3:13       ` Jakub Kicinski
  2020-09-07  3:48         ` Michael Chan
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2020-09-07  3:13 UTC (permalink / raw)
  To: Michael Chan; +Cc: David Miller, Netdev, Vasundhara Volam

On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote:
> On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > devlink can itself scheduler a recovery via:
> >
> >   bnxt_fw_fatal_recover() -> bnxt_fw_reset()
> >  
> 
> Yes, this is how it is initiated when we call devlink_health_report()
> to report the error condition.  From bnxt_fw_reset(), we use a
> workqueue because we have to go through many states, requiring
> sleeping/polling to transition through the states.
> 
> > no? Maybe don't make the devlink recovery path need to go via a
> > workqueue?  
> 
> Current implementation is going through a work queue.

What I'm saying is the code looks like this after this patch:

+	clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
+	bnxt_cancel_sp_work(bp);
+	bp->sp_event = 0;
+
 	bnxt_dl_fw_reporters_destroy(bp, true);

It cancels the work, _then_ destroys the reporter. But I think the
reported can be used to schedule a recovery from command line. So the
work may get re-scheduled after it has been canceled.

devlink_nl_cmd_health_reporter_recover_doit() -> bnxt_fw_fatal_recover() ->
  bnxt_fw_reset() -> bnxt_queue_fw_reset_work()

What am I missing?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-07  3:13       ` Jakub Kicinski
@ 2020-09-07  3:48         ` Michael Chan
  2020-09-07 17:13           ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2020-09-07  3:48 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David Miller, Netdev, Vasundhara Volam

On Sun, Sep 6, 2020 at 8:13 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote:
> > On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > devlink can itself scheduler a recovery via:
> > >
> > >   bnxt_fw_fatal_recover() -> bnxt_fw_reset()
> > >
> >
> > Yes, this is how it is initiated when we call devlink_health_report()
> > to report the error condition.  From bnxt_fw_reset(), we use a
> > workqueue because we have to go through many states, requiring
> > sleeping/polling to transition through the states.
> >
> > > no? Maybe don't make the devlink recovery path need to go via a
> > > workqueue?
> >
> > Current implementation is going through a work queue.
>
> What I'm saying is the code looks like this after this patch:
>
> +       clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
> +       bnxt_cancel_sp_work(bp);
> +       bp->sp_event = 0;
> +
>         bnxt_dl_fw_reporters_destroy(bp, true);
>
> It cancels the work, _then_ destroys the reporter. But I think the
> reported can be used to schedule a recovery from command line. So the
> work may get re-scheduled after it has been canceled.

bnxt_en does not support recovery from the command line.  We return
-EOPNOTSUPP when it comes from the command line.

Recovery has to be triggered from a firmware reported error or a
driver detected error.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task()
  2020-09-07  3:48         ` Michael Chan
@ 2020-09-07 17:13           ` Jakub Kicinski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2020-09-07 17:13 UTC (permalink / raw)
  To: Michael Chan; +Cc: David Miller, Netdev, Vasundhara Volam

On Sun, 6 Sep 2020 20:48:04 -0700 Michael Chan wrote:
> On Sun, Sep 6, 2020 at 8:13 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote:  
> > > On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > >
> > > > devlink can itself scheduler a recovery via:
> > > >
> > > >   bnxt_fw_fatal_recover() -> bnxt_fw_reset()
> > > >  
> > >
> > > Yes, this is how it is initiated when we call devlink_health_report()
> > > to report the error condition.  From bnxt_fw_reset(), we use a
> > > workqueue because we have to go through many states, requiring
> > > sleeping/polling to transition through the states.
> > >  
> > > > no? Maybe don't make the devlink recovery path need to go via a
> > > > workqueue?  
> > >
> > > Current implementation is going through a work queue.  
> >
> > What I'm saying is the code looks like this after this patch:
> >
> > +       clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state);
> > +       bnxt_cancel_sp_work(bp);
> > +       bp->sp_event = 0;
> > +
> >         bnxt_dl_fw_reporters_destroy(bp, true);
> >
> > It cancels the work, _then_ destroys the reporter. But I think the
> > reported can be used to schedule a recovery from command line. So the
> > work may get re-scheduled after it has been canceled.  
> 
> bnxt_en does not support recovery from the command line.  We return
> -EOPNOTSUPP when it comes from the command line.
> 
> Recovery has to be triggered from a firmware reported error or a
> driver detected error.

I see it now, thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net 0/2] bnxt_en: Two bug fixes.
  2020-09-06  2:55 [PATCH net 0/2] bnxt_en: Two bug fixes Michael Chan
  2020-09-06  2:55 ` [PATCH net 1/2] bnxt_en: Avoid sending firmware messages when AER error is detected Michael Chan
  2020-09-06  2:55 ` [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task() Michael Chan
@ 2020-09-07 17:16 ` Jakub Kicinski
  2 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2020-09-07 17:16 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev

On Sat,  5 Sep 2020 22:55:35 -0400 Michael Chan wrote:
> The first patch fixes AER recovery by reducing the time from several
> minutes to a more reasonable 20 - 30 seconds.  The second patch fixes
> a possible NULL pointer crash during firmware reset.
> 
> Please queue for -stable also.  Thanks.

Applied & queued.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-09-07 17:17 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-06  2:55 [PATCH net 0/2] bnxt_en: Two bug fixes Michael Chan
2020-09-06  2:55 ` [PATCH net 1/2] bnxt_en: Avoid sending firmware messages when AER error is detected Michael Chan
2020-09-06  2:55 ` [PATCH net 2/2] bnxt_en: Fix NULL ptr dereference crash in bnxt_fw_reset_task() Michael Chan
2020-09-06 19:25   ` Jakub Kicinski
2020-09-06 22:07     ` Michael Chan
2020-09-07  3:13       ` Jakub Kicinski
2020-09-07  3:48         ` Michael Chan
2020-09-07 17:13           ` Jakub Kicinski
2020-09-07 17:16 ` [PATCH net 0/2] bnxt_en: Two bug fixes Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).