linux-rdma.vger.kernel.org archive mirror
* Re: [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow
@ 2021-09-14  6:23 Shai Malin
  2021-09-14 10:00 ` Leon Romanovsky
  0 siblings, 1 reply; 5+ messages in thread
From: Shai Malin @ 2021-09-14  6:23 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: netdev, davem, kuba, linux-rdma, jgg, Ariel Elior, malin1024,
	Michal Kalderon

On Mon, Sep 13, 2021 at 5:45:00PM +0300, Leon Romanovsky wrote:
> On Mon, Sep 13, 2021 at 03:14:42PM +0300, Shai Malin wrote:
> > If the HW device is during recovery, the HW resources will never return,
> > hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
> > This fix speeds up the error recovery flow.
> >
> > Fixes: 64515dc899df ("qed: Add infrastructure for error detection and recovery")
> > Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
> > Signed-off-by: Ariel Elior <aelior@marvell.com>
> > Signed-off-by: Shai Malin <smalin@marvell.com>
> > ---
> >  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 7 +++++++
> >  drivers/net/ethernet/qlogic/qed/qed_roce.c  | 7 +++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > index fc8b3e64f153..4967e383c31a 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > @@ -1323,6 +1323,13 @@ static int qed_iwarp_wait_for_all_cids(struct qed_hwfn *p_hwfn)
> >  	int rc;
> >  	int i;
> >
> > +	/* If the HW device is during recovery, all resources are immediately
> > +	 * reset without receiving a per-cid indication from HW. In this case
> > +	 * we don't expect the cid_map to be cleared.
> > +	 */
> > +	if (p_hwfn->cdev->recov_in_prog)
> > +		return 0;
> 
> How do you ensure that this doesn't race with recovery flow?

HW recovery starts with the management FW, which detects the problem and
reports it to the driver, and "cdev->recov_in_prog = true" is also set for all
the devices on the same HW.
The qedr recovery flow is actually the qedr_remove flow, except that when
"cdev->recov_in_prog = true" it "ignores" the FW/HW resources.
The changes introduced by this patch are part of this qedr remove flow.
cdev->recov_in_prog is set back to false only as part of the following
probe, after the HW has been re-initialized.

> 
> > +
> >  	rc = qed_iwarp_wait_cid_map_cleared(p_hwfn,
> >  					    &p_hwfn->p_rdma_info->tcp_cid_map);
> >  	if (rc)
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > index f16a157bb95a..aff5a2871b8f 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > @@ -71,6 +71,13 @@ void qed_roce_stop(struct qed_hwfn *p_hwfn)
> >  	struct qed_bmap *rcid_map = &p_hwfn->p_rdma_info->real_cid_map;
> >  	int wait_count = 0;
> >
> > +	/* If the HW device is during recovery, all resources are immediately
> > +	 * reset without receiving a per-cid indication from HW. In this case
> > +	 * we don't expect the cid bitmap to be cleared.
> > +	 */
> > +	if (p_hwfn->cdev->recov_in_prog)
> > +		return;
> > +
> >  	/* when destroying a_RoCE QP the control is returned to the user after
> >  	 * the synchronous part. The asynchronous part may take a little longer.
> >  	 * We delay for a short while if an async destroy QP is still expected.
> > --
> > 2.22.0
> >


* Re: [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow
  2021-09-14  6:23 [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow Shai Malin
@ 2021-09-14 10:00 ` Leon Romanovsky
  0 siblings, 0 replies; 5+ messages in thread
From: Leon Romanovsky @ 2021-09-14 10:00 UTC (permalink / raw)
  To: Shai Malin
  Cc: netdev, davem, kuba, linux-rdma, jgg, Ariel Elior, malin1024,
	Michal Kalderon

On Tue, Sep 14, 2021 at 06:23:02AM +0000, Shai Malin wrote:
> On Mon, Sep 13, 2021 at 5:45:00PM +0300, Leon Romanovsky wrote:
> > On Mon, Sep 13, 2021 at 03:14:42PM +0300, Shai Malin wrote:
> > > If the HW device is during recovery, the HW resources will never return,
> > > hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
> > > This fix speeds up the error recovery flow.
> > >
> > > Fixes: 64515dc899df ("qed: Add infrastructure for error detection and recovery")
> > > Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
> > > Signed-off-by: Ariel Elior <aelior@marvell.com>
> > > Signed-off-by: Shai Malin <smalin@marvell.com>
> > > ---
> > >  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 7 +++++++
> > >  drivers/net/ethernet/qlogic/qed/qed_roce.c  | 7 +++++++
> > >  2 files changed, 14 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > index fc8b3e64f153..4967e383c31a 100644
> > > --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > @@ -1323,6 +1323,13 @@ static int qed_iwarp_wait_for_all_cids(struct qed_hwfn *p_hwfn)
> > >  	int rc;
> > >  	int i;
> > >
> > > +	/* If the HW device is during recovery, all resources are immediately
> > > +	 * reset without receiving a per-cid indication from HW. In this case
> > > +	 * we don't expect the cid_map to be cleared.
> > > +	 */
> > > +	if (p_hwfn->cdev->recov_in_prog)
> > > +		return 0;
> > 
> > How do you ensure that this doesn't race with recovery flow?
> 
> The HW recovery will start with the management FW which will detect and report
> the problem to the driver and it also set "cdev->recov_in_prog = true" for all
> the devices on the same HW.
> The qedr recovery flow is actually the qedr_remove flow but if 
> "cdev->recov_in_prog = true" it will "ignore" the FW/HW resources.
> The changes introduced with this patch are part of this qedr remove flow.
> The cdev->recov_in_prog will be set to false only as part of the following 
> probe and after the HW was re-initialized.

I asked how you make sure that recov_in_prog does not change to
"true" right after your "if ..." check.

Thanks
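
The window being asked about can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct fake_dev,
MAX_POLLS, the hook type and both functions are hypothetical stand-ins, not
the qed structures or wait helpers, and the "other context" is simulated by a
callback instead of a real concurrent thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real qed structures. */
struct fake_dev {
	bool recov_in_prog;	/* set once HW error recovery has started */
	int busy_cids;		/* CIDs still marked in the (modelled) bitmap */
};

#define MAX_POLLS 100

/* Hook that simulates another context running between two polls. */
typedef void (*poll_hook_t)(struct fake_dev *dev, int poll);

/*
 * Check the flag once, up front, and then poll until the CIDs are
 * released or the poll budget is exhausted. Returns the polls spent.
 */
static int wait_check_once(struct fake_dev *dev, poll_hook_t hook)
{
	int polls = 0;

	if (dev->recov_in_prog)
		return 0;

	while (dev->busy_cids > 0 && polls < MAX_POLLS) {
		if (hook)
			hook(dev, polls);
		polls++;
	}

	return polls;
}

/* Simulated recovery that begins right after the up-front check. */
static void recovery_starts_on_first_poll(struct fake_dev *dev, int poll)
{
	if (poll == 0)
		dev->recov_in_prog = true;
}
```

In this model, calling wait_check_once() with recovery_starts_on_first_poll
on a device whose busy_cids never drops burns the whole poll budget even
though the flag became true almost immediately, because the flag is never
looked at again after the first check.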

> 
> > 
> > > +
> > >  	rc = qed_iwarp_wait_cid_map_cleared(p_hwfn,
> > >  					    &p_hwfn->p_rdma_info->tcp_cid_map);
> > >  	if (rc)
> > > diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > index f16a157bb95a..aff5a2871b8f 100644
> > > --- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > +++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > @@ -71,6 +71,13 @@ void qed_roce_stop(struct qed_hwfn *p_hwfn)
> > >  	struct qed_bmap *rcid_map = &p_hwfn->p_rdma_info->real_cid_map;
> > >  	int wait_count = 0;
> > >
> > > +	/* If the HW device is during recovery, all resources are immediately
> > > +	 * reset without receiving a per-cid indication from HW. In this case
> > > +	 * we don't expect the cid bitmap to be cleared.
> > > +	 */
> > > +	if (p_hwfn->cdev->recov_in_prog)
> > > +		return;
> > > +
> > >  	/* when destroying a_RoCE QP the control is returned to the user after
> > >  	 * the synchronous part. The asynchronous part may take a little longer.
> > >  	 * We delay for a short while if an async destroy QP is still expected.
> > > --
> > > 2.22.0
> > >


* RE: [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow
@ 2021-09-14 15:40 Shai Malin
  0 siblings, 0 replies; 5+ messages in thread
From: Shai Malin @ 2021-09-14 15:40 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: netdev, davem, kuba, linux-rdma, jgg, Ariel Elior, malin1024,
	Michal Kalderon


On Tue, Sep 14, 2021 at 1:01PM +0300, Leon Romanovsky wrote:
> On Tue, Sep 14, 2021 at 06:23:02AM +0000, Shai Malin wrote:
> > On Mon, Sep 13, 2021 at 5:45:00PM +0300, Leon Romanovsky wrote:
> > > On Mon, Sep 13, 2021 at 03:14:42PM +0300, Shai Malin wrote:
> > > > If the HW device is during recovery, the HW resources will never return,
> > > > hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
> > > > This fix speeds up the error recovery flow.
> > > >
> > > > Fixes: 64515dc899df ("qed: Add infrastructure for error detection and recovery")
> > > > Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
> > > > Signed-off-by: Ariel Elior <aelior@marvell.com>
> > > > Signed-off-by: Shai Malin <smalin@marvell.com>
> > > > ---
> > > >  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 7 +++++++
> > > >  drivers/net/ethernet/qlogic/qed/qed_roce.c  | 7 +++++++
> > > >  2 files changed, 14 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > > index fc8b3e64f153..4967e383c31a 100644
> > > > --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > > +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> > > > @@ -1323,6 +1323,13 @@ static int qed_iwarp_wait_for_all_cids(struct qed_hwfn *p_hwfn)
> > > >  	int rc;
> > > >  	int i;
> > > >
> > > > +	/* If the HW device is during recovery, all resources are immediately
> > > > +	 * reset without receiving a per-cid indication from HW. In this case
> > > > +	 * we don't expect the cid_map to be cleared.
> > > > +	 */
> > > > +	if (p_hwfn->cdev->recov_in_prog)
> > > > +		return 0;
> > >
> > > How do you ensure that this doesn't race with recovery flow?
> >
> > The HW recovery will start with the management FW which will detect and report
> > the problem to the driver and it also set "cdev->recov_in_prog = true" for all
> > the devices on the same HW.
> > The qedr recovery flow is actually the qedr_remove flow but if
> > "cdev->recov_in_prog = true" it will "ignore" the FW/HW resources.
> > The changes introduced with this patch are part of this qedr remove flow.
> > The cdev->recov_in_prog will be set to false only as part of the following
> > probe and after the HW was re-initialized.
> 
> I asked how do you make sure that recov_in_prog is not changing to be
> "true" right after your "if ..." check?
> 
> Thanks

Thanks Leon - it's a valid point. Moving the "if..." check into the while loop
for both RoCE and iWARP will solve it.
I will fix it in V2.
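
A minimal sketch of what "moving the check into the while loop" could look
like, written as a user-space model. Everything here is an assumption made
for illustration: struct fake_dev, MAX_POLLS, the hook type and both functions
are hypothetical and are not the qed wait helpers or the V2 patch itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real qed structures. */
struct fake_dev {
	bool recov_in_prog;	/* set once HW error recovery has started */
	int busy_cids;		/* CIDs still marked in the (modelled) bitmap */
};

#define MAX_POLLS 100

/* Hook that simulates another context running between two polls. */
typedef void (*poll_hook_t)(struct fake_dev *dev, int poll);

/*
 * Re-check the recovery flag on every iteration, so that a recovery which
 * starts in the middle of the wait ends the wait on the next poll.
 * Returns the number of polls spent.
 */
static int wait_check_each_poll(struct fake_dev *dev, poll_hook_t hook)
{
	int polls = 0;

	while (dev->busy_cids > 0 && polls < MAX_POLLS) {
		/* During recovery the CIDs are not expected to be released. */
		if (dev->recov_in_prog)
			return polls;

		if (hook)
			hook(dev, polls);
		polls++;
	}

	return polls;
}

/* Simulated recovery that begins while the wait is already running. */
static void recovery_starts_on_poll_2(struct fake_dev *dev, int poll)
{
	if (poll == 2)
		dev->recov_in_prog = true;
}
```

With recovery_starts_on_poll_2 as the hook and busy_cids stuck above zero,
the loop stops one poll after the flag is set rather than running to
MAX_POLLS. In kernel code a flag shared with another context would normally
be read through something like READ_ONCE(); the plain read here is a
simplification of the model.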

> 
> >
> > >
> > > > +
> > > >  	rc = qed_iwarp_wait_cid_map_cleared(p_hwfn,
> > > >  					    &p_hwfn->p_rdma_info->tcp_cid_map);
> > > >  	if (rc)
> > > > diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > > index f16a157bb95a..aff5a2871b8f 100644
> > > > --- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > > +++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> > > > @@ -71,6 +71,13 @@ void qed_roce_stop(struct qed_hwfn *p_hwfn)
> > > >  	struct qed_bmap *rcid_map = &p_hwfn->p_rdma_info->real_cid_map;
> > > >  	int wait_count = 0;
> > > >
> > > > +	/* If the HW device is during recovery, all resources are immediately
> > > > +	 * reset without receiving a per-cid indication from HW. In this case
> > > > +	 * we don't expect the cid bitmap to be cleared.
> > > > +	 */
> > > > +	if (p_hwfn->cdev->recov_in_prog)
> > > > +		return;
> > > > +
> > > >  	/* when destroying a_RoCE QP the control is returned to the user after
> > > >  	 * the synchronous part. The asynchronous part may take a little longer.
> > > >  	 * We delay for a short while if an async destroy QP is still expected.
> > > > --
> > > > 2.22.0
> > > >


* Re: [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow
  2021-09-13 12:14 Shai Malin
@ 2021-09-13 14:45 ` Leon Romanovsky
  0 siblings, 0 replies; 5+ messages in thread
From: Leon Romanovsky @ 2021-09-13 14:45 UTC (permalink / raw)
  To: Shai Malin
  Cc: netdev, davem, kuba, linux-rdma, jgg, aelior, malin1024, Michal Kalderon

On Mon, Sep 13, 2021 at 03:14:42PM +0300, Shai Malin wrote:
> If the HW device is during recovery, the HW resources will never return,
> hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
> This fix speeds up the error recovery flow.
> 
> Fixes: 64515dc899df ("qed: Add infrastructure for error detection and recovery")
> Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
> Signed-off-by: Ariel Elior <aelior@marvell.com>
> Signed-off-by: Shai Malin <smalin@marvell.com>
> ---
>  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 7 +++++++
>  drivers/net/ethernet/qlogic/qed/qed_roce.c  | 7 +++++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> index fc8b3e64f153..4967e383c31a 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
> @@ -1323,6 +1323,13 @@ static int qed_iwarp_wait_for_all_cids(struct qed_hwfn *p_hwfn)
>  	int rc;
>  	int i;
>  
> +	/* If the HW device is during recovery, all resources are immediately
> +	 * reset without receiving a per-cid indication from HW. In this case
> +	 * we don't expect the cid_map to be cleared.
> +	 */
> +	if (p_hwfn->cdev->recov_in_prog)
> +		return 0;

How do you ensure that this doesn't race with recovery flow?

> +
>  	rc = qed_iwarp_wait_cid_map_cleared(p_hwfn,
>  					    &p_hwfn->p_rdma_info->tcp_cid_map);
>  	if (rc)
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> index f16a157bb95a..aff5a2871b8f 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
> @@ -71,6 +71,13 @@ void qed_roce_stop(struct qed_hwfn *p_hwfn)
>  	struct qed_bmap *rcid_map = &p_hwfn->p_rdma_info->real_cid_map;
>  	int wait_count = 0;
>  
> +	/* If the HW device is during recovery, all resources are immediately
> +	 * reset without receiving a per-cid indication from HW. In this case
> +	 * we don't expect the cid bitmap to be cleared.
> +	 */
> +	if (p_hwfn->cdev->recov_in_prog)
> +		return;
> +
>  	/* when destroying a_RoCE QP the control is returned to the user after
>  	 * the synchronous part. The asynchronous part may take a little longer.
>  	 * We delay for a short while if an async destroy QP is still expected.
> -- 
> 2.22.0
> 


* [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow
@ 2021-09-13 12:14 Shai Malin
  2021-09-13 14:45 ` Leon Romanovsky
  0 siblings, 1 reply; 5+ messages in thread
From: Shai Malin @ 2021-09-13 12:14 UTC (permalink / raw)
  To: netdev, davem, kuba
  Cc: linux-rdma, jgg, aelior, smalin, malin1024, Michal Kalderon

If the HW device is during recovery, the HW resources will never return,
hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
This fix speeds up the error recovery flow.

Fixes: 64515dc899df ("qed: Add infrastructure for error detection and recovery")
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
---
 drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 7 +++++++
 drivers/net/ethernet/qlogic/qed/qed_roce.c  | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
index fc8b3e64f153..4967e383c31a 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
@@ -1323,6 +1323,13 @@ static int qed_iwarp_wait_for_all_cids(struct qed_hwfn *p_hwfn)
 	int rc;
 	int i;
 
+	/* If the HW device is during recovery, all resources are immediately
+	 * reset without receiving a per-cid indication from HW. In this case
+	 * we don't expect the cid_map to be cleared.
+	 */
+	if (p_hwfn->cdev->recov_in_prog)
+		return 0;
+
 	rc = qed_iwarp_wait_cid_map_cleared(p_hwfn,
 					    &p_hwfn->p_rdma_info->tcp_cid_map);
 	if (rc)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
index f16a157bb95a..aff5a2871b8f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
@@ -71,6 +71,13 @@ void qed_roce_stop(struct qed_hwfn *p_hwfn)
 	struct qed_bmap *rcid_map = &p_hwfn->p_rdma_info->real_cid_map;
 	int wait_count = 0;
 
+	/* If the HW device is during recovery, all resources are immediately
+	 * reset without receiving a per-cid indication from HW. In this case
+	 * we don't expect the cid bitmap to be cleared.
+	 */
+	if (p_hwfn->cdev->recov_in_prog)
+		return;
+
 	/* when destroying a_RoCE QP the control is returned to the user after
 	 * the synchronous part. The asynchronous part may take a little longer.
 	 * We delay for a short while if an async destroy QP is still expected.
-- 
2.22.0



Thread overview: 5+ messages
2021-09-14  6:23 [PATCH net] qed: rdma - don't wait for resources under hw error recovery flow Shai Malin
2021-09-14 10:00 ` Leon Romanovsky
  -- strict thread matches above, loose matches on Subject: below --
2021-09-14 15:40 Shai Malin
2021-09-13 12:14 Shai Malin
2021-09-13 14:45 ` Leon Romanovsky
