[PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)

* [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
@ 2021-02-14 22:26 Bob Pearson
  2021-02-15  3:46 ` Zhu Yanjun
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Bob Pearson @ 2021-02-14 22:26 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Three errors occurred in the fix referenced below.

1) The on and off again 'if (skb)' got dropped but was really
needed in rxe_rcv_mcast_pkt() to prevent calling ib_device_put()
on the non-error path.

2) Extending the reference taken by rxe_get_dev_from_net() in
rxe_udp_encap_recv() until each skb is freed was not matched by
a reference in the loopback path resulting in underflows.

3) In rxe_comp.c the function free_pkt() did not clear skb which
triggered a warning at done: and could possibly trigger one at exit: in
rxe_completer(). The WARN_ON_ONCE() calls are not required at done:
and are needed in only one place before going to exit.

This patch fixes these errors.

Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
Signed-off-by: Bob Pearson <rpearson@hpe.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
 drivers/infiniband/sw/rxe/rxe_net.c  | 7 ++++++-
 drivers/infiniband/sw/rxe/rxe_recv.c | 6 ++++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index a8ac791a1bb9..13fc5a1cced1 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -671,6 +671,9 @@ int rxe_completer(void *arg)
 			 * it down the road or let it expire
 			 */
 
+			/* warn if we did receive a packet */
+			WARN_ON_ONCE(skb);
+
 			/* there is nothing to retry in this case */
 			if (!wqe || (wqe->state == wqe_state_posted))
 				goto exit;
@@ -750,7 +753,6 @@ int rxe_completer(void *arg)
 	/* we come here if we are done with processing and want the task to
 	 * exit from the loop calling us
 	 */
-	WARN_ON_ONCE(skb);
 	rxe_drop_ref(qp);
 	return -EAGAIN;
 
@@ -758,7 +760,6 @@ int rxe_completer(void *arg)
 	/* we come here if we have processed a packet we want the task to call
 	 * us again to see if there is anything else to do
 	 */
-	WARN_ON_ONCE(skb);
 	rxe_drop_ref(qp);
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 36d56163afac..8e81df578552 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -406,12 +406,17 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 
 void rxe_loopback(struct sk_buff *skb)
 {
+	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
+
 	if (skb->protocol == htons(ETH_P_IP))
 		skb_pull(skb, sizeof(struct iphdr));
 	else
 		skb_pull(skb, sizeof(struct ipv6hdr));
 
-	rxe_rcv(skb);
+	if (WARN_ON(!ib_device_try_get(&pkt->rxe->ib_dev)))
+		kfree_skb(skb);
+	else
+		rxe_rcv(skb);
 }
 
 struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 8a48a33d587b..a5e330e3bbce 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -299,8 +299,10 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 
 err1:
 	/* free skb if not consumed */
-	kfree_skb(skb);
-	ib_device_put(&rxe->ib_dev);
+	if (unlikely(skb)) {
+		kfree_skb(skb);
+		ib_device_put(&rxe->ib_dev);
+	}
 }
 
 /**
-- 
2.27.0
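
A note on point 2 of the patch description above. The user-space model
below is only a sketch of the imbalance it describes, not the driver
code: fake_dev, fake_dev_get(), fake_dev_put(), free_packet(),
loopback() and count_underflows() are hypothetical names invented for
illustration, and the counter tracks only references taken on behalf of
packets. It assumes every packet drops one reference when it is freed,
so a path that never took one underflows the count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ib_device; not the kernel type. */
struct fake_dev {
	int pkt_refs;   /* references currently held on behalf of packets */
	int underflows; /* puts that had no matching get */
};

void fake_dev_get(struct fake_dev *dev)
{
	dev->pkt_refs++;
}

void fake_dev_put(struct fake_dev *dev)
{
	if (dev->pkt_refs == 0) {
		dev->underflows++; /* nothing left to drop: an underflow */
		return;
	}
	dev->pkt_refs--;
}

/* Assumption of this model: every packet drops one reference when freed. */
void free_packet(struct fake_dev *dev)
{
	fake_dev_put(dev);
}

/* Loopback delivery; take_ref == false models the unfixed path. */
void loopback(struct fake_dev *dev, bool take_ref)
{
	if (take_ref)
		fake_dev_get(dev);
	free_packet(dev); /* the packet is processed, then freed */
}

/* Send `packets` loopback packets and report how many puts underflowed. */
int count_underflows(int packets, bool take_ref)
{
	struct fake_dev dev = { 0, 0 };

	assert(packets >= 0);
	for (int i = 0; i < packets; i++)
		loopback(&dev, take_ref);
	return dev.underflows;
}
```

Calling count_underflows(3, false) models three loopback packets on the
unfixed path; count_underflows(3, true) models the path once
rxe_loopback() takes its own reference before calling rxe_rcv().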


* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-14 22:26 [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again) Bob Pearson
@ 2021-02-15  3:46 ` Zhu Yanjun
  2021-02-15  5:59 ` Leon Romanovsky
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Zhu Yanjun @ 2021-02-15  3:46 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, RDMA mailing list, Bob Pearson

On Mon, Feb 15, 2021 at 6:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> Three errors occurred in the fix referenced below.
>
> 1) The on and off again 'if (skb)' got dropped but was really
> needed in rxe_rcv_mcast_pkt() to prevent calling ib_device_put()
> on the non-error path.
>
> 2) Extending the reference taken by rxe_get_dev_from_net() in
> rxe_udp_encap_recv() until each skb is freed was not matched by
> a reference in the loopback path resulting in underflows.
>
> 3) In rxe_comp.c the function free_pkt() did not clear skb which
> triggered a warning at done: and could possibly at exit: in
> rxe_completer(). The WARN_ONCE() calls are not required at done:
> and only in one place before going to exit.
>
> This patch fixes these errors.
>
> Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
> Signed-off-by: Bob Pearson <rpearson@hpe.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
>  drivers/infiniband/sw/rxe/rxe_net.c  | 7 ++++++-
>  drivers/infiniband/sw/rxe/rxe_recv.c | 6 ++++--
>  3 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index a8ac791a1bb9..13fc5a1cced1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -671,6 +671,9 @@ int rxe_completer(void *arg)
>                          * it down the road or let it expire
>                          */
>
> +                       /* warn if we did receive a packet */
> +                       WARN_ON_ONCE(skb);
> +
>                         /* there is nothing to retry in this case */
>                         if (!wqe || (wqe->state == wqe_state_posted))
>                                 goto exit;
> @@ -750,7 +753,6 @@ int rxe_completer(void *arg)
>         /* we come here if we are done with processing and want the task to
>          * exit from the loop calling us
>          */
> -       WARN_ON_ONCE(skb);
>         rxe_drop_ref(qp);
>         return -EAGAIN;
>
> @@ -758,7 +760,6 @@ int rxe_completer(void *arg)
>         /* we come here if we have processed a packet we want the task to call
>          * us again to see if there is anything else to do
>          */
> -       WARN_ON_ONCE(skb);
>         rxe_drop_ref(qp);
>         return 0;
>  }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 36d56163afac..8e81df578552 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -406,12 +406,17 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>
>  void rxe_loopback(struct sk_buff *skb)
>  {
> +       struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> +
>         if (skb->protocol == htons(ETH_P_IP))
>                 skb_pull(skb, sizeof(struct iphdr));
>         else
>                 skb_pull(skb, sizeof(struct ipv6hdr));
>
> -       rxe_rcv(skb);
> +       if (WARN_ON(!ib_device_try_get(&pkt->rxe->ib_dev)))
> +               kfree_skb(skb);
> +       else
> +               rxe_rcv(skb);
>  }
>
>  struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 8a48a33d587b..a5e330e3bbce 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -299,8 +299,10 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>
>  err1:
>         /* free skb if not consumed */
> -       kfree_skb(skb);
> -       ib_device_put(&rxe->ib_dev);
> +       if (unlikely(skb)) {

From Leon Romanovsky
"
Please don't put "if (a) kfree(a);" constructions unless you want to
deal with daily flux of patches with attempt to remove "if".
"

Zhu Yanjun

> +               kfree_skb(skb);
> +               ib_device_put(&rxe->ib_dev);
> +       }
>  }
>
>  /**
> --
> 2.27.0
>

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-14 22:26 [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again) Bob Pearson
  2021-02-15  3:46 ` Zhu Yanjun
@ 2021-02-15  5:59 ` Leon Romanovsky
  2021-02-26 23:28 ` Bob Pearson
  2021-03-02  5:19 ` Zhu Yanjun
  3 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-02-15  5:59 UTC (permalink / raw)
  To: Bob Pearson; +Cc: jgg, zyjzyj2000, linux-rdma, Bob Pearson

On Sun, Feb 14, 2021 at 04:26:31PM -0600, Bob Pearson wrote:
> Three errors occurred in the fix referenced below.
>
> 1) The on and off again 'if (skb)' got dropped but was really
> needed in rxe_rcv_mcast_pkt() to prevent calling ib_device_put()
> on the non-error path.
>
> 2) Extending the reference taken by rxe_get_dev_from_net() in
> rxe_udp_encap_recv() until each skb is freed was not matched by
> a reference in the loopback path resulting in underflows.
>
> 3) In rxe_comp.c the function free_pkt() did not clear skb which
> triggered a warning at done: and could possibly at exit: in
> rxe_completer(). The WARN_ONCE() calls are not required at done:
> and only in one place before going to exit.
>
> This patch fixes these errors.
>
> Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
> Signed-off-by: Bob Pearson <rpearson@hpe.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
>  drivers/infiniband/sw/rxe/rxe_net.c  | 7 ++++++-
>  drivers/infiniband/sw/rxe/rxe_recv.c | 6 ++++--
>  3 files changed, 13 insertions(+), 5 deletions(-)


diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 8a48a33d587b..29cb0125e76f 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -247,6 +247,11 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	else if (skb->protocol == htons(ETH_P_IPV6))
 		memcpy(&dgid, &ipv6_hdr(skb)->daddr, sizeof(dgid));

+	if (!ib_device_try_get(&rxe->ib_dev)) {
+		kfree_skb(skb);
+		return;
+	}
+
 	/* lookup mcast group corresponding to mgid, takes a ref */
 	mcg = rxe_pool_get_key(&rxe->mc_grp_pool, &dgid);
 	if (!mcg)
@@ -274,10 +279,6 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 		 */
 		if (mce->qp_list.next != &mcg->qp_list) {
 			per_qp_skb = skb_clone(skb, GFP_ATOMIC);
-			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
-				kfree_skb(per_qp_skb);
-				continue;
-			}
 		} else {
 			per_qp_skb = skb;
 			/* show we have consumed the skb */

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-14 22:26 [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again) Bob Pearson
  2021-02-15  3:46 ` Zhu Yanjun
  2021-02-15  5:59 ` Leon Romanovsky
@ 2021-02-26 23:28 ` Bob Pearson
  2021-02-26 23:33   ` Jason Gunthorpe
  2021-03-02  5:19 ` Zhu Yanjun
  3 siblings, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2021-02-26 23:28 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma

On 2/14/21 4:26 PM, Bob Pearson wrote:
> Three errors occurred in the fix referenced below.
> 
> 1) The on and off again 'if (skb)' got dropped but was really
> needed in rxe_rcv_mcast_pkt() to prevent calling ib_device_put()
> on the non-error path.
> 
> 2) Extending the reference taken by rxe_get_dev_from_net() in
> rxe_udp_encap_recv() until each skb is freed was not matched by
> a reference in the loopback path resulting in underflows.
> 
> 3) In rxe_comp.c the function free_pkt() did not clear skb which
> triggered a warning at done: and could possibly at exit: in
> rxe_completer(). The WARN_ONCE() calls are not required at done:
> and only in one place before going to exit.
> 
> This patch fixes these errors.
> 
> Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
> Signed-off-by: Bob Pearson <rpearson@hpe.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
>  drivers/infiniband/sw/rxe/rxe_net.c  | 7 ++++++-
>  drivers/infiniband/sw/rxe/rxe_recv.c | 6 ++++--
>  3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index a8ac791a1bb9..13fc5a1cced1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -671,6 +671,9 @@ int rxe_completer(void *arg)
>  			 * it down the road or let it expire
>  			 */
>  
> +			/* warn if we did receive a packet */
> +			WARN_ON_ONCE(skb);
> +
>  			/* there is nothing to retry in this case */
>  			if (!wqe || (wqe->state == wqe_state_posted))
>  				goto exit;
> @@ -750,7 +753,6 @@ int rxe_completer(void *arg)
>  	/* we come here if we are done with processing and want the task to
>  	 * exit from the loop calling us
>  	 */
> -	WARN_ON_ONCE(skb);
>  	rxe_drop_ref(qp);
>  	return -EAGAIN;
>  
> @@ -758,7 +760,6 @@ int rxe_completer(void *arg)
>  	/* we come here if we have processed a packet we want the task to call
>  	 * us again to see if there is anything else to do
>  	 */
> -	WARN_ON_ONCE(skb);
>  	rxe_drop_ref(qp);
>  	return 0;
>  }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 36d56163afac..8e81df578552 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -406,12 +406,17 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>  
>  void rxe_loopback(struct sk_buff *skb)
>  {
> +	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> +
>  	if (skb->protocol == htons(ETH_P_IP))
>  		skb_pull(skb, sizeof(struct iphdr));
>  	else
>  		skb_pull(skb, sizeof(struct ipv6hdr));
>  
> -	rxe_rcv(skb);
> +	if (WARN_ON(!ib_device_try_get(&pkt->rxe->ib_dev)))
> +		kfree_skb(skb);
> +	else
> +		rxe_rcv(skb);
>  }
>  
>  struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 8a48a33d587b..a5e330e3bbce 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -299,8 +299,10 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>  
>  err1:
>  	/* free skb if not consumed */
> -	kfree_skb(skb);
> -	ib_device_put(&rxe->ib_dev);
> +	if (unlikely(skb)) {
> +		kfree_skb(skb);
> +		ib_device_put(&rxe->ib_dev);
> +	}
>  }
>  
>  /**
> 
Just a reminder. rxe in for-next is broken until this gets done.
thanks

bob

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-26 23:28 ` Bob Pearson
@ 2021-02-26 23:33   ` Jason Gunthorpe
  2021-02-27  0:02     ` Bob Pearson
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2021-02-26 23:33 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
> Just a reminder. rxe in for-next is broken until this gets done.
> thanks

I was expecting you to resend it? There seemed to be some changes
needed

https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/

Jason

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-26 23:33   ` Jason Gunthorpe
@ 2021-02-27  0:02     ` Bob Pearson
  2021-02-27  8:43       ` Leon Romanovsky
  0 siblings, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2021-02-27  0:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, linux-rdma

On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
> On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
>> Just a reminder. rxe in for-next is broken until this gets done.
>> thanks
> 
> I was expecting you to resend it? There seemed to be some changes
> needed
> 
> https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
> 
> Jason
> 
OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case not if there wasn't an error. So no reason to not put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.

bob

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-27  0:02     ` Bob Pearson
@ 2021-02-27  8:43       ` Leon Romanovsky
  2021-02-28 17:04         ` Bob Pearson
  0 siblings, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2021-02-27  8:43 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, zyjzyj2000, linux-rdma

On Fri, Feb 26, 2021 at 06:02:39PM -0600, Bob Pearson wrote:
> On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
> > On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
> >> Just a reminder. rxe in for-next is broken until this gets done.
> >> thanks
> >
> > I was expecting you to resend it? There seemed to be some changes
> > needed
> >
> > https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
> >
> > Jason
> >
> OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case not if there wasn't an error. So no reason to not put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.

First, I posted a diff which makes this if() redundant.
Second, the if () before kfree() is checked by coccinelle and your
"should stay the way it is" will be marked as failure in many CIs,
including ours.

Thanks

>
> bob

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-27  8:43       ` Leon Romanovsky
@ 2021-02-28 17:04         ` Bob Pearson
  2021-03-01  7:24           ` Leon Romanovsky
  2021-03-01  7:42           ` Zhu Yanjun
  0 siblings, 2 replies; 17+ messages in thread
From: Bob Pearson @ 2021-02-28 17:04 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, zyjzyj2000, linux-rdma, Frank Zago

On 2/27/21 2:43 AM, Leon Romanovsky wrote:
> On Fri, Feb 26, 2021 at 06:02:39PM -0600, Bob Pearson wrote:
>> On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
>>> On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
>>>> Just a reminder. rxe in for-next is broken until this gets done.
>>>> thanks
>>>
>>> I was expecting you to resend it? There seemed to be some changes
>>> needed
>>>
>>> https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
>>>
>>> Jason
>>>
>> OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case not if there wasn't an error. So no reason to not put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.
> 
> First, I posted a diff which makes this if() redundant.
> Second, the if () before kfree() is checked by coccinelle and your
> "should stay the way it is" will be marked as failure in many CIs,
> including ours.
> 
> Thanks
> 
>>
>> bob

Leon,

I am not sure we are talking about the same if statement. You wrote

...
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 8a48a33d587b..29cb0125e76f 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -247,6 +247,11 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	else if (skb->protocol == htons(ETH_P_IPV6))
 		memcpy(&dgid, &ipv6_hdr(skb)->daddr, sizeof(dgid));

+	if (!ib_device_try_get(&rxe->ib_dev)) {
+		kfree_skb(skb);
+		return;
+	}
+
 	/* lookup mcast group corresponding to mgid, takes a ref */
 	mcg = rxe_pool_get_key(&rxe->mc_grp_pool, &dgid);
 	if (!mcg)
@@ -274,10 +279,6 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 		 */
 		if (mce->qp_list.next != &mcg->qp_list) {
 			per_qp_skb = skb_clone(skb, GFP_ATOMIC);
-			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
-				kfree_skb(per_qp_skb);
-				continue;
-			}
 		} else {
 			per_qp_skb = skb;
 			/* show we have consumed the skb */
...

which I don't understand.

When a received packet is delivered to the rxe driver in rxe_net.c in rxe_udp_encap_recv(), rxe_get_dev_from_net() is called, which gets a pointer to the ib_device (contained in rxe_dev) and also takes a reference on the ib_device. This pointer is stored in skb->cb[] so the reference needs to be held until the skb is freed. If the skb has a multicast address and more than one QP belongs to the multicast group then new skbs are cloned in rxe_rcv_mcast_pkt() and each has a pointer to the ib_device. Since each skb can have quite different lifetimes they each need to carry a reference to ib_device to protect against having it deleted out from under them. You suggest adding one more reference outside of the loop regardless of how many QPs, if any, belong to the multicast group. I don't see how this can be correct.

In any case this is *not* the if statement that is under discussion in the patch. That one has to do with an error which can occur if the last QP in the list (which gets the original skb in the non-error case) doesn't match or isn't ready to receive the packet and it fails either check_type_state() or check_keys() and falls out of the loop. Now the reference to the ib_device needs to be let go and the skb needs to be freed but only if this error occurs. In the normal case that all happens when the skb is done being processed after calling rxe_rcv_pkt().

So the discussion boils down to whether to type

...
err1:
	kfree_skb(skb);
	if (unlikely(skb))
		ib_device_put(&rxe->ib_dev);

or

err1:
	if (unlikely(skb)) {
		kfree_skb(skb);
		ib_device_put(&rxe->ib_dev);
	}

Here the normal non-error path has skb == NULL and the error path has skb set to the originally delivered packet. The second choice is much clearer as it shows the intent and saves the wasted trip to kfree_skb() for every packet.

bob
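
For reference, the two err1: variants in the message above can be
compared in a small user-space model. This is a sketch under stated
assumptions, not the rxe code: cleanup_stats, fake_free_skb(),
cleanup_free_first() and cleanup_guarded() are hypothetical names, and
fake_free_skb() only counts calls instead of freeing memory. It relies
on kfree_skb() accepting a NULL pointer and doing nothing in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one pass through the err1: cleanup. */
struct cleanup_stats {
	int free_calls; /* times the free routine was entered */
	int frees;      /* times a non-NULL skb was actually released */
	int puts;       /* times the device reference was dropped */
};

/* Model of a NULL-tolerant free routine; it counts instead of freeing. */
void fake_free_skb(struct cleanup_stats *st, const void *skb)
{
	assert(st != NULL);
	st->free_calls++;
	if (!skb)
		return; /* NULL is accepted and ignored */
	st->frees++;
}

/* First variant: free unconditionally, drop the reference only if skb is set. */
struct cleanup_stats cleanup_free_first(const void *skb)
{
	struct cleanup_stats st = { 0, 0, 0 };

	fake_free_skb(&st, skb);
	if (skb)
		st.puts++;
	return st;
}

/* Second variant: both actions sit under one check. */
struct cleanup_stats cleanup_guarded(const void *skb)
{
	struct cleanup_stats st = { 0, 0, 0 };

	if (skb) {
		fake_free_skb(&st, skb);
		st.puts++;
	}
	return st;
}
```

In this model both variants release the skb and drop the reference
exactly once on the error path, and neither drops a reference when skb
is NULL; they differ only in whether the free routine is entered on the
non-error path.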

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-28 17:04         ` Bob Pearson
@ 2021-03-01  7:24           ` Leon Romanovsky
  2021-03-01 16:54             ` Bob Pearson
  2021-03-01  7:42           ` Zhu Yanjun
  1 sibling, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:24 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, zyjzyj2000, linux-rdma, Frank Zago

On Sun, Feb 28, 2021 at 11:04:08AM -0600, Bob Pearson wrote:
> On 2/27/21 2:43 AM, Leon Romanovsky wrote:
> > On Fri, Feb 26, 2021 at 06:02:39PM -0600, Bob Pearson wrote:
> >> On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
> >>> On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
> >>>> Just a reminder. rxe in for-next is broken until this gets done.
> >>>> thanks
> >>>
> >>> I was expecting you to resend it? There seemed to be some changes
> >>> needed
> >>>
> >>> https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
> >>>
> >>> Jason
> >>>
> >> OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case not if there wasn't an error. So no reason to not put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.
> >
> > First, I posted a diff which makes this if() redundant.
> > Second, the if () before kfree() is checked by coccinelle and your
> > "should stay the way it is" will be marked as failure in many CIs,
> > including ours.
> >
> > Thanks
> >
> >>
> >> bob
>
> Leon,
>
> I am not sure we are talking about the same if statement. You wrote
>
> ...
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 8a48a33d587b..29cb0125e76f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -247,6 +247,11 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>  	else if (skb->protocol == htons(ETH_P_IPV6))
>  		memcpy(&dgid, &ipv6_hdr(skb)->daddr, sizeof(dgid));
>
> +	if (!ib_device_try_get(&rxe->ib_dev)) {
> +		kfree_skb(skb);
> +		return;
> +	}
> +
>  	/* lookup mcast group corresponding to mgid, takes a ref */
>  	mcg = rxe_pool_get_key(&rxe->mc_grp_pool, &dgid);
>  	if (!mcg)
> @@ -274,10 +279,6 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>  		 */
>  		if (mce->qp_list.next != &mcg->qp_list) {
>  			per_qp_skb = skb_clone(skb, GFP_ATOMIC);
> -			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
> -				kfree_skb(per_qp_skb);
> -				continue;
> -			}
>  		} else {
>  			per_qp_skb = skb;
>  			/* show we have consumed the skb */
> ...
>
> which I don't understand.
>
> When a received packet is delivered to the rxe driver in rxe_net.c in rxe_udp_encap_recv(), rxe_get_dev_from_net() is called, which gets a pointer to the ib_device (contained in rxe_dev) and also takes a reference on the ib_device. This pointer is stored in skb->cb[] so the reference needs to be held until the skb is freed. If the skb has a multicast address and more than one QP belongs to the multicast group then new skbs are cloned in rxe_rcv_mcast_pkt() and each has a pointer to the ib_device. Since each skb can have quite different lifetimes they each need to carry a reference to ib_device to protect against having it deleted out from under them. You suggest adding one more reference outside of the loop regardless of how many QPs, if any, belong to the multicast group. I don't see how this can be correct.
>
> In any case this is *not* the if statement that is under discussion in the patch. That one has to do with an error which can occur if the last QP in the list (which gets the original skb in the non-error case) doesn't match or isn't ready to receive the packet and it fails either check_type_state() or check_keys() and falls out of the loop. Now the reference to the ib_device needs to be let go and the skb needs to be freed but only if this error occurs. In the normal case that all happens when the skb is done being processed after calling rxe_rcv_pkt().
>
> So the discussion boils down to whether to type
>
> ...
> err1:
> 	kfree_skb(skb);
> 	if (unlikely(skb))
> 		ib_device_put(&rxe->ib_dev);
>
> or
>
> err1:
> 	if (unlikely(skb)) {
> 		kfree_skb(skb);
> 		ib_device_put(&rxe->ib_dev);
> 	}
>
> Here the normal non-error path has skb == NULL and the error path has skb set to the originally delivered packet. The second choice is much clearer as it shows the intent and saves the wasted trip to kfree_skb() for every packet.

Can you please configure your mail client so your replies won't be one
long unreadable line? It will help us to read your replies and we will
be able to answer them separately.

Once rxe_rcv_mcast_pkt() is called, the device and the SKB are already
"connected" to each other, so I don't understand the claims about
different lifetimes. It is not the "if ()" but the whole idea that every
SKB increments the reference counter that sounds very strange.

All QPs, mcast groups and SKBs point to the same ib_dev, and even one
refcount is enough to ensure that it won't vanish. This is why it is
enough to call ib_device_try_get() at the beginning of this function.

Also the combination of ib_device_get() together with unlikely() to save
kfree call can't be right either.

Thanks

>
> bob
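
The per-skb accounting that this exchange is about can be modeled in
user space. What follows is a sketch of the scheme as described in the
quoted message (the arriving skb holds one reference, each clone takes
its own, and the last QP reuses the original), not the driver code and
not a verdict on either position. fake_dev, fake_skb, fake_receive(),
fake_mcast_fanout(), fake_free() and MAX_QPS are hypothetical names
invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_QPS 8 /* arbitrary bound for this model */

/* Hypothetical stand-in for ib_device: counts references held by live skbs. */
struct fake_dev {
	int refs;
};

/* Hypothetical stand-in for an skb that remembers which device it pins. */
struct fake_skb {
	struct fake_dev *dev;
	bool live;
};

/* Models the receive entry point: the arriving skb takes one reference. */
void fake_receive(struct fake_dev *dev, struct fake_skb *skb)
{
	dev->refs++;
	skb->dev = dev;
	skb->live = true;
}

/*
 * Models multicast fan-out: every QP except the last gets a clone that
 * takes its own reference; the last QP reuses the original skb together
 * with the reference it already holds. Returns the number of skbs handed out.
 */
int fake_mcast_fanout(struct fake_skb *orig, struct fake_skb out[], int nqps)
{
	assert(orig->live);
	assert(nqps >= 1 && nqps <= MAX_QPS);
	for (int i = 0; i < nqps - 1; i++) {
		orig->dev->refs++;
		out[i].dev = orig->dev;
		out[i].live = true;
	}
	out[nqps - 1] = *orig; /* original goes to the last QP */
	orig->live = false;    /* show we have consumed the skb */
	return nqps;
}

/* Freeing an skb drops the reference it carried. */
void fake_free(struct fake_skb *skb)
{
	assert(skb->live);
	skb->live = false;
	skb->dev->refs--;
}
```

With three QPs the model holds three references after fan-out, and the
count returns to zero only after all three skbs have been freed, in
whatever order that happens.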

* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-28 17:04         ` Bob Pearson
  2021-03-01  7:24           ` Leon Romanovsky
@ 2021-03-01  7:42           ` Zhu Yanjun
  1 sibling, 0 replies; 17+ messages in thread
From: Zhu Yanjun @ 2021-03-01  7:42 UTC (permalink / raw)
  To: Bob Pearson
  Cc: Leon Romanovsky, Jason Gunthorpe, RDMA mailing list, Frank Zago

On Mon, Mar 1, 2021 at 1:04 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 2/27/21 2:43 AM, Leon Romanovsky wrote:
> > On Fri, Feb 26, 2021 at 06:02:39PM -0600, Bob Pearson wrote:
> >> On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
> >>> On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
> >>>> Just a reminder. rxe in for-next is broken until this gets done.
> >>>> thanks
> >>>
> >>> I was expecting you to resend it? There seemed to be some changes
> >>> needed
> >>>
> >>> https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
> >>>
> >>> Jason
> >>>
> >> OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case not if there wasn't an error. So no reason to not put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.
> >
> > First, I posted a diff which makes this if() redundant.
> > Second, the if () before kfree() is checked by coccinelle and your
> > "should stay the way it is" will be marked as failure in many CIs,
> > including ours.
> >
> > Thanks
> >
> >>
> >> bob
>
> Leon,
>
> I am not sure we are talking about the same if statement. You wrote
>
> ...
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 8a48a33d587b..29cb0125e76f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -247,6 +247,11 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>         else if (skb->protocol == htons(ETH_P_IPV6))
>                 memcpy(&dgid, &ipv6_hdr(skb)->daddr, sizeof(dgid));
>
> +       if (!ib_device_try_get(&rxe->ib_dev)) {
> +               kfree_skb(skb);
> +               return;
> +       }
> +
>         /* lookup mcast group corresponding to mgid, takes a ref */
>         mcg = rxe_pool_get_key(&rxe->mc_grp_pool, &dgid);
>         if (!mcg)
> @@ -274,10 +279,6 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>                  */
>                 if (mce->qp_list.next != &mcg->qp_list) {
>                         per_qp_skb = skb_clone(skb, GFP_ATOMIC);
> -                       if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
> -                               kfree_skb(per_qp_skb);
> -                               continue;
> -                       }
>                 } else {
>                         per_qp_skb = skb;
>                         /* show we have consumed the skb */
> ...
>
> which I don't understand.
>
> When a received packet is delivered to the rxe driver in rxe_net.c in rxe_udp_encap_recv(), rxe_get_dev_from_net() is called, which gets a pointer to the ib_device (contained in rxe_dev) and also takes a reference on the ib_device. This pointer is stored in skb->cb[] so the reference needs to be held until the skb is freed. If the skb has a multicast address and more than one QP belongs to the multicast group then new skbs are cloned in rxe_rcv_mcast_pkt() and each has a pointer to the ib_device. Since each skb can have quite different lifetimes they each need to carry a reference to ib_device to protect against having it deleted out from under them. You suggest adding one more reference outside of the loop regardless of how many QPs, if any, belong to the multicast group. I don't see how this can be correct.
>
> In any case this is *not* the if statement that is under discussion in the patch. That one has to do with an error which can occur if the last QP in the list (which gets the original skb in the non-error case) doesn't match or isn't ready to receive the packet and it fails either check_type_state() or check_keys() and falls out of the loop. Now the reference to the ib_device needs to be let go and the skb needs to be freed but only if this error occurs. In the normal case that all happens when the skb is done being processed after calling rxe_rcv_pkt().
>
> So the discussion boils down to whether to type
>
> ...
> err1:
>         kfree_skb(skb);
>         if (unlikely(skb))
>                 ib_device_put(&rxe->ib_dev);
>
> or
>
> err1:
>         if (unlikely(skb)) {
>                 kfree_skb(skb);
>                 ib_device_put(&rxe->ib_dev);
>         }
>
> Here the normal non-error path has skb == NULL and the error path has skb set to the originally delivered packet. The second choice is much clearer as it shows the intent and saves the wasted trip to kfree_skb() for every packet.

Placing kfree_skb() inside an if (skb) test is not good.

Zhu Yanjun
>
> bob


* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-01  7:24           ` Leon Romanovsky
@ 2021-03-01 16:54             ` Bob Pearson
  2021-03-01 17:35               ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Bob Pearson @ 2021-03-01 16:54 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, zyjzyj2000, linux-rdma, Frank Zago

On 3/1/21 1:24 AM, Leon Romanovsky wrote:
> On Sun, Feb 28, 2021 at 11:04:08AM -0600, Bob Pearson wrote:
>> On 2/27/21 2:43 AM, Leon Romanovsky wrote:
>>> On Fri, Feb 26, 2021 at 06:02:39PM -0600, Bob Pearson wrote:
>>>> On 2/26/21 5:33 PM, Jason Gunthorpe wrote:
>>>>> On Fri, Feb 26, 2021 at 05:28:41PM -0600, Bob Pearson wrote:
>>>>>> Just a reminder. rxe in for-next is broken until this gets done.
>>>>>> thanks
>>>>>
>>>>> I was expecting you to resend it? There seemed to be some changes
>>>>> needed
>>>>>
>>>>> https://patchwork.kernel.org/project/linux-rdma/patch/20210214222630.3901-1-rpearson@hpe.com/
>>>>>
>>>>> Jason
>>>>>
>>>> OK. I see. I agreed to that complaint when the kfree was the only thing in the if {} but now I have to call ib_device_put() *only* in the error case, not if there wasn't an error. So there is no reason not to put the kfree_skb() in there too and avoid passing a NULL pointer. It should stay the way it is.
>>>
>>> First, I posted a diff which makes this if() redundant.
>>> Second, the if () before kfree() is checked by coccinelle and your
>>> "should stay the way it is" will be marked as failure in many CIs,
>>> including ours.
>>>
>>> Thanks
>>>
>>>>
>>>> bob
>>
>> Leon,
>>
>> I am not sure we are talking about the same if statement. You wrote
>>
>> ...
>> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
>> index 8a48a33d587b..29cb0125e76f 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
>> @@ -247,6 +247,11 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>>  	else if (skb->protocol == htons(ETH_P_IPV6))
>>  		memcpy(&dgid, &ipv6_hdr(skb)->daddr, sizeof(dgid));
>>
>> +	if (!ib_device_try_get(&rxe->ib_dev)) {
>> +		kfree_skb(skb);
>> +		return;
>> +	}
>> +
>>  	/* lookup mcast group corresponding to mgid, takes a ref */
>>  	mcg = rxe_pool_get_key(&rxe->mc_grp_pool, &dgid);
>>  	if (!mcg)
>> @@ -274,10 +279,6 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>>  		 */
>>  		if (mce->qp_list.next != &mcg->qp_list) {
>>  			per_qp_skb = skb_clone(skb, GFP_ATOMIC);
>> -			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
>> -				kfree_skb(per_qp_skb);
>> -				continue;
>> -			}
>>  		} else {
>>  			per_qp_skb = skb;
>>  			/* show we have consumed the skb */
>> ...
>>
>> which I don't understand.
>>
>> When a received packet is delivered to the rxe driver in rxe_net.c in rxe_udp_encap_recv(), rxe_get_dev_from_net() is called, which gets a pointer to the ib_device (contained in rxe_dev) and also takes a reference on the ib_device. This pointer is stored in skb->cb[] so the reference needs to be held until the skb is freed. If the skb has a multicast address and there is more than one QP belonging to the multicast group then new skbs are cloned in rxe_rcv_mcast_pkt() and each has a pointer to the ib_device. Since each skb can have quite different lifetimes they each need to carry a reference to ib_device to protect against having it deleted out from under them. You suggest adding one more reference outside of the loop regardless of how many QPs, if any, belong to the multicast group. I don't see how this can be correct.
>>
>> In any case this is *not* the if statement that is under discussion in the patch. That one has to do with an error which can occur if the last QP in the list (which gets the original skb in the non-error case) doesn't match or isn't ready to receive the packet and it fails either check_type_state() or check_keys() and falls out of the loop. Now the reference to the ib_device needs to be dropped and the skb needs to be freed, but only if this error occurs. In the normal case that all happens when the skb is done being processed after calling rxe_rcv_pkt().
>>
>> So the discussion boils down to whether to type
>>
>> ...
>> err1:
>> 	kfree_skb(skb);
>> 	if (unlikely(skb))
>> 		ib_device_put(&rxe->ib_dev);
>>
>> or
>>
>> err1:
>> 	if (unlikely(skb)) {
>> 		kfree_skb(skb);
>> 		ib_device_put(&rxe->ib_dev);
>> 	}
>>
>> Here the normal non-error path has skb == NULL and the error path has skb set to the originally delivered packet. The second choice is much clearer as it shows the intent and saves the wasted trip to kfree_skb() for every packet.
> 
> Can you please configure your mail client so your replies won't be one
> long unreadable lines? It will help us to read your replies and we will
> be able to answer them separately.
Sorry about that. It's Thunderbird. Looking for a solution.
> 
> Once the rxe_rcv_mcast_pkt() is called, the device and SKB are already
> "connected" each one to another, so I don't understand the claims about
> different lifetimes. It is not the "if ()", but the whole idea that every
> SKB increments reference counter sounds very strange.
The purpose of rxe_rcv_mcast_pkt() is to replicate the original received skb
as many times as necessary to enqueue to the multiple QPs which are listening to
the multicast address. Depending on the queue depths, which CPU, etc. they will
take more or less time to process. There is no telling which one of them will be
the last one to get done. Counting them is the easiest way to figure out when we
are complete.
> 
> All QPs, mcast groups and SKB points to the same ib_dev and even one
> refcount is enough to ensure that it won't vanish. This is why it is
> enough to call ib_device_try_get() at the beginning of this function.
I agree that ib_device_get/put is attempting to solve a problem that is not
really very critical since ib_device is very unlikely to be shut down in the
middle of a data transfer. The driver never worried about this for years.
But now that it's been put on the table it should be done right. A data packet
arriving is completely independent of the verbs API which *could* delete all the
QPs and shut down the HCA while it was wandering around the universe or worse
yet while the packet is being processed.
You are correct that one call will protect the address, but how will you decide
when you are ready to drop that one reference? You have N clones of the skb sitting
on queues waiting to get processed. Where do you stick the drop reference so that
it only happens once and only when all the skbs are done?
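A minimal userspace sketch of the per-skb counting described above, not rxe code. struct fake_dev, struct fake_skb and all helper names are hypothetical stand-ins for ib_device, sk_buff, ib_device_try_get() and ib_device_put(). It assumes each clone takes its own device reference and drops it when that clone is freed, so it does not matter which QP finishes last.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_QP 8

/* Hypothetical stand-ins; these are not the kernel types. */
struct fake_dev { int refcount; };
struct fake_skb { struct fake_dev *dev; };

/* Take a reference unless the device is already gone. */
static int dev_try_get(struct fake_dev *dev)
{
	if (dev->refcount == 0)
		return 0;
	dev->refcount++;
	return 1;
}

static void dev_put(struct fake_dev *dev)
{
	dev->refcount--;
}

/* Each clone carries its own device reference. */
static struct fake_skb *skb_clone_with_ref(struct fake_skb *skb)
{
	struct fake_skb *clone;

	if (!dev_try_get(skb->dev))
		return NULL;
	clone = malloc(sizeof(*clone));
	if (!clone) {
		dev_put(skb->dev);
		return NULL;
	}
	clone->dev = skb->dev;
	return clone;
}

/* Freeing a packet drops the reference that packet was holding. */
static void skb_free_with_ref(struct fake_skb *skb)
{
	struct fake_dev *dev = skb->dev;

	free(skb);
	dev_put(dev);
}

/*
 * Deliver one packet to nqp QPs. Returns the device refcount after every
 * copy has been freed and stores the highest refcount seen in *peak.
 */
int deliver_mcast(int nqp, int *peak)
{
	struct fake_dev dev = { .refcount = 1 };	/* registration reference */
	struct fake_skb *skbs[MAX_QP];
	int i;

	if (nqp < 1 || nqp > MAX_QP)
		return -1;

	/* receive path: one reference for the original packet */
	skbs[0] = malloc(sizeof(*skbs[0]));
	if (!skbs[0])
		return -1;
	dev_try_get(&dev);
	skbs[0]->dev = &dev;

	/* one clone (and one more reference) per additional QP */
	for (i = 1; i < nqp; i++) {
		skbs[i] = skb_clone_with_ref(skbs[0]);
		if (!skbs[i]) {
			nqp = i;
			break;
		}
	}
	*peak = dev.refcount;

	/* the copies may finish in any order; here the original goes first */
	for (i = 0; i < nqp; i++)
		skb_free_with_ref(skbs[i]);

	return dev.refcount;
}
```

With three QPs the count peaks at four (registration plus three packets in flight) and returns to the single registration reference once the last copy is freed, whichever copy that happens to be.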
> 
> Also the combination of ib_device_get() together with unlikely() to save
> kfree call can't be right either.
That is not why they are there. The ib_device_put is there because the skb holds
a reference to the ib_device so it needs to be dropped. The unlikely is because it
is actually unlikely. It only happens when there is a bad packet.
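The effect of the guard at err1 can be checked in a small userspace model. This is a sketch under assumptions, not the driver code: struct fake_dev, struct fake_skb and mcast_exit_path() are hypothetical names, and the model assumes that the local skb pointer is set to NULL once the last QP has consumed the original packet, with the device reference travelling along with that packet. It compares the guarded exit path with an unconditional put.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the kernel types. */
struct fake_dev { int refcount; };
struct fake_skb { struct fake_dev *dev; };

/* Like kfree_skb(), freeing a NULL packet is a no-op. */
static void fake_kfree_skb(struct fake_skb *skb)
{
	free(skb);
}

/*
 * Model of the tail of a multicast receive function.
 *   consumed: the last QP accepted the original packet, so ownership of
 *             the packet and of its device reference was handed off.
 *   guarded:  the exit path only frees and puts when skb is non-NULL.
 * Returns the final device refcount: 1 is balanced, 0 is one put too many.
 */
int mcast_exit_path(int consumed, int guarded)
{
	struct fake_dev dev = { .refcount = 2 };	/* registration + this packet */
	struct fake_skb *skb = malloc(sizeof(*skb));
	struct fake_skb *handed_off = NULL;

	if (!skb)
		return -1;
	skb->dev = &dev;

	if (consumed) {
		handed_off = skb;
		skb = NULL;	/* show we have consumed the skb */
	}

	/* exit path */
	if (guarded) {
		if (skb) {
			fake_kfree_skb(skb);
			dev.refcount--;
		}
	} else {
		fake_kfree_skb(skb);
		dev.refcount--;	/* wrong when consumed: the reference is not ours */
	}

	/* later, the new owner frees the packet and drops its reference */
	if (handed_off) {
		fake_kfree_skb(handed_off);
		dev.refcount--;
	}

	return dev.refcount;
}
```

Only the combination of a consumed packet and an unguarded put ends below the registration reference, which is the underflow the guard is meant to prevent.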
> 
> Thanks
> 
>>
>> bob



* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-01 16:54             ` Bob Pearson
@ 2021-03-01 17:35               ` Jason Gunthorpe
  2021-03-01 18:20                 ` Pearson, Robert B
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2021-03-01 17:35 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Leon Romanovsky, zyjzyj2000, linux-rdma, Frank Zago

On Mon, Mar 01, 2021 at 10:54:21AM -0600, Bob Pearson wrote:

> I agree that ib_device_get/put is attempting to solve a problem that is not
> really very critical since ib_device is very unlikely to be shut down in the
> middle of a data transfer. The driver never worried about this for years.
> But now that it's been put on the table it should be done right. A data packet
> arriving is completely independent of the verbs API which *could* delete all the
> QPs and shut down the HCA while it was wandering around the universe or worse
> yet while the packet is being processed.

If driver shutdown can guarantee that all pointers involved in
multicast are revoked before shutdown can finish then you don't need
this refcounting.

It was only brought up because the API that returns the ib_device from
the netdev requires the refcounts as it is general purpose

Jason


* RE: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-01 17:35               ` Jason Gunthorpe
@ 2021-03-01 18:20                 ` Pearson, Robert B
  2021-03-01 18:27                   ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Pearson, Robert B @ 2021-03-01 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Bob Pearson
  Cc: Leon Romanovsky, zyjzyj2000, linux-rdma, Zago, Frank


> From: Jason Gunthorpe <jgg@nvidia.com> 
> Sent: Monday, March 1, 2021 11:36 AM
> To: Bob Pearson <rpearsonhpe@gmail.com>
> Cc: Leon Romanovsky <leon@kernel.org>; zyjzyj2000@gmail.com; linux-rdma@vger.kernel.org; Zago, Frank <frank.zago@hpe.com>
> Subject: Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)

> On Mon, Mar 01, 2021 at 10:54:21AM -0600, Bob Pearson wrote:

>> I agree that ib_device_get/put is attempting to solve a problem that 
>> is not really very critical since ib_device is very unlikely to be 
>> shut down in the middle of a data transfer. The driver never worried about this for years.
>> But now that it's been put on the table it should be done right. A 
>> data packet arriving is completely independent of the verbs API which 
>> *could* delete all the QPs and shut down the HCA while it was 
>> wandering around the universe or worse yet while the packet is being processed.

> If driver shutdown can guarantee that all pointers involved in multicast are revoked before shutdown can finish then you don't need this
> refcounting.

> It was only brought up because the API that returns the ib_device from the netdev requires the refcounts as it is general purpose

> Jason
Unfortunately what you ask for is exactly what the refcounting code accomplishes and I don't see a simpler way to get there.
This also applies to the non-multicast packets as well but all the debate has been about the code in rxe_rcv_mcast_pkt()
because it is more blatant there or because I haven't been able to explain how it works well enough.

Bob


* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-01 18:20                 ` Pearson, Robert B
@ 2021-03-01 18:27                   ` Jason Gunthorpe
  2021-03-02  8:11                     ` Leon Romanovsky
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2021-03-01 18:27 UTC (permalink / raw)
  To: Pearson, Robert B
  Cc: Bob Pearson, Leon Romanovsky, zyjzyj2000, linux-rdma, Zago, Frank

On Mon, Mar 01, 2021 at 06:20:06PM +0000, Pearson, Robert B wrote:
> > On Mon, Mar 01, 2021 at 10:54:21AM -0600, Bob Pearson wrote:
> 
> >> I agree that ib_device_get/put is attempting to solve a problem that 
> >> is not really very critical since ib_device is very unlikely to be 
> >> shut down in the middle of a data transfer. The driver never worried about this for years.
> >> But now that it's been put on the table it should be done right. A 
> >> data packet arriving is completely independent of the verbs API which 
> >> *could* delete all the QPs and shut down the HCA while it was 
> >> wandering around the universe or worse yet while the packet is being processed.
> 
> > If driver shutdown can guarantee that all pointers involved in
> > multicast are revoked before shutdown can finish then you don't
> > need this refcounting.
> 
> > It was only brought up because the API that returns the ib_device
> > from the netdev requires the refcounts as it is general purpose
> 
> Unfortunately what you ask for is exactly what the refcounting code
> accomplishes and I don't see a simpler way to get there.  This also
> applies to the non-multicast packets as well but all the debate has
> been about the code in rxe_rcv_mcast_pkt() because it is more
> blatant there or because I haven't been able to explain how it works
> well enough.

Usually in the netstack land the shutdown of the device flushes all
this parallel work out so all the dataplane can happily ignore all
these details.

I'm not so clear on all these details and how they apply to rxe of
course. You'd have to look at the full lifecycle of this skb and show
that the kfree_skb happens only before any unregistration finishes.

Most likely there are other bugs if the unregistration can pass while
the skb is still out there.

But, I'm not clear on how any of this works in rxe, this is just a
general remark on how things should ideally work.

Jason


* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-02-14 22:26 [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again) Bob Pearson
                   ` (2 preceding siblings ...)
  2021-02-26 23:28 ` Bob Pearson
@ 2021-03-02  5:19 ` Zhu Yanjun
  2021-03-02  7:26   ` Robert Pearson
  3 siblings, 1 reply; 17+ messages in thread
From: Zhu Yanjun @ 2021-03-02  5:19 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, RDMA mailing list, Bob Pearson

On Mon, Feb 15, 2021 at 6:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> Three errors occurred in the fix referenced below.
>
> 1) The on and off again 'if (skb)' got dropped but was really
> needed in rxe_rcv_mcast_pkt() to prevent calling ib_device_put()
> on the non-error path.
>
> 2) Extending the reference taken by rxe_get_dev_from_net() in
> rxe_udp_encap_recv() until each skb is freed was not matched by
> a reference in the loopback path resulting in underflows.
>
> 3) In rxe_comp.c the function free_pkt() did not clear skb which
> triggered a warning at done: and could possibly at exit: in
> rxe_completer(). The WARN_ONCE() calls are not required at done:
> and only in one place before going to exit.
>
> This patch fixes these errors.
>
> Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()")
> Signed-off-by: Bob Pearson <rpearson@hpe.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
>  drivers/infiniband/sw/rxe/rxe_net.c  | 7 ++++++-
>  drivers/infiniband/sw/rxe/rxe_recv.c | 6 ++++--
>  3 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index a8ac791a1bb9..13fc5a1cced1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -671,6 +671,9 @@ int rxe_completer(void *arg)
>                          * it down the road or let it expire
>                          */
>
> +                       /* warn if we did receive a packet */
> +                       WARN_ON_ONCE(skb);
> +
>                         /* there is nothing to retry in this case */
>                         if (!wqe || (wqe->state == wqe_state_posted))
>                                 goto exit;
> @@ -750,7 +753,6 @@ int rxe_completer(void *arg)
>         /* we come here if we are done with processing and want the task to
>          * exit from the loop calling us
>          */
> -       WARN_ON_ONCE(skb);
>         rxe_drop_ref(qp);
>         return -EAGAIN;
>
> @@ -758,7 +760,6 @@ int rxe_completer(void *arg)
>         /* we come here if we have processed a packet we want the task to call
>          * us again to see if there is anything else to do
>          */
> -       WARN_ON_ONCE(skb);

When I keep this "WARN_ON_ONCE(skb);", apply the other changes to -net, and then
run the "rping" command, I get the following warning.
It seems that SKBs are not freed.

"
[ 4068.003830] ------------[ cut here ]------------
[ 4068.003833] WARNING: CPU: 10 PID: 4241 at
drivers/infiniband/sw/rxe//rxe_comp.c:762 rxe_completer+0x982/0xd60
[rdma_rxe]
[ 4068.003845] Modules linked in: rdma_rxe(OE) rdma_ucm rdma_cm iw_cm
ib_cm ib_uverbs ib_core dm_multipath scsi_dh_rdac scsi_dh_emc
scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common nfit rapl
snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore
joydev input_leds serio_raw mac_hid sch_fq_codel ip_tables x_tables
autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid
vmwgfx ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
drm_kms_helper syscopyarea sysfillrect aesni_intel sysimgblt
fb_sys_fops crypto_simd psmouse cec cryptd ahci drm e1000 i2c_piix4
libahci pata_acpi video [last unloaded: rdma_rxe]
[ 4068.003898] CPU: 10 PID: 4241 Comm: rping Kdump: loaded Tainted: G
      W  OE     5.11.0+ #1
[ 4068.003901] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[ 4068.003902] RIP: 0010:rxe_completer+0x982/0xd60 [rdma_rxe]
[ 4068.003907] Code: 8d 7f 08 e8 70 94 39 de e9 f4 f6 ff ff 39 ca b8
0e 00 00 00 ba 03 00 00 00 0f 44 c2 89 c3 e9 44 f7 ff ff 0f 0b e9 af
f9 ff ff <0f> 0b e9 23 f9 ff ff 0f 0b e9 ef f9 ff ff 49 8d 9f 30 08 00
00 48
[ 4068.003909] RSP: 0018:ffffb3d7c0bf37f0 EFLAGS: 00010286
[ 4068.003911] RAX: 0000000000000008 RBX: 000000000000000e RCX: ffffffff9fa7fee8
[ 4068.003912] RDX: 0000000000000507 RSI: ffffffff9ef6e47e RDI: ffff939ace4b0000
[ 4068.003913] RBP: ffffb3d7c0bf3850 R08: ffff939ac30f1c00 R09: 0000000000000001
[ 4068.003915] R10: ffffb3d7c0bf3888 R11: 0000000000000000 R12: ffffb3d7c0127180
[ 4068.003916] R13: ffff939ad166a728 R14: 0000000000000000 R15: ffff939ad0fd8000
[ 4068.003917] FS:  00007f1f105b4740(0000) GS:ffff939dcfc80000(0000)
knlGS:0000000000000000
[ 4068.003919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4068.003920] CR2: 00007f9bc63f7fb8 CR3: 000000010e12a002 CR4: 00000000000706e0
[ 4068.003923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4068.003924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4068.003926] Call Trace:
[ 4068.003929]  rxe_do_task+0xa7/0xf0 [rdma_rxe]
[ 4068.003934]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 4068.003938]  rxe_comp_queue_pkt+0x48/0x50 [rdma_rxe]
[ 4068.003942]  rxe_rcv+0x339/0x860 [rdma_rxe]
[ 4068.003946]  ? prepare_ack_packet+0x1b6/0x250 [rdma_rxe]
[ 4068.003951]  rxe_loopback+0x53/0x90 [rdma_rxe]
[ 4068.003955]  send_ack+0xac/0x170 [rdma_rxe]
[ 4068.003959]  rxe_responder+0x15bf/0x2220 [rdma_rxe]
[ 4068.003964]  rxe_do_task+0xa7/0xf0 [rdma_rxe]
[ 4068.003968]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 4068.003972]  rxe_resp_queue_pkt+0x44/0x50 [rdma_rxe]
[ 4068.003976]  rxe_rcv+0x286/0x860 [rdma_rxe]
[ 4068.003979]  ? copy_data+0xc4/0x2a0 [rdma_rxe]
[ 4068.003984]  rxe_loopback+0x53/0x90 [rdma_rxe]
[ 4068.003987]  rxe_requester+0x6ec/0x10a0 [rdma_rxe]
[ 4068.003991]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 4068.003996]  rxe_do_task+0xa7/0xf0 [rdma_rxe]
[ 4068.004000]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 4068.004004]  rxe_post_send+0x320/0x530 [rdma_rxe]
[ 4068.004008]  ? lookup_get_idr_uobject+0x19/0x30 [ib_uverbs]
[ 4068.004016]  ? rdma_lookup_get_uobject+0x47/0x180 [ib_uverbs]
[ 4068.004022]  ib_uverbs_post_send+0x64d/0x6e0 [ib_uverbs]
[ 4068.004028]  ? __wake_up+0x13/0x20
[ 4068.004033]  ib_uverbs_write+0x44f/0x580 [ib_uverbs]
[ 4068.004039]  vfs_write+0xb9/0x250
[ 4068.004043]  ksys_write+0xb1/0xe0
[ 4068.004045]  __x64_sys_write+0x1a/0x20
[ 4068.004048]  do_syscall_64+0x38/0x90
[ 4068.004052]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 4068.004054] RIP: 0033:0x7f1f1087f2cf
[ 4068.004056] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd
ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 5c fd ff
ff 48
[ 4068.004058] RSP: 002b:00007ffd14eb1d00 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 4068.004060] RAX: ffffffffffffffda RBX: 00007f1f0fcd6180 RCX: 00007f1f1087f2cf
[ 4068.004061] RDX: 0000000000000020 RSI: 00007ffd14eb1d60 RDI: 0000000000000004
[ 4068.004062] RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000027
[ 4068.004063] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
[ 4068.004064] R13: 000055ad34999e80 R14: 0000000000000000 R15: 0000000000000000
[ 4068.004066] ---[ end trace 719f4d5687d4ac94 ]---
"

>         rxe_drop_ref(qp);
>         return 0;
>  }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 36d56163afac..8e81df578552 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -406,12 +406,17 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>
>  void rxe_loopback(struct sk_buff *skb)
>  {
> +       struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> +
>         if (skb->protocol == htons(ETH_P_IP))
>                 skb_pull(skb, sizeof(struct iphdr));
>         else
>                 skb_pull(skb, sizeof(struct ipv6hdr));
>
> -       rxe_rcv(skb);
> +       if (WARN_ON(!ib_device_try_get(&pkt->rxe->ib_dev)))
> +               kfree_skb(skb);
> +       else
> +               rxe_rcv(skb);
>  }
>
>  struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 8a48a33d587b..a5e330e3bbce 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -299,8 +299,10 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
>
>  err1:
>         /* free skb if not consumed */
> -       kfree_skb(skb);
> -       ib_device_put(&rxe->ib_dev);
> +       if (unlikely(skb)) {
> +               kfree_skb(skb);
> +               ib_device_put(&rxe->ib_dev);
> +       }
>  }
>
>  /**
> --
> 2.27.0
>


* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-02  5:19 ` Zhu Yanjun
@ 2021-03-02  7:26   ` Robert Pearson
  0 siblings, 0 replies; 17+ messages in thread
From: Robert Pearson @ 2021-03-02  7:26 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jason Gunthorpe, RDMA mailing list, Bob Pearson

Resending in plain text (I hope)

Zhu,

You caught a bug of mine. Thanks.

Before the patch there were WARN_ON_ONCE() at exit: and done:
but almost all the goto done or goto exit followed calls to free_pkt()
which calls kfree_skb(skb) on the skb but may not
clear the local variable skb. This part of the patch was trying to
clean this up by getting rid of the warnings.
Remember pkt points to skb->cb[] so one can convert between skb and
pkt. There were two exceptions, one in case COMPST_EXIT
and one in COMPST_ERROR_RETRY where there is not a call to free_pkt()
before the goto.
COMPST_EXIT is returned from get_wqe() but only if pkt == NULL so
there is no skb in that case. However
COMPST_ERROR_RETRY is returned from check_psn() and check_ack() and in
both cases there are response packets
present!!
If !wqe || wqe->state == wqe_state_posted the original code went to
exit, where it would have hit the warning, so I tried to replace it with
one there in the code, but it is in the wrong place. To match the
functionality of the original code it should have been inside the if.
It should have been

if (!wqe || (wqe->state == wqe_state_posted)) {
        WARN_ON_ONCE(skb);
        goto exit;
}

From the comment above this, it looks like they were testing for a
"spurious" timer firing and just ignoring it. This will
cause the WARNING but it may not be critical. But since I put the
WARNING too early it will fire in cases where that was
not the intent. If we move the WARN down a line or two it should fix
your current problem.
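The difference between the two placements can be written out as a tiny truth-table model. This is only a sketch of the control flow described above, not code from rxe_comp.c: enum wqe_state_model and retry_state_warns() are hypothetical names, and WQE_PENDING stands for any wqe state that still has something to retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the wqe in the retry state. */
enum wqe_state_model {
	WQE_NONE,	/* no wqe at all */
	WQE_POSTED,	/* posted but never sent: nothing to retry */
	WQE_PENDING	/* something is outstanding and can be retried */
};

/*
 * Returns true if the warning would fire on entry to the retry state.
 *   have_skb:        a response packet is present
 *   warn_inside_if:  the warning sits inside the "nothing to retry" branch
 *                    instead of in front of it
 */
bool retry_state_warns(enum wqe_state_model wqe, bool have_skb,
		       bool warn_inside_if)
{
	bool nothing_to_retry = (wqe == WQE_NONE || wqe == WQE_POSTED);

	if (!warn_inside_if)
		return have_skb;	/* fires for every packet that reaches this state */

	return nothing_to_retry && have_skb;	/* fires only on the way to exit */
}
```

With the warning in front of the test, any response packet that leads to a retry triggers it. With the warning inside the branch, only a packet that arrives when there is nothing to retry does.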

Bob



* Re: [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again)
  2021-03-01 18:27                   ` Jason Gunthorpe
@ 2021-03-02  8:11                     ` Leon Romanovsky
  0 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-03-02  8:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pearson, Robert B, Bob Pearson, zyjzyj2000, linux-rdma, Zago, Frank

On Mon, Mar 01, 2021 at 02:27:11PM -0400, Jason Gunthorpe wrote:
> On Mon, Mar 01, 2021 at 06:20:06PM +0000, Pearson, Robert B wrote:
> > > On Mon, Mar 01, 2021 at 10:54:21AM -0600, Bob Pearson wrote:
> >
> > >> I agree that ib_device_get/put is attempting to solve a problem that
> > >> is not really very critical, since ib_device is very unlikely to be
> > >> shut down in the middle of a data transfer. The driver never worried
> > >> about this for years. But now that it's been put on the table it
> > >> should be done right. A data packet arriving is completely
> > >> independent of the verbs API, which *could* delete all the QPs and
> > >> shut down the HCA while the packet was wandering around the universe
> > >> or, worse yet, while the packet is being processed.
> >
> > > If driver shutdown can guarantee that all pointers involved in
> > > multicast are revoked before shutdown can finish then you don't
> > > need this refcounting.
> >
> > > It was only brought up because the API that returns the ib_device
> > > from the netdev requires the refcounts as it is general purpose
> >
> > Unfortunately what you ask for is exactly what the refcounting code
> > accomplishes and I don't see a simpler way to get there.  This also
> > applies to the non-multicast packets as well but all the debate has
> > been about the code in rxe_rcv_mcast_pkt() because it is more
> > blatant there or because I haven't been able to explain how it works
> > well enough.
>
> Usually in the netstack land the shutdown of the device flushes all
> this parallel work out so all the dataplane can happily ignore all
> these details.
>
> I'm not so clear on all these details and how they apply to rxe of
> course. You'd have to look at the full lifecycle of this skb and show
> that the kfree_skb happens only before any unregistration finishes.
>
> Most likely there are other bugs if the unregistration can pass while
> the skb is still out there.
>
> But, I'm not clear on how any of this works in rxe, this is just a
> general remark on how things should ideally work.

+1, I have the same understanding: I expect in-flight SKBs to be flushed
and new SKBs to be prevented from entering the ib_device while it is
being destroyed.

Thanks
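
For readers following the thread, the get/put pairing rule under
discussion can be illustrated with a minimal userspace sketch. It is not
the rxe code: model_dev, model_skb and every model_* function are
hypothetical names, and the increment-unless-zero loop only assumes that
a try-get must fail once the count has reached zero. It models fix 2)
from the patch: the loopback path takes its own reference before handing
the skb to the receive path, which drops one reference when it frees the
skb.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dev {
	atomic_int refs;	/* 0 means the device is going away */
};

struct model_skb {
	struct model_dev *dev;
};

/* Take a reference unless the count has already dropped to zero. */
static bool model_dev_try_get(struct model_dev *dev)
{
	int old = atomic_load(&dev->refs);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&dev->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the assert catches a put with no matching get. */
static void model_dev_put(struct model_dev *dev)
{
	int old = atomic_fetch_sub(&dev->refs, 1);

	assert(old > 0);
}

static struct model_skb *model_alloc_skb(struct model_dev *dev)
{
	struct model_skb *skb = malloc(sizeof(*skb));

	if (skb)
		skb->dev = dev;
	return skb;
}

/* Receive path: frees the skb and drops the reference that came with it. */
static void model_rcv(struct model_skb *skb)
{
	struct model_dev *dev = skb->dev;

	free(skb);
	model_dev_put(dev);
}

/*
 * Loopback path: no network receive hook took a reference on our behalf,
 * so take one here. If the device is already going away, only free the skb.
 */
static bool model_loopback(struct model_skb *skb)
{
	if (!skb)
		return false;
	if (!model_dev_try_get(skb->dev)) {
		free(skb);
		return false;
	}
	model_rcv(skb);
	return true;
}
```

With the count starting at 1 (standing in for the registration
reference), a loopback send leaves it at 1. After a final put brings it
to 0, a further loopback attempt is refused and the skb is freed without
touching the count, so there is no underflow.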

>
> Jason

end of thread, other threads:[~2021-03-02 17:12 UTC | newest]

Thread overview: 17+ messages
2021-02-14 22:26 [PATCH for-next] RDMA/rxe: Fix ib_device reference counting (again) Bob Pearson
2021-02-15  3:46 ` Zhu Yanjun
2021-02-15  5:59 ` Leon Romanovsky
2021-02-26 23:28 ` Bob Pearson
2021-02-26 23:33   ` Jason Gunthorpe
2021-02-27  0:02     ` Bob Pearson
2021-02-27  8:43       ` Leon Romanovsky
2021-02-28 17:04         ` Bob Pearson
2021-03-01  7:24           ` Leon Romanovsky
2021-03-01 16:54             ` Bob Pearson
2021-03-01 17:35               ` Jason Gunthorpe
2021-03-01 18:20                 ` Pearson, Robert B
2021-03-01 18:27                   ` Jason Gunthorpe
2021-03-02  8:11                     ` Leon Romanovsky
2021-03-01  7:42           ` Zhu Yanjun
2021-03-02  5:19 ` Zhu Yanjun
2021-03-02  7:26   ` Robert Pearson
