All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
@ 2021-06-04 23:05 Bob Pearson
  2021-06-04 23:18 ` Bob Pearson
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Bob Pearson @ 2021-06-04 23:05 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently the rdma_rxe driver attempts to protect atomic responder
resources by taking a reference to the qp which is only freed when the
resource is recycled for a new read or atomic operation. This means that
in normal circumstances there is almost always an extra qp reference
once an atomic operation has been executed which prevents cleaning up
the qp and associated pd and cqs when the qp is destroyed.

This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
destroyed while a peer is retrying an atomic op it will cause the
operation to fail which is acceptable.

Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c   | 1 -
 drivers/infiniband/sw/rxe/rxe_resp.c | 2 --
 2 files changed, 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 34ae957a315c..b6d83d82e4f9 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -136,7 +136,6 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
 {
 	if (res->type == RXE_ATOMIC_MASK) {
-		rxe_drop_ref(qp);
 		kfree_skb(res->atomic.skb);
 	} else if (res->type == RXE_READ_MASK) {
 		if (res->read.mr)
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 2b220659bddb..39dc39be586e 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -966,8 +966,6 @@ static int send_atomic_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 		goto out;
 	}
 
-	rxe_add_ref(qp);
-
 	res = &qp->resp.resources[qp->resp.res_head];
 	free_rd_atomic_resource(qp, res);
 	rxe_advance_resp_resource(qp);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-04 23:05 [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops Bob Pearson
@ 2021-06-04 23:18 ` Bob Pearson
  2021-06-07  8:16 ` Zhu Yanjun
  2021-06-16 23:21 ` Jason Gunthorpe
  2 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-06-04 23:18 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma

On 6/4/21 6:05 PM, Bob Pearson wrote:
> Currently the rdma_rxe driver attempts to protect atomic responder
> resources by taking a reference to the qp which is only freed when the
> resource is recycled for a new read or atomic operation. This means that
> in normal circumstances there is almost always an extra qp reference
> once an atomic operation has been executed which prevents cleaning up
> the qp and associated pd and cqs when the qp is destroyed.
> 
> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> destroyed while a peer is retrying an atomic op it will cause the
> operation to fail which is acceptable.
> 
> Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
> Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_qp.c   | 1 -
>  drivers/infiniband/sw/rxe/rxe_resp.c | 2 --
>  2 files changed, 3 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 34ae957a315c..b6d83d82e4f9 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -136,7 +136,6 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
>  void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
>  {
>  	if (res->type == RXE_ATOMIC_MASK) {
> -		rxe_drop_ref(qp);
>  		kfree_skb(res->atomic.skb);
>  	} else if (res->type == RXE_READ_MASK) {
>  		if (res->read.mr)
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 2b220659bddb..39dc39be586e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -966,8 +966,6 @@ static int send_atomic_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>  		goto out;
>  	}
>  
> -	rxe_add_ref(qp);
> -
>  	res = &qp->resp.resources[qp->resp.res_head];
>  	free_rd_atomic_resource(qp, res);
>  	rxe_advance_resp_resource(qp);
> 
There is another way to fix this which is to add cleanup_rd_atomic_resources to rxe_qp_destroy() but it has the exact same behavior because it removes the responder resource causing a retried operation to fail. You can't add it to the qp cleanup routines because you never get there until the ref count goes to zero. I think the above is the cleanest solution.

Bob

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-04 23:05 [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops Bob Pearson
  2021-06-04 23:18 ` Bob Pearson
@ 2021-06-07  8:16 ` Zhu Yanjun
  2021-06-07 11:03   ` Leon Romanovsky
  2021-06-07 16:11   ` Pearson, Robert B
  2021-06-16 23:21 ` Jason Gunthorpe
  2 siblings, 2 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-06-07  8:16 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, RDMA mailing list

On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> Currently the rdma_rxe driver attempts to protect atomic responder
> resources by taking a reference to the qp which is only freed when the
> resource is recycled for a new read or atomic operation. This means that
> in normal circumstances there is almost always an extra qp reference
> once an atomic operation has been executed which prevents cleaning up
> the qp and associated pd and cqs when the qp is destroyed.
>
> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is

Not sure if it is a good way to fix this problem by removing the call
to rxe_add_ref.
Because taking a reference to the qp is to protect atomic responder resources.

Removing rxe_add_ref is to decrease the protection of the atomic
responder resources.

Zhu Yanjun

> destroyed while a peer is retrying an atomic op it will cause the
> operation to fail which is acceptable.
>
> Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
> Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_qp.c   | 1 -
>  drivers/infiniband/sw/rxe/rxe_resp.c | 2 --
>  2 files changed, 3 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 34ae957a315c..b6d83d82e4f9 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -136,7 +136,6 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
>  void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
>  {
>         if (res->type == RXE_ATOMIC_MASK) {
> -               rxe_drop_ref(qp);
>                 kfree_skb(res->atomic.skb);
>         } else if (res->type == RXE_READ_MASK) {
>                 if (res->read.mr)
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 2b220659bddb..39dc39be586e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -966,8 +966,6 @@ static int send_atomic_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>                 goto out;
>         }
>
> -       rxe_add_ref(qp);
> -
>         res = &qp->resp.resources[qp->resp.res_head];
>         free_rd_atomic_resource(qp, res);
>         rxe_advance_resp_resource(qp);
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-07  8:16 ` Zhu Yanjun
@ 2021-06-07 11:03   ` Leon Romanovsky
  2021-06-07 11:12     ` Zhu Yanjun
  2021-06-07 16:11   ` Pearson, Robert B
  1 sibling, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2021-06-07 11:03 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Bob Pearson, Jason Gunthorpe, RDMA mailing list

On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> >
> > Currently the rdma_rxe driver attempts to protect atomic responder
> > resources by taking a reference to the qp which is only freed when the
> > resource is recycled for a new read or atomic operation. This means that
> > in normal circumstances there is almost always an extra qp reference
> > once an atomic operation has been executed which prevents cleaning up
> > the qp and associated pd and cqs when the qp is destroyed.
> >
> > This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> > call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> 
> Not sure if it is a good way to fix this problem by removing the call
> to rxe_add_ref.
> Because taking a reference to the qp is to protect atomic responder resources.
> 
> Removing rxe_add_ref is to decrease the protection of the atomic
> responder resources.

All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.

Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-07 11:03   ` Leon Romanovsky
@ 2021-06-07 11:12     ` Zhu Yanjun
  2021-06-07 16:14       ` Pearson, Robert B
  0 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2021-06-07 11:12 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Bob Pearson, Jason Gunthorpe, RDMA mailing list

On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
> > On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> > >
> > > Currently the rdma_rxe driver attempts to protect atomic responder
> > > resources by taking a reference to the qp which is only freed when the
> > > resource is recycled for a new read or atomic operation. This means that
> > > in normal circumstances there is almost always an extra qp reference
> > > once an atomic operation has been executed which prevents cleaning up
> > > the qp and associated pd and cqs when the qp is destroyed.
> > >
> > > This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> > > call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> >
> > Not sure if it is a good way to fix this problem by removing the call
> > to rxe_add_ref.
> > Because taking a reference to the qp is to protect atomic responder resources.
> >
> > Removing rxe_add_ref is to decrease the protection of the atomic
> > responder resources.
>
> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
>

I made tests with this commit. After this commit is applied, this
problem disappeared.

Zhu Yanjun

> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-07  8:16 ` Zhu Yanjun
  2021-06-07 11:03   ` Leon Romanovsky
@ 2021-06-07 16:11   ` Pearson, Robert B
  1 sibling, 0 replies; 12+ messages in thread
From: Pearson, Robert B @ 2021-06-07 16:11 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jason Gunthorpe, RDMA mailing list


On 6/7/2021 3:16 AM, Zhu Yanjun wrote:
> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>> Currently the rdma_rxe driver attempts to protect atomic responder
>> resources by taking a reference to the qp which is only freed when the
>> resource is recycled for a new read or atomic operation. This means that
>> in normal circumstances there is almost always an extra qp reference
>> once an atomic operation has been executed which prevents cleaning up
>> the qp and associated pd and cqs when the qp is destroyed.
>>
>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> Not sure if it is a good way to fix this problem by removing the call
> to rxe_add_ref.
> Because taking a reference to the qp is to protect atomic responder resources.

There is no good way to 'protect' responder resources. They are there to 
recover from a lost response packet which can occur multiple times. The 
peer can retry the atomic request an unpredictable number of times. 
There are no acks for response packets. So the QP just has to leave it 
in place until the max number of read/atomic requests has been received 
and then it re-uses the responder resource. So to 'protect' the 
responder resource means you have to leave the QP around potentially 
forever which is the root cause of the bug. In fact you should treat a 
retry of a read/atomic the same as a new request and when the QP is 
destroyed it stops responding to new requests.

There is a reference to the QP taken by a received packet which lasts 
until the packet is freed to prevent kernel seg faults if the QP is 
destroyed while a request is in process of being responded to.

>
> Removing rxe_add_ref is to decrease the protection of the atomic
> responder resources.
>
> Zhu Yanjun
>
>> destroyed while a peer is retrying an atomic op it will cause the
>> operation to fail which is acceptable.
>>
>> Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
>> Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>> ---
>>   drivers/infiniband/sw/rxe/rxe_qp.c   | 1 -
>>   drivers/infiniband/sw/rxe/rxe_resp.c | 2 --
>>   2 files changed, 3 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
>> index 34ae957a315c..b6d83d82e4f9 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
>> @@ -136,7 +136,6 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
>>   void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
>>   {
>>          if (res->type == RXE_ATOMIC_MASK) {
>> -               rxe_drop_ref(qp);
>>                  kfree_skb(res->atomic.skb);
>>          } else if (res->type == RXE_READ_MASK) {
>>                  if (res->read.mr)
>> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
>> index 2b220659bddb..39dc39be586e 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
>> @@ -966,8 +966,6 @@ static int send_atomic_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>>                  goto out;
>>          }
>>
>> -       rxe_add_ref(qp);
>> -
>>          res = &qp->resp.resources[qp->resp.res_head];
>>          free_rd_atomic_resource(qp, res);
>>          rxe_advance_resp_resource(qp);
>> --
>> 2.30.2
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-07 11:12     ` Zhu Yanjun
@ 2021-06-07 16:14       ` Pearson, Robert B
  2021-06-08  1:39         ` Zhu Yanjun
  0 siblings, 1 reply; 12+ messages in thread
From: Pearson, Robert B @ 2021-06-07 16:14 UTC (permalink / raw)
  To: Zhu Yanjun, Leon Romanovsky; +Cc: Jason Gunthorpe, RDMA mailing list


On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
> On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
>>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>> Currently the rdma_rxe driver attempts to protect atomic responder
>>>> resources by taking a reference to the qp which is only freed when the
>>>> resource is recycled for a new read or atomic operation. This means that
>>>> in normal circumstances there is almost always an extra qp reference
>>>> once an atomic operation has been executed which prevents cleaning up
>>>> the qp and associated pd and cqs when the qp is destroyed.
>>>>
>>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
>>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
>>> Not sure if it is a good way to fix this problem by removing the call
>>> to rxe_add_ref.
>>> Because taking a reference to the qp is to protect atomic responder resources.
>>>
>>> Removing rxe_add_ref is to decrease the protection of the atomic
>>> responder resources.
>> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
>>
> I made tests with this commit. After this commit is applied, this
> problem disappeared.
You were testing MW when you saw this bug. Does that mean that now MW is 
working for you?
>
> Zhu Yanjun
>
>> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-07 16:14       ` Pearson, Robert B
@ 2021-06-08  1:39         ` Zhu Yanjun
  2021-06-08  2:01           ` Pearson, Robert B
  0 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2021-06-08  1:39 UTC (permalink / raw)
  To: Pearson, Robert B; +Cc: Leon Romanovsky, Jason Gunthorpe, RDMA mailing list

On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
>
>
> On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
> > On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
> >> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
> >>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> >>>> Currently the rdma_rxe driver attempts to protect atomic responder
> >>>> resources by taking a reference to the qp which is only freed when the
> >>>> resource is recycled for a new read or atomic operation. This means that
> >>>> in normal circumstances there is almost always an extra qp reference
> >>>> once an atomic operation has been executed which prevents cleaning up
> >>>> the qp and associated pd and cqs when the qp is destroyed.
> >>>>
> >>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> >>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> >>> Not sure if it is a good way to fix this problem by removing the call
> >>> to rxe_add_ref.
> >>> Because taking a reference to the qp is to protect atomic responder resources.
> >>>
> >>> Removing rxe_add_ref is to decrease the protection of the atomic
> >>> responder resources.
> >> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
> >>
> > I made tests with this commit. After this commit is applied, this
> > problem disappeared.
> You were testing MW when you saw this bug. Does that mean that now MW is
> working for you?

Your MW patches are huge. After these patches are applied, I found 2
problems in my test environment.
So IMO, can you send the test cases about MW to rdma-core? So we can
verify these MW patches with them.

In previous mails, you mentioned these MW test cases.

Thanks a lot.
Zhu Yanjun

> >
> > Zhu Yanjun
> >
> >> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-08  1:39         ` Zhu Yanjun
@ 2021-06-08  2:01           ` Pearson, Robert B
  2021-06-08  3:48             ` Zhu Yanjun
  0 siblings, 1 reply; 12+ messages in thread
From: Pearson, Robert B @ 2021-06-08  2:01 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Leon Romanovsky, Jason Gunthorpe, RDMA mailing list


On 6/7/2021 8:39 PM, Zhu Yanjun wrote:
> On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
>>
>> On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
>>> On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
>>>>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>>> Currently the rdma_rxe driver attempts to protect atomic responder
>>>>>> resources by taking a reference to the qp which is only freed when the
>>>>>> resource is recycled for a new read or atomic operation. This means that
>>>>>> in normal circumstances there is almost always an extra qp reference
>>>>>> once an atomic operation has been executed which prevents cleaning up
>>>>>> the qp and associated pd and cqs when the qp is destroyed.
>>>>>>
>>>>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
>>>>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
>>>>> Not sure if it is a good way to fix this problem by removing the call
>>>>> to rxe_add_ref.
>>>>> Because taking a reference to the qp is to protect atomic responder resources.
>>>>>
>>>>> Removing rxe_add_ref is to decrease the protection of the atomic
>>>>> responder resources.
>>>> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
>>>>
>>> I made tests with this commit. After this commit is applied, this
>>> problem disappeared.
>> You were testing MW when you saw this bug. Does that mean that now MW is
>> working for you?
> Your MW patches are huge. After these patches are applied, I found 2
> problems in my test environment.

The trace you showed looked like the pyverbs tests all passed and then 
there were leaked QP/PD/CQ. I also saw those. After fixing the QP 
reference count bug (not in MW) I did not see any errors from the 
pyverbs tests of MW. Or any other errors for that matter. What was the 
other problem? Was that the memory barrier one (also not in MW)?

Mostly I want to know if you currently see any errors in the kernel 
related to MW. The test case bug (in test_qpex.py) is a separate issue 
that is not a rxe bug at all.

Bob

> So IMO, can you send the test cases about MW to rdma-core? So we can
> verify these MW patches with them.
>
> In previous mails, you mentioned these MW test cases.
>
> Thanks a lot.
> Zhu Yanjun
>
>>> Zhu Yanjun
>>>
>>>> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-08  2:01           ` Pearson, Robert B
@ 2021-06-08  3:48             ` Zhu Yanjun
  2021-06-08  4:49               ` Pearson, Robert B
  0 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2021-06-08  3:48 UTC (permalink / raw)
  To: Pearson, Robert B; +Cc: Leon Romanovsky, Jason Gunthorpe, RDMA mailing list

On Tue, Jun 8, 2021 at 10:01 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
>
>
> On 6/7/2021 8:39 PM, Zhu Yanjun wrote:
> > On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
> >>
> >> On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
> >>> On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
> >>>> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
> >>>>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> >>>>>> Currently the rdma_rxe driver attempts to protect atomic responder
> >>>>>> resources by taking a reference to the qp which is only freed when the
> >>>>>> resource is recycled for a new read or atomic operation. This means that
> >>>>>> in normal circumstances there is almost always an extra qp reference
> >>>>>> once an atomic operation has been executed which prevents cleaning up
> >>>>>> the qp and associated pd and cqs when the qp is destroyed.
> >>>>>>
> >>>>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> >>>>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> >>>>> Not sure if it is a good way to fix this problem by removing the call
> >>>>> to rxe_add_ref.
> >>>>> Because taking a reference to the qp is to protect atomic responder resources.
> >>>>>
> >>>>> Removing rxe_add_ref is to decrease the protection of the atomic
> >>>>> responder resources.
> >>>> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
> >>>>
> >>> I made tests with this commit. After this commit is applied, this
> >>> problem disappeared.
> >> You were testing MW when you saw this bug. Does that mean that now MW is
> >> working for you?
> > Your MW patches are huge. After these patches are applied, I found 2
> > problems in my test environment.
>
> The trace you showed looked like the pyverbs tests all passed and then
> there were leaked QP/PD/CQ. I also saw those. After fixing the QP
> reference count bug (not in MW) I did not see any errors from the
> pyverbs tests of MW. Or any other errors for that matter. What was the
> other problem? Was that the memory barrier one (also not in MW)?
>
> Mostly I want to know if you currently see any errors in the kernel
> related to MW. The test case bug (in test_qpex.py) is a separate issue

The current test cases in rdma-core just confirm a regression in RXE.

Zhu Yanjun

> that is not a rxe bug at all.
>
> Bob
>
> > So IMO, can you send the test cases about MW to rdma-core? So we can
> > verify these MW patches with them.
> >
> > In previous mails, you mentioned these MW test cases.
> >
> > Thanks a lot.
> > Zhu Yanjun
> >
> >>> Zhu Yanjun
> >>>
> >>>> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-08  3:48             ` Zhu Yanjun
@ 2021-06-08  4:49               ` Pearson, Robert B
  0 siblings, 0 replies; 12+ messages in thread
From: Pearson, Robert B @ 2021-06-08  4:49 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Leon Romanovsky, Jason Gunthorpe, RDMA mailing list


On 6/7/2021 10:48 PM, Zhu Yanjun wrote:
> On Tue, Jun 8, 2021 at 10:01 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
>>
>> On 6/7/2021 8:39 PM, Zhu Yanjun wrote:
>>> On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@gmail.com> wrote:
>>>> On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
>>>>> On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
>>>>>>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>>>>> Currently the rdma_rxe driver attempts to protect atomic responder
>>>>>>>> resources by taking a reference to the qp which is only freed when the
>>>>>>>> resource is recycled for a new read or atomic operation. This means that
>>>>>>>> in normal circumstances there is almost always an extra qp reference
>>>>>>>> once an atomic operation has been executed which prevents cleaning up
>>>>>>>> the qp and associated pd and cqs when the qp is destroyed.
>>>>>>>>
>>>>>>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
>>>>>>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
>>>>>>> Not sure if it is a good way to fix this problem by removing the call
>>>>>>> to rxe_add_ref.
>>>>>>> Because taking a reference to the qp is to protect atomic responder resources.
>>>>>>>
>>>>>>> Removing rxe_add_ref is to decrease the protection of the atomic
>>>>>>> responder resources.
>>>>>> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
>>>>>>
>>>>> I made tests with this commit. After this commit is applied, this
>>>>> problem disappeared.
>>>> You were testing MW when you saw this bug. Does that mean that now MW is
>>>> working for you?
>>> Your MW patches are huge. After these patches are applied, I found 2
>>> problems in my test environment.
>> The trace you showed looked like the pyverbs tests all passed and then
>> there were leaked QP/PD/CQ. I also saw those. After fixing the QP
>> reference count bug (not in MW) I did not see any errors from the
>> pyverbs tests of MW. Or any other errors for that matter. What was the
>> other problem? Was that the memory barrier one (also not in MW)?
>>
>> Mostly I want to know if you currently see any errors in the kernel
>> related to MW. The test case bug (in test_qpex.py) is a separate issue
> The current test cases in rdma-core just confirm a regression in RXE.
>
> Zhu Yanjun

Which test cases are you referring to. Currently all test cases either 
pass or are skipped because they are not supported with one single 
exception. That test in test_qpex.py is *not* a regression. It used to 
skip until I added support for the extended MW bind operation to the 
user code today. It now fails because the test is actually wrong. It 
didn't set the access flags for the MR to support bind MW so the driver 
fails the WR with a bind MW error which is the correct behavior. The 
traditional QP WR API (ibv_post_send) exercises the same exact 
functionality of the driver and they all set the MR access correctly and 
pass. There are *no* actual errors being reported by the rdma-core tests.

Bob

>
>> that is not a rxe bug at all.
>>
>> Bob
>>
>>> So IMO, can you send the test cases about MW to rdma-core? So we can
>>> verify these MW patches with them.
>>>
>>> In previous mails, you mentioned these MW test cases.
>>>
>>> Thanks a lot.
>>> Zhu Yanjun
>>>
>>>>> Zhu Yanjun
>>>>>
>>>>>> Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops
  2021-06-04 23:05 [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops Bob Pearson
  2021-06-04 23:18 ` Bob Pearson
  2021-06-07  8:16 ` Zhu Yanjun
@ 2021-06-16 23:21 ` Jason Gunthorpe
  2 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2021-06-16 23:21 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Jun 04, 2021 at 06:05:59PM -0500, Bob Pearson wrote:
> Currently the rdma_rxe driver attempts to protect atomic responder
> resources by taking a reference to the qp which is only freed when the
> resource is recycled for a new read or atomic operation. This means that
> in normal circumstances there is almost always an extra qp reference
> once an atomic operation has been executed which prevents cleaning up
> the qp and associated pd and cqs when the qp is destroyed.
> 
> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> destroyed while a peer is retrying an atomic op it will cause the
> operation to fail which is acceptable.
> 
> Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
> Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_qp.c   | 1 -
>  drivers/infiniband/sw/rxe/rxe_resp.c | 2 --
>  2 files changed, 3 deletions(-)

Applied to for-next, thanks

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-06-16 23:21 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-04 23:05 [PATCH for-next] RDMA/rxe: Fix qp reference counting for atomic ops Bob Pearson
2021-06-04 23:18 ` Bob Pearson
2021-06-07  8:16 ` Zhu Yanjun
2021-06-07 11:03   ` Leon Romanovsky
2021-06-07 11:12     ` Zhu Yanjun
2021-06-07 16:14       ` Pearson, Robert B
2021-06-08  1:39         ` Zhu Yanjun
2021-06-08  2:01           ` Pearson, Robert B
2021-06-08  3:48             ` Zhu Yanjun
2021-06-08  4:49               ` Pearson, Robert B
2021-06-07 16:11   ` Pearson, Robert B
2021-06-16 23:21 ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.