linux-rdma.vger.kernel.org archive mirror
* blktest/rxe almost working
@ 2021-09-02 21:41 Bob Pearson
  2021-09-02 23:38 ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2021-09-02 21:41 UTC
  To: Bart Van Assche, Jason Gunthorpe, linux-rdma

Now that for-next is on 5.14.0-rc6+, blktest srp/002 is very close to working for rxe, but there is
still one error. After adding MW support I added a check to local invalidate to see whether the
l/rkey matches the key actually contained in the MR/MW when local invalidate is called. This check
is failing for srp/002 with the key portion of the rkey off by one. Looking at ib_srp.c, I see code
that does in fact increment the rkey by one, and also code that posts a local invalidate. The key
was never checked before and is now failing to match. If I mask off the key portion in the check,
the whole test case passes, so the other problems appear to have been fixed. If the increment and
the invalidate are out of sync, this could result in the error. I suspect this may be a bug in srp.
Worst case I can remove this check, but I would rather not.

Bob
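
For readers following along, here is a minimal userspace sketch of the check described above. It assumes the rkey layout used by the kernel's ib_inc_rkey() helper (the low 8 bits are the key portion, the upper 24 bits the index). The two rkey_matches_*() helpers are hypothetical names for illustration only and are not taken from rxe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the low 8 bits of an rkey are the consumer-owned key portion. */
#define KEY_MASK 0x000000ffu

/* Same arithmetic as the kernel's ib_inc_rkey(): bump only the low 8 bits. */
static uint32_t inc_rkey(uint32_t rkey)
{
	return ((rkey + 1) & KEY_MASK) | (rkey & ~KEY_MASK);
}

/* Strict check (hypothetical): index and key portions must both match. */
static bool rkey_matches_exact(uint32_t mr_rkey, uint32_t wr_rkey)
{
	return mr_rkey == wr_rkey;
}

/* Relaxed check (hypothetical): compare only the index portion. */
static bool rkey_matches_index(uint32_t mr_rkey, uint32_t wr_rkey)
{
	return (mr_rkey & ~KEY_MASK) == (wr_rkey & ~KEY_MASK);
}
```

If the MR already holds an incremented key while the invalidate WR still carries the previous one, the exact comparison fails and the index-only comparison passes, which corresponds to the off-by-one symptom reported for srp/002.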


* Re: blktest/rxe almost working
  2021-09-02 21:41 blktest/rxe almost working Bob Pearson
@ 2021-09-02 23:38 ` Jason Gunthorpe
  2021-09-03 22:18   ` Bob Pearson
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-09-02 23:38 UTC
  To: Bob Pearson; +Cc: Bart Van Assche, linux-rdma

On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
> working for rxe but there is still one error. After adding MW
> support I added a test to local invalidate to check and see if the
> l/rkey matched the key actually contained in the MR/MW when local
> invalidate is called. This is failing for srp/002 with the key
> portion of the rkey off by one. Looking at ib_srp.c I see code that
> does in fact increment the rkey by one and also has code that posts
> a local invalidate. This was never checked before and is now failing
> to match. If I mask off the key portion in the test the whole test
> case passes so the other problems appear to have been fixed. If the
> increment and invalidate are out of sync this could result in the
> error. I suspect this may be a bug in srp. Worst case I can remove
> this test but I would rather not.

I didn't check the spec, but since SRP works with HW devices I wonder
if invalidation is supposed to ignore the variant bits in the mkey?

Jason


* Re: blktest/rxe almost working
  2021-09-02 23:38 ` Jason Gunthorpe
@ 2021-09-03 22:18   ` Bob Pearson
  2021-09-03 23:13     ` Bart Van Assche
  0 siblings, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2021-09-03 22:18 UTC
  To: Jason Gunthorpe, Bart Van Assche; +Cc: linux-rdma

On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>> working for rxe but there is still one error. After adding MW
>> support I added a test to local invalidate to check and see if the
>> l/rkey matched the key actually contained in the MR/MW when local
>> invalidate is called. This is failing for srp/002 with the key
>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>> does in fact increment the rkey by one and also has code that posts
>> a local invalidate. This was never checked before and is now failing
>> to match. If I mask off the key portion in the test the whole test
>> case passes so the other problems appear to have been fixed. If the
>> increment and invalidate are out of sync this could result in the
>> error. I suspect this may be a bug in srp. Worst case I can remove
>> this test but I would rather not.
> 
> I didn't check the spec, but since SRP works with HW devices I wonder
> if invalidation is supposed to ignore the variant bits in the mkey?
> 
> Jason
> 

I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
before the MR is turned on again.

Bob


* Re: blktest/rxe almost working
  2021-09-03 22:18   ` Bob Pearson
@ 2021-09-03 23:13     ` Bart Van Assche
  2021-09-04 22:30       ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2021-09-03 23:13 UTC
  To: Bob Pearson, Jason Gunthorpe; +Cc: linux-rdma

On 9/3/21 3:18 PM, Bob Pearson wrote:
> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>>> working for rxe but there is still one error. After adding MW
>>> support I added a test to local invalidate to check and see if the
>>> l/rkey matched the key actually contained in the MR/MW when local
>>> invalidate is called. This is failing for srp/002 with the key
>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>>> does in fact increment the rkey by one and also has code that posts
>>> a local invalidate. This was never checked before and is now failing
>>> to match. If I mask off the key portion in the test the whole test
>>> case passes so the other problems appear to have been fixed. If the
>>> increment and invalidate are out of sync this could result in the
>>> error. I suspect this may be a bug in srp. Worst case I can remove
>>> this test but I would rather not.
>>
>> I didn't check the spec, but since SRP works with HW devices I wonder
>> if invalidation is supposed to ignore the variant bits in the mkey?
> 
> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
> before the MR is turned on again.

Hi Bob,

If there would be any code in the SRP driver that is not compliant with 
the IBTA specification then I can fix it.

Regarding the invalidate work requests submitted by the ib_srp driver: 
these are submitted before srp_fr_pool_put() is called. A new 
registration request is submitted after srp_fr_pool_get() succeeds. 
There is one MR pool per RDMA channel and there is one QP per RDMA 
channel. In other words, (re)registration requests are submitted to the
same QP as the local invalidate (unregistration) requests, and after them.
I think the IBTA specification does not allow a local invalidate followed
by a fast registration request to be reordered.

Bart.
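
The ordering argument can be illustrated with a toy send queue. This is only a sketch under stated assumptions: work requests on one queue execute strictly in posting order, and registering an MR that is not free is treated as an error. All toy_* names are invented for the example and do not correspond to ib_srp or the verbs API.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_LOCAL_INV, TOY_REG_MR };
enum toy_state { TOY_FREE, TOY_VALID };

#define TOY_SQ_DEPTH 8

/* A single send queue: head is the next WR to execute, tail the next free slot. */
struct toy_sq {
	enum toy_op wr[TOY_SQ_DEPTH];
	size_t head, tail;
};

/* Post a WR; nothing is executed yet. Returns false if the queue is full. */
static bool toy_post_send(struct toy_sq *sq, enum toy_op op)
{
	if (sq->tail - sq->head == TOY_SQ_DEPTH)
		return false;
	sq->wr[sq->tail++ % TOY_SQ_DEPTH] = op;
	return true;
}

/*
 * Execute the oldest posted WR against the MR state.
 * Returns false if the queue is empty or the operation is not legal
 * in the current state (invalidate needs VALID, register needs FREE).
 */
static bool toy_exec_next(struct toy_sq *sq, enum toy_state *state)
{
	if (sq->head == sq->tail)
		return false;

	enum toy_op op = sq->wr[sq->head++ % TOY_SQ_DEPTH];

	if (op == TOY_LOCAL_INV) {
		if (*state != TOY_VALID)
			return false;
		*state = TOY_FREE;
	} else {
		if (*state != TOY_FREE)
			return false;
		*state = TOY_VALID;
	}
	return true;
}
```

Because both requests sit on the same queue, the invalidate always runs before the registration that was posted after it, so the MR moves VALID, FREE, VALID without the poster having to wait for a completion in between.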



* Re: blktest/rxe almost working
  2021-09-03 23:13     ` Bart Van Assche
@ 2021-09-04 22:30       ` Jason Gunthorpe
  2021-09-05 18:02         ` Bob Pearson
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-09-04 22:30 UTC
  To: Bart Van Assche; +Cc: Bob Pearson, linux-rdma

On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
> On 9/3/21 3:18 PM, Bob Pearson wrote:
> > On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
> > > On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
> > > > Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
> > > > working for rxe but there is still one error. After adding MW
> > > > support I added a test to local invalidate to check and see if the
> > > > l/rkey matched the key actually contained in the MR/MW when local
> > > > invalidate is called. This is failing for srp/002 with the key
> > > > portion of the rkey off by one. Looking at ib_srp.c I see code that
> > > > does in fact increment the rkey by one and also has code that posts
> > > > a local invalidate. This was never checked before and is now failing
> > > > to match. If I mask off the key portion in the test the whole test
> > > > case passes so the other problems appear to have been fixed. If the
> > > > increment and invalidate are out of sync this could result in the
> > > > error. I suspect this may be a bug in srp. Worst case I can remove
> > > > this test but I would rather not.
> > > 
> > > I didn't check the spec, but since SRP works with HW devices I wonder
> > > if invalidation is supposed to ignore the variant bits in the mkey?
> > 
> > I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
> > MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
> > that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
> > to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
> > don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
> > and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
> > before the MR is turned on again.
> 
> Hi Bob,
> 
> If there would be any code in the SRP driver that is not compliant with the
> IBTA specification then I can fix it.
> 
> Regarding the invalidate work requests submitted by the ib_srp driver: these
> are submitted before srp_fr_pool_put() is called. A new registration request
> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
> channel and there is one QP per RDMA channel. In other words,
> (re)registration requests are submitted to the same QP as the local
> invalidate (unregistration) requests, and after them. I think the IBTA
> specification does not allow a local invalidate followed by a fast
> registration request to be reordered.

Right

Jason


* Re: blktest/rxe almost working
  2021-09-04 22:30       ` Jason Gunthorpe
@ 2021-09-05 18:02         ` Bob Pearson
  2021-09-07 12:01           ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2021-09-05 18:02 UTC
  To: Jason Gunthorpe, Bart Van Assche; +Cc: linux-rdma

On 9/4/21 5:30 PM, Jason Gunthorpe wrote:
> On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
>> On 9/3/21 3:18 PM, Bob Pearson wrote:
>>> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
>>>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>>>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>>>>> working for rxe but there is still one error. After adding MW
>>>>> support I added a test to local invalidate to check and see if the
>>>>> l/rkey matched the key actually contained in the MR/MW when local
>>>>> invalidate is called. This is failing for srp/002 with the key
>>>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>>>>> does in fact increment the rkey by one and also has code that posts
>>>>> a local invalidate. This was never checked before and is now failing
>>>>> to match. If I mask off the key portion in the test the whole test
>>>>> case passes so the other problems appear to have been fixed. If the
>>>>> increment and invalidate are out of sync this could result in the
>>>>> error. I suspect this may be a bug in srp. Worst case I can remove
>>>>> this test but I would rather not.
>>>>
>>>> I didn't check the spec, but since SRP works with HW devices I wonder
>>>> if invalidation is supposed to ignore the variant bits in the mkey?
>>>
>>> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
>>> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
>>> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
>>> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
>>> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
>>> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
>>> before the MR is turned on again.
>>
>> Hi Bob,
>>
>> If there would be any code in the SRP driver that is not compliant with the
>> IBTA specification then I can fix it.
>>
>> Regarding the invalidate work requests submitted by the ib_srp driver: these
>> are submitted before srp_fr_pool_put() is called. A new registration request
>> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
>> channel and there is one QP per RDMA channel. In other words,
>> (re)registration requests are submitted to the same QP as the local
>> invalidate (unregistration) requests, and after them. I think the IBTA
>> specification does not allow a local invalidate followed by a fast
>> registration request to be reordered.
> 
> Right
> 
> Jason
> 

srp_inv_rkey()
	wr = ...			builds local invalidate WR
	wr.send_flags = 0		i.e. not signaled
	ib_post_send()			posts the WR for delayed execution

srp_unmap_data()
	srp_inv_rkey()			schedules invalidate of each rkey in req
	srp_fr_pool_put()		puts each desc entry on free list

srp_map_finish_fr()
	...				misc checks not relevant
	desc = srp_fr_pool_get()	returns desc from free list
	rkey = ib_inc_rkey()		gets a new rkey one larger than the last one
	ib_update_fast_reg_key()	immediately changes mr->rkey to new value
	ib_map_mr_sg()			immediately updates buffer list in MR to new values
	wr = ...			set WR to REG_MR work request not signaled
	wr.key = new rkey
	ib_post_send()			wr is posted for delayed execution

So as soon as the MR has had a WR posted to invalidate it the code goes ahead and adds it to the
free list and then as soon as a new MR is gotten from the free list the rkey and mappings are
changed and then a WR is posted to 'register' the MR which marks it as valid again. The register
WR *also* resets the rkey which is redundant with the ib_update_fast_reg_key() call.

All the work except for setting the state valid is done immediately regardless of the status of the
completion of the previous invalidate and can complete before the MR is marked FREE. Because the WR
is not signaled no one is checking the WC for these operations unless there is an error.

The old code worked because the key part of the rkey wasn't checked for the invalidate. Because the
rkey is changed before the mappings, stray old RDMA operations will fail because the rkey does not
match, not because the MR is not VALID. There is a theoretical risk here: while the MR is still VALID
on the old mapping, before the invalidate succeeds, it could be accessed through the new rkey with
either the new or old mappings or a mixture of the two.

Many years ago when I first learned IB verbs, fast registration was actually done as a WR which
posted an IO operation to update the mappings. The new API changed all that but still has a little
bit of it left in the WRs.

Bob
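
A small userspace model of the sequence traced above may make the mismatch easier to see. It is a sketch, not rxe or ib_srp code: it assumes the provider compares the key carried in the invalidate WR against the same rkey field that software updates immediately. struct toy_mr, exec_local_inv_shared() and replay_srp_sequence() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

enum mr_state { MR_FREE, MR_VALID };

struct toy_mr {
	uint32_t rkey;          /* software-visible key, standing in for mr->rkey */
	enum mr_state state;
};

/* Increment only the low 8 bits, as ib_inc_rkey() does. */
static uint32_t inc_rkey(uint32_t rkey)
{
	const uint32_t mask = 0x000000ffu;

	return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* Deferred execution of LOCAL_INV that compares against the shared rkey field. */
static bool exec_local_inv_shared(struct toy_mr *mr, uint32_t wr_rkey)
{
	if (mr->rkey != wr_rkey)
		return false;           /* key portion no longer matches */
	mr->state = MR_FREE;
	return true;
}

/* Replays the posting order from the trace; returns the invalidate's result. */
static bool replay_srp_sequence(struct toy_mr *mr)
{
	/* srp_inv_rkey(): the WR is built with the current key and posted. */
	uint32_t inv_key = mr->rkey;

	/* srp_map_finish_fr(): the key is updated immediately on reuse. */
	mr->rkey = inc_rkey(mr->rkey);

	/* Only now does the provider get around to executing the invalidate. */
	return exec_local_inv_shared(mr, inv_key);
}
```

In this model the invalidate is rejected and the MR stays VALID, with the stored key exactly one larger than the key in the WR.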


* Re: blktest/rxe almost working
  2021-09-05 18:02         ` Bob Pearson
@ 2021-09-07 12:01           ` Jason Gunthorpe
  2021-09-07 16:35             ` Bob Pearson
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-09-07 12:01 UTC
  To: Bob Pearson; +Cc: Bart Van Assche, linux-rdma

On Sun, Sep 05, 2021 at 01:02:45PM -0500, Bob Pearson wrote:
> On 9/4/21 5:30 PM, Jason Gunthorpe wrote:
> > On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
> >> On 9/3/21 3:18 PM, Bob Pearson wrote:
> >>> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
> >>>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
> >>>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
> >>>>> working for rxe but there is still one error. After adding MW
> >>>>> support I added a test to local invalidate to check and see if the
> >>>>> l/rkey matched the key actually contained in the MR/MW when local
> >>>>> invalidate is called. This is failing for srp/002 with the key
> >>>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
> >>>>> does in fact increment the rkey by one and also has code that posts
> >>>>> a local invalidate. This was never checked before and is now failing
> >>>>> to match. If I mask off the key portion in the test the whole test
> >>>>> case passes so the other problems appear to have been fixed. If the
> >>>>> increment and invalidate are out of sync this could result in the
> >>>>> error. I suspect this may be a bug in srp. Worst case I can remove
> >>>>> this test but I would rather not.
> >>>>
> >>>> I didn't check the spec, but since SRP works with HW devices I wonder
> >>>> if invalidation is supposed to ignore the variant bits in the mkey?
> >>>
> >>> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
> >>> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
> >>> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
> >>> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
> >>> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
> >>> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
> >>> before the MR is turned on again.
> >>
> >> Hi Bob,
> >>
> >> If there would be any code in the SRP driver that is not compliant with the
> >> IBTA specification then I can fix it.
> >>
> >> Regarding the invalidate work requests submitted by the ib_srp driver: these
> >> are submitted before srp_fr_pool_put() is called. A new registration request
> >> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
> >> channel and there is one QP per RDMA channel. In other words,
> >> (re)registration requests are submitted to the same QP as the local
> >> invalidate (unregistration) requests, and after them. I think the IBTA
> >> specification does not allow a local invalidate followed by a fast
> >> registration request to be reordered.
> > 
> > Right
> > 
> > Jason
> > 
> 
> srp_inv_rkey()
> 	wr = ...			builds local invalidate WR
> 	wr.send_flags = 0		i.e. not signaled
> 	ib_post_send()			posts the WR for delayed execution
> 
> srp_unmap_data()
> 	srp_inv_rkey()			schedules invalidate of each rkey in req
> 	srp_fr_pool_put()		puts each desc entry on free list
> 
> srp_map_finish_fr()
> 	...				misc checks not relevant
> 	desc = srp_fr_pool_get()	returns desc from free list
> 	rkey = ib_inc_rkey()		gets a new rkey one larger than the last one
> 	ib_update_fast_reg_key()	immediately changes mr->rkey to new value
> 	ib_map_mr_sg()			immediately updates buffer list in MR to new values
> 	wr = ...			set WR to REG_MR work request not signaled
> 	wr.key = new rkey
> 	ib_post_send()			wr is posted for delayed execution
> 
> So as soon as the MR has had a WR posted to invalidate it the code goes ahead and adds it to the
> free list and then as soon as a new MR is gotten from the free list the rkey and mappings are
> changed and then a WR is posted to 'register' the MR which marks it as valid again. The register
> WR *also* resets the rkey which is redundant with the ib_update_fast_reg_key() call.
> 
> All the work except for setting the state valid is done immediately regardless of the status of the
> completion of the previous invalidate and can complete before the MR is marked FREE. Because the WR
> is not signaled no one is checking the WC for these operations unless there is an error.

"HW" is not supposed to look at mr->rkey.

"HW" has a hidden cache of mr->rkey which is manipulated through
WQEs, and is then synchronous with the WQE stream as Bart said.

So it sounds like the problem is rxe is crossing the HW and SW layers
and checking the mr->rkey from HW logic instead of holding a 2nd HW
specific value for HW to use.

Jason
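
One way to picture the separation described here is an MR that carries two keys: the one software edits freely and a device-private copy that only executed WQEs may change. The following is a rough sketch under that assumption. The hw_rkey and hw_state fields and exec_wqe() are hypothetical and do not reflect how rxe or any real device lays out its state.

```c
#include <stdbool.h>
#include <stdint.h>

enum mr_state { MR_FREE, MR_VALID };
enum wqe_op { WQE_LOCAL_INV, WQE_REG_MR };

struct toy_wqe {
	enum wqe_op op;
	uint32_t key;           /* key captured when the WR was built */
};

struct toy_mr {
	uint32_t rkey;          /* software copy; the ULP may change it at any time */
	uint32_t hw_rkey;       /* device-private copy; changed only by executed WQEs */
	enum mr_state hw_state; /* device-private validity */
};

/* Execute one WQE against device-private state only; mr->rkey is never read. */
static bool exec_wqe(struct toy_mr *mr, const struct toy_wqe *wqe)
{
	switch (wqe->op) {
	case WQE_LOCAL_INV:
		if (mr->hw_state != MR_VALID || mr->hw_rkey != wqe->key)
			return false;
		mr->hw_state = MR_FREE;
		return true;
	case WQE_REG_MR:
		if (mr->hw_state != MR_FREE)
			return false;
		mr->hw_rkey = wqe->key; /* the new key takes effect here */
		mr->hw_state = MR_VALID;
		return true;
	}
	return false;
}
```

With this split, software can bump mr->rkey between posting the invalidate and posting the registration, and the invalidate still matches because it is compared against the key installed by the previous registration.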


* Re: blktest/rxe almost working
  2021-09-07 12:01           ` Jason Gunthorpe
@ 2021-09-07 16:35             ` Bob Pearson
  2021-09-07 16:39               ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2021-09-07 16:35 UTC
  To: Jason Gunthorpe; +Cc: Bart Van Assche, linux-rdma

On 9/7/21 7:01 AM, Jason Gunthorpe wrote:
> On Sun, Sep 05, 2021 at 01:02:45PM -0500, Bob Pearson wrote:
>> On 9/4/21 5:30 PM, Jason Gunthorpe wrote:
>>> On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
>>>> On 9/3/21 3:18 PM, Bob Pearson wrote:
>>>>> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>>>>>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>>>>>>> working for rxe but there is still one error. After adding MW
>>>>>>> support I added a test to local invalidate to check and see if the
>>>>>>> l/rkey matched the key actually contained in the MR/MW when local
>>>>>>> invalidate is called. This is failing for srp/002 with the key
>>>>>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>>>>>>> does in fact increment the rkey by one and also has code that posts
>>>>>>> a local invalidate. This was never checked before and is now failing
>>>>>>> to match. If I mask off the key portion in the test the whole test
>>>>>>> case passes so the other problems appear to have been fixed. If the
>>>>>>> increment and invalidate are out of sync this could result in the
>>>>>>> error. I suspect this may be a bug in srp. Worst case I can remove
>>>>>>> this test but I would rather not.
>>>>>>
>>>>>> I didn't check the spec, but since SRP works with HW devices I wonder
>>>>>> if invalidation is supposed to ignore the variant bits in the mkey?
>>>>>
>>>>> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
>>>>> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
>>>>> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
>>>>> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
>>>>> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
>>>>> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
>>>>> before the MR is turned on again.
>>>>
>>>> Hi Bob,
>>>>
>>>> If there would be any code in the SRP driver that is not compliant with the
>>>> IBTA specification then I can fix it.
>>>>
>>>> Regarding the invalidate work requests submitted by the ib_srp driver: these
>>>> are submitted before srp_fr_pool_put() is called. A new registration request
>>>> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
>>>> channel and there is one QP per RDMA channel. In other words,
>>>> (re)registration requests are submitted to the same QP as the local
>>>> invalidate (unregistration) requests, and after them. I think the IBTA
>>>> specification does not allow a local invalidate followed by a fast
>>>> registration request to be reordered.
>>>
>>> Right
>>>
>>> Jason
>>>
>>
>> srp_inv_rkey()
>> 	wr = ...			builds local invalidate WR
>> 	wr.send_flags = 0		i.e. not signaled
>> 	ib_post_send()			posts the WR for delayed execution
>>
>> srp_unmap_data()
>> 	srp_inv_rkey()			schedules invalidate of each rkey in req
>> 	srp_fr_pool_put()		puts each desc entry on free list
>>
>> srp_map_finish_fr()
>> 	...				misc checks not relevant
>> 	desc = srp_fr_pool_get()	returns desc from free list
>> 	rkey = ib_inc_rkey()		gets a new rkey one larger than the last one
>> 	ib_update_fast_reg_key()	immediately changes mr->rkey to new value
>> 	ib_map_mr_sg()			immediately updates buffer list in MR to new values
>> 	wr = ...			set WR to REG_MR work request not signaled
>> 	wr.key = new rkey
>> 	ib_post_send()			wr is posted for delayed execution
>>
>> So as soon as the MR has had a WR posted to invalidate it the code goes ahead and adds it to the
>> free list and then as soon as a new MR is gotten from the free list the rkey and mappings are
>> changed and then a WR is posted to 'register' the MR which marks it as valid again. The register
>> WR *also* resets the rkey which is redundant with the ib_update_fast_reg_key() call.
>>
>> All the work except for setting the state valid is done immediately regardless of the status of the
>> completion of the previous invalidate and can complete before the MR is marked FREE. Because the WR
>> is not signaled no one is checking the WC for these operations unless there is an error.
> 
> "HW" is not supposed to look at mr->rkey.
> 
> "HW" has a hidden cache of mr->rkey which is manipulated through
> WQEs, and is then synchronous with the WQE stream as Bart said.
> 
> So it sounds like the problem is rxe is crossing the HW and SW layers
> and checking the mr->rkey from HW logic instead of holding a 2nd HW
> specific value for HW to use.
> 
> Jason
> 

Interesting. But if that is the case the bigger problem is the ib_map_mr_sg() call which updates the
mapping. rxe definitely does look at the mr->rkey value but we could fix that. It also looks at the
mapping which is updated by ib_map_mr_sg(). My impression is that HW also uses this mapping or does
HW also copy all the FMRs into SRAM? Because the srp driver does not close the loop on the invalidate
by looking at the CQE, it exposes the MR, whose mappings are already changing to the new values,
through either the old or the new rkey, depending on whether the rkey is cached.

There is a suggestive comment in ib_verbs.h:

        /*
         * Kernel users should universally support relaxed ordering (RO), as
         * they are designed to read data only after observing the CQE and use
         * the DMA API correctly.
         *
         * Some drivers implicitly enable RO if platform supports it.
         */
        int (*map_mr_sg)(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
                         unsigned int *sg_offset);

There seems to be an assumption that users will be looking at CQE.

Bob

* Re: blktest/rxe almost working
  2021-09-07 16:35             ` Bob Pearson
@ 2021-09-07 16:39               ` Jason Gunthorpe
  2021-09-07 16:47                 ` Bob Pearson
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-09-07 16:39 UTC
  To: Bob Pearson; +Cc: Bart Van Assche, linux-rdma

On Tue, Sep 07, 2021 at 11:35:17AM -0500, Bob Pearson wrote:

> Interesting. But if that is the case the bigger problem is the ib_map_mr_sg() call which updates the
> mapping. rxe definitely does look at the mr->rkey value but we could fix that. It also looks at the
> mapping which is updated by ib_map_mr_sg(). My impression is that HW also uses this mapping or does
> HW also copy all the FMRs into SRAM?

Yes, real HW has a copy of the DMA list. The sg in the mr struct is
for CPU use only.

It is not OK to use the CPU SG list inside the MR for DMA by HW, it
has to be synchronized with the WR.

> There seems to be an assumption that users will be looking at CQE.

Yes, the kernel has to be driven by the CQE: not only for data transfer,
but also because the DMA unmap of the SGL cannot happen until after the
invalidation CQE is observed.

I.e., the CPU should have two DMA lists active during the invalidation
cycle.

Jason


* Re: blktest/rxe almost working
  2021-09-07 16:39               ` Jason Gunthorpe
@ 2021-09-07 16:47                 ` Bob Pearson
  0 siblings, 0 replies; 10+ messages in thread
From: Bob Pearson @ 2021-09-07 16:47 UTC
  To: Jason Gunthorpe; +Cc: Bart Van Assche, linux-rdma

On 9/7/21 11:39 AM, Jason Gunthorpe wrote:
> On Tue, Sep 07, 2021 at 11:35:17AM -0500, Bob Pearson wrote:
> 
>> Interesting. But if that is the case the bigger problem is the ib_map_mr_sg() call which updates the
>> mapping. rxe definitely does look at the mr->rkey value but we could fix that. It also looks at the
>> mapping which is updated by ib_map_mr_sg(). My impression is that HW also uses this mapping or does
>> HW also copy all the FMRs into SRAM?
> 
> Yes, real HW has a copy of the DMA list. The sg in the mr struct is
> for CPU use only.
> 
> It is not OK to use the CPU SG list inside the MR for DMA by HW, it
> has to be synchronized with the WR.
> 
>> There seems to be an assumption that users will be looking at CQE.
> 
> Yes, the kernel has to be driven by the CQE: not only for data transfer,
> but also because the DMA unmap of the SGL cannot happen until after the
> invalidation CQE is observed.
> 
> I.e., the CPU should have two DMA lists active during the invalidation
> cycle.
> 
> Jason
> 

OK. Not 100% sure what that implies for SRP. SRP does *not* look at the CQE for invalidate and register
WQEs. I can fix the rkey and DMA list semantics, making a copy of the list which is installed by the
register WQE.
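
The list-copy idea mentioned here could look roughly like the following userspace sketch. It assumes a fixed-size page list and a two-slot design: one list filled immediately by the mapping call, and one used by the datapath that is replaced only when the register WQE executes. All toy_* names and TOY_MAX_PAGES are invented for illustration and are not the eventual rxe change.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PAGES 4

struct toy_page_list {
	size_t n;
	uint64_t addr[TOY_MAX_PAGES];
};

struct toy_mr {
	struct toy_page_list pending;   /* filled immediately by the mapping call */
	struct toy_page_list active;    /* what the datapath actually uses */
};

/* Stand-in for the map_mr_sg step: touches only the pending list. */
static int toy_map_mr(struct toy_mr *mr, const uint64_t *addrs, size_t n)
{
	if (n > TOY_MAX_PAGES)
		return -1;
	memcpy(mr->pending.addr, addrs, n * sizeof(*addrs));
	mr->pending.n = n;
	return (int)n;
}

/* Stand-in for executing the REG_MR WQE: install the pending list. */
static void toy_exec_reg_mr(struct toy_mr *mr)
{
	mr->active = mr->pending;
}

/* Datapath lookup: sees only the installed list; returns 0 if out of range. */
static uint64_t toy_lookup(const struct toy_mr *mr, size_t i)
{
	return i < mr->active.n ? mr->active.addr[i] : 0;
}
```

Until the register WQE runs, lookups keep resolving through the previously installed list, so a remap posted early does not change what in-flight accesses see.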
