Re: blktest/rxe almost working

From: Bob Pearson <rpearsonhpe@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>, Bart Van Assche <bvanassche@acm.org>
Cc: "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: blktest/rxe almost working
Date: Sun, 5 Sep 2021 13:02:45 -0500
Message-ID: <fcf6f57e-972b-f88e-84bf-d1618fd3e23e@gmail.com>
In-Reply-To: <20210904223056.GC2505917@nvidia.com>

On 9/4/21 5:30 PM, Jason Gunthorpe wrote:
> On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
>> On 9/3/21 3:18 PM, Bob Pearson wrote:
>>> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
>>>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>>>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>>>>> working for rxe but there is still one error. After adding MW
>>>>> support I added a test to local invalidate to check and see if the
>>>>> l/rkey matched the key actually contained in the MR/MW when local
>>>>> invalidate is called. This is failing for srp/002 with the key
>>>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>>>>> does in fact increment the rkey by one and also has code that posts
>>>>> a local invalidate. This was never checked before and is now failing
>>>>> to match. If I mask off the key portion in the test the whole test
>>>>> case passes so the other problems appear to have been fixed. If the
>>>>> increment and invalidate are out of sync this could result in the
>>>>> error. I suspect this may be a bug in srp. Worst case I can remove
>>>>> this test but I would rather not.
>>>>
>>>> I didn't check the spec, but since SRP works with HW devices I wonder
>>>> if invalidation is supposed to ignore the variant bits in the mkey?
>>>
>>> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
>>> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
>>> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
>>> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
>>> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
>>> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
>>> before the MR is turned on again.
>>
>> Hi Bob,
>>
>> If there would be any code in the SRP driver that is not compliant with the
>> IBTA specification then I can fix it.
>>
>> Regarding the invalidate work requests submitted by the ib_srp driver: these
>> are submitted before srp_fr_pool_put() is called. A new registration request
>> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
>> channel and there is one QP per RDMA channel. In other words,
>> (re)registration requests are submitted to the same QP as unregistration
>> requests after local invalidate requests. I think the IBTA does not allow
>> reordering of a local invalidate followed by a fast registration request.
> 
> Right
> 
> Jason
> 

srp_inv_rkey()
	wr = ...			builds local invalidate WR
	wr.send_flags = 0		i.e. not signaled
	ib_post_send()			posts the WR for delayed execution

srp_unmap_data()
	srp_inv_rkey()			schedules invalidate of each rkey in req
	srp_fr_pool_put()		puts each desc entry on free list

srp_map_finish_fr()
	...				misc checks not relevant
	desc = srp_fr_pool_get()	returns desc from free list
	rkey = ib_inc_rkey()		gets a new rkey one larger than the last one
	ib_update_fast_reg_key()	immediately changes mr->rkey to new value
	ib_map_mr_sg()			immediately updates buffer list in MR to new values
	wr = ...			set WR to REG_MR work request not signaled
	wr.key = new rkey
	ib_post_send()			wr is posted for delayed execution

So as soon as a WR has been posted to invalidate an MR, the code puts the MR on the free list.
Then, as soon as an MR is taken from the free list, its rkey and mappings are changed and a WR is
posted to 'register' the MR, which marks it as valid again. The register WR *also* resets the rkey,
which is redundant with the ib_update_fast_reg_key() call.

All the work except for setting the state to valid is done immediately, regardless of whether the
previous invalidate has completed, so it can finish before the MR is marked FREE. Because the WRs
are not signaled, no one checks the WCs for these operations unless there is an error.
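
The ordering described above can be replayed in a small user-space model. This is only a sketch:
every toy_* name is hypothetical, the queue is a plain array that runs WRs strictly in posting
order, and the key check in the invalidate path is an assumption standing in for the new rxe test,
not the actual rxe code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these types or functions exist in the kernel. */
enum toy_mr_state { TOY_MR_FREE, TOY_MR_VALID };
enum toy_wr_op { TOY_WR_LOCAL_INV, TOY_WR_REG_MR };

struct toy_mr {
	uint32_t rkey;
	enum toy_mr_state state;
};

struct toy_wr {
	enum toy_wr_op op;
	uint32_t key;
};

#define TOY_SQ_DEPTH 8

struct toy_sq {
	struct toy_wr wr[TOY_SQ_DEPTH];
	int head, tail;
};

/* Post a WR. Nothing executes until toy_sq_run_one() is called. */
static void toy_post_send(struct toy_sq *sq, enum toy_wr_op op, uint32_t key)
{
	assert(sq->tail < TOY_SQ_DEPTH);
	sq->wr[sq->tail].op = op;
	sq->wr[sq->tail].key = key;
	sq->tail++;
}

/* Bump the low 8 bits of the rkey and keep the upper 24 bits. */
static uint32_t toy_inc_rkey(uint32_t rkey)
{
	return (rkey & 0xffffff00u) | ((rkey + 1) & 0xffu);
}

/* Execute the oldest WR. Returns false on an empty queue or a key mismatch. */
static bool toy_sq_run_one(struct toy_sq *sq, struct toy_mr *mr)
{
	struct toy_wr *wr;

	if (sq->head == sq->tail)
		return false;
	wr = &sq->wr[sq->head++];

	switch (wr->op) {
	case TOY_WR_LOCAL_INV:
		/* Assumed strict check: the WR key must equal the MR's current rkey. */
		if (wr->key != mr->rkey)
			return false;
		mr->state = TOY_MR_FREE;
		return true;
	case TOY_WR_REG_MR:
		mr->rkey = wr->key;
		mr->state = TOY_MR_VALID;
		return true;
	}
	return false;
}

/*
 * Replay one unmap followed by one remap of the same MR and report
 * whether the invalidate passed the key check.
 */
bool toy_unmap_then_remap(bool drain_before_remap)
{
	struct toy_sq sq = { .head = 0, .tail = 0 };
	struct toy_mr mr = { .rkey = 0x1234, .state = TOY_MR_VALID };
	bool inv_ok = false;

	/* Unmap path: post the invalidate, then recycle the MR. */
	toy_post_send(&sq, TOY_WR_LOCAL_INV, mr.rkey);
	if (drain_before_remap)
		inv_ok = toy_sq_run_one(&sq, &mr);

	/* Map path: the rkey changes immediately, the REG_MR WR is deferred. */
	mr.rkey = toy_inc_rkey(mr.rkey);
	toy_post_send(&sq, TOY_WR_REG_MR, mr.rkey);

	if (!drain_before_remap)
		inv_ok = toy_sq_run_one(&sq, &mr);	/* invalidate runs late */
	toy_sq_run_one(&sq, &mr);			/* REG_MR */

	return inv_ok;
}
```

In this model toy_unmap_then_remap(false) fails the key check because the rkey was bumped while
the invalidate was still queued, and toy_unmap_then_remap(true) passes because the invalidate ran
before the MR was reused.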

The old code worked because the key part of the rkey wasn't checked for the invalidate. Because
the rkey is changed before the mappings, random stray old RDMA operations will fail because the
rkey does not match, not because the MR is not VALID. There is a theoretical risk here: the MR
could be accessed through the new rkey with the old mappings, the new ones, or a mixture, while
it is still VALID on the old mapping before the invalidate succeeds.
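
The difference between the old and new checks can be sketched as below. The sketch_* names are
hypothetical and the code is user-space only. It assumes the key portion is the low 8 bits of the
rkey, which is the part ib_inc_rkey() increments, and that the remaining 24 bits identify the MR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the low 8 bits are the key portion, the rest is the index. */
#define SKETCH_RKEY_KEY_MASK 0x000000ffu

/* Same arithmetic as the increment step: only the key portion changes. */
uint32_t sketch_inc_rkey(uint32_t rkey)
{
	return ((rkey + 1) & SKETCH_RKEY_KEY_MASK) |
	       (rkey & ~SKETCH_RKEY_KEY_MASK);
}

/* New-style check: index and key portion must both match. */
bool sketch_rkey_match_strict(uint32_t wr_rkey, uint32_t mr_rkey)
{
	return wr_rkey == mr_rkey;
}

/* Old-style behavior: the key portion is masked off before comparing. */
bool sketch_rkey_match_index_only(uint32_t wr_rkey, uint32_t mr_rkey)
{
	return (wr_rkey & ~SKETCH_RKEY_KEY_MASK) ==
	       (mr_rkey & ~SKETCH_RKEY_KEY_MASK);
}
```

With an old rkey of 0x12ff the incremented value is 0x1200, so the strict comparison of old
against new fails while the index-only comparison still matches.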

Many years ago when I first learned IB verbs, fast registration was actually done as a WR which
posted an IO operation to update the mappings. The new API changed all that but still leaves a
little bit of it in the WRs.

Bob