Re: blktest/rxe almost working

From: Bob Pearson <rpearsonhpe@gmail.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Bart Van Assche <bvanassche@acm.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: blktest/rxe almost working
Date: Tue, 7 Sep 2021 11:35:17 -0500
Message-ID: <9e6783d0-554c-17de-c72f-fae766099480@gmail.com>
In-Reply-To: <20210907120156.GV1200268@ziepe.ca>

On 9/7/21 7:01 AM, Jason Gunthorpe wrote:
> On Sun, Sep 05, 2021 at 01:02:45PM -0500, Bob Pearson wrote:
>> On 9/4/21 5:30 PM, Jason Gunthorpe wrote:
>>> On Fri, Sep 03, 2021 at 04:13:22PM -0700, Bart Van Assche wrote:
>>>> On 9/3/21 3:18 PM, Bob Pearson wrote:
>>>>> On 9/2/21 6:38 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Sep 02, 2021 at 04:41:15PM -0500, Bob Pearson wrote:
>>>>>>> Now that for-next is on 5.14.0-rc6+ blktest srp/002 is very close to
>>>>>>> working for rxe but there is still one error. After adding MW
>>>>>>> support I added a test to local invalidate to check and see if the
>>>>>>> l/rkey matched the key actually contained in the MR/MW when local
>>>>>>> invalidate is called. This is failing for srp/002 with the key
>>>>>>> portion of the rkey off by one. Looking at ib_srp.c I see code that
>>>>>>> does in fact increment the rkey by one and also has code that posts
>>>>>>> a local invalidate. This was never checked before and is now failing
>>>>>>> to match. If I mask off the key portion in the test the whole test
>>>>>>> case passes so the other problems appear to have been fixed. If the
>>>>>>> increment and invalidate are out of sync this could result in the
>>>>>>> error. I suspect this may be a bug in srp. Worst case I can remove
>>>>>>> this test but I would rather not.
>>>>>>
>>>>>> I didn't check the spec, but since SRP works with HW devices I wonder
>>>>>> if invalidation is supposed to ignore the variant bits in the mkey?
>>>>>
>>>>> I am a little worried. srp is pretty complex but roughly it looks like it maintains a pool of
>>>>> MRs which it recycles. Each time it reuses the MR it increments the key portion of the rkey. Before
>>>>> that it uses local invalidate WRs to invalidate the MRs presumably to prevent stray accesses
>>>>> to the old version of the MR from e.g. replicated packets. It posts these WRs to a send queue but I
>>>>> don't see where it closes the loop by waiting for a WC so there may be a race between the invalidate
>>>>> and the subsequent map_sg call. The invalidate marks the MR as not usable so this must all happen
>>>>> before the MR is turned on again.
>>>>
>>>> Hi Bob,
>>>>
>>>> If there would be any code in the SRP driver that is not compliant with the
>>>> IBTA specification then I can fix it.
>>>>
>>>> Regarding the invalidate work requests submitted by the ib_srp driver: these
>>>> are submitted before srp_fr_pool_put() is called. A new registration request
>>>> is submitted after srp_fr_pool_get() succeeds. There is one MR pool per RDMA
>>>> channel and there is one QP per RDMA channel. In other words,
>>>> (re)registration requests are submitted to the same QP as unregistration
>>>> requests after local invalidate requests. I think the IBTA does not
>>>> allow reordering a local invalidate followed by a fast registration request.
>>>
>>> Right
>>>
>>> Jason
>>>
>>
>> srp_inv_rkey()
>> 	wr = ...			builds local invalidate WR
>> 	wr.send_flags = 0		i.e. not signaled
>> 	ib_post_send()			posts the WR for delayed execution
>>
>> srp_unmap_data()
>> 	srp_inv_rkey()			schedules invalidate of each rkey in req
>> 	srp_fr_pool_put()		puts each desc entry on free list
>>
>> srp_map_finish_fr()
>> 	...				misc checks not relevant
>> 	desc = srp_fr_pool_get()	returns desc from free list
>> 	rkey = ib_inc_rkey()		gets a new rkey one larger than the last one
>> 	ib_update_fast_reg_key()	immediately changes mr->rkey to new value
>> 	ib_map_mr_sg()			immediately updates buffer list in MR to new values
>> 	wr = ...			set WR to REG_MR work request not signaled
>> 	wr.key = new rkey
>> 	ib_post_send()			wr is posted for delayed execution
>>
>> So as soon as the MR has had a WR posted to invalidate it the code goes ahead and adds it to the
>> free list and then as soon as a new MR is gotten from the free list the rkey and mappings are
>> changed and then a WR is posted to 'register' the MR which marks it as valid again. The register
>> WR *also* resets the rkey which is redundant with the ib_update_fast_reg_key() call.
>>
>> All the work except for setting the state valid is done immediately regardless of the status of the
>> completion of the previous invalidate and can complete before the MR is marked FREE. Because the WR
>> is not signaled no one is checking the WC for these operations unless there is an error.
> 
> "HW" is not supposed to look at mr->rkey.
> 
> "HW" has a hidden cache of mr->rkey which is manipulated through
> WQEs, and is then synchronous with the WQE stream as Bart said.
> 
> So it sounds like the problem is rxe is crossing the HW and SW layers
> and checking the mr->rkey from HW logic instead of holding a 2nd HW
> specific value for HW to use.
> 
> Jason
> 

Interesting. But if that is the case, the bigger problem is the ib_map_mr_sg() call, which updates the
mapping. rxe definitely does look at the mr->rkey value, but we could fix that. It also looks at the
mapping, which is updated by ib_map_mr_sg(). My impression is that HW also uses this mapping, or does
HW also copy all the FMRs into SRAM? Because the srp driver does not close the loop on the invalidate
by looking at the CQE, it exposes the MR with its mapping already changed to the new values, through
either the old or the new rkey depending on whether you cache the rkey.
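
To make the two rkey models concrete, here is a minimal userspace sketch. It is not rxe or ib_srp code: struct toy_mr, toy_exec() and toy_run() are hypothetical names, and the starting key 0x1207 is arbitrary. Only inc_rkey() mirrors the arithmetic of ib_inc_rkey(), which bumps the low 8 bits. The sketch replays the srp sequence (post invalidate, bump the key at once, post REG_MR, then let the queue drain in order) and checks the invalidate either against the live mr->rkey or against a shadow copy that only WQEs change.

```c
#include <stdint.h>

/* Same arithmetic as ib_inc_rkey(): only the low 8 "key" bits change. */
static uint32_t inc_rkey(uint32_t rkey)
{
	const uint32_t mask = 0x000000ff;

	return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* Hypothetical toy MR: one key owned by SW, one shadow key owned by "HW". */
struct toy_mr {
	uint32_t sw_rkey;	/* changed immediately, like mr->rkey */
	uint32_t hw_rkey;	/* assumed shadow copy, changed only by WQEs */
	int hw_valid;
};

enum toy_op { TOY_LOCAL_INV, TOY_REG_MR };

struct toy_wqe {
	enum toy_op op;
	uint32_t key;
};

/* Execute one WQE; use_shadow selects which key the invalidate is checked against. */
static int toy_exec(struct toy_mr *mr, const struct toy_wqe *wqe, int use_shadow)
{
	uint32_t cur = use_shadow ? mr->hw_rkey : mr->sw_rkey;

	switch (wqe->op) {
	case TOY_LOCAL_INV:
		if (wqe->key != cur)
			return -1;	/* key mismatch, the failure seen in srp/002 */
		mr->hw_valid = 0;
		return 0;
	case TOY_REG_MR:
		mr->hw_rkey = wqe->key;	/* shadow key follows the WQE stream */
		mr->hw_valid = 1;
		return 0;
	}
	return -1;
}

/* Replay the srp sequence; returns 0 if every WQE executes without error. */
static int toy_run(int use_shadow)
{
	struct toy_mr mr = { .sw_rkey = 0x1207, .hw_rkey = 0x1207, .hw_valid = 1 };
	struct toy_wqe sq[2];
	int i;

	/* unmap: invalidate of the old key is posted but not yet executed */
	sq[0] = (struct toy_wqe){ TOY_LOCAL_INV, mr.sw_rkey };

	/* map: the key is bumped at once, then REG_MR is posted */
	mr.sw_rkey = inc_rkey(mr.sw_rkey);
	sq[1] = (struct toy_wqe){ TOY_REG_MR, mr.sw_rkey };

	/* the send queue now drains in posting order */
	for (i = 0; i < 2; i++)
		if (toy_exec(&mr, &sq[i], use_shadow))
			return -1;
	return 0;
}
```

In this model toy_run(0) fails on the invalidate because the live key is already one ahead, while toy_run(1) succeeds because the shadow key only moves when the REG_MR WQE executes. It says nothing about the mapping, which is the separate concern raised above.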

There is a suggestive comment in ib_verbs.h
        /*
         * Kernel users should universally support relaxed ordering (RO), as
         * they are designed to read data only after observing the CQE and use
         * the DMA API correctly.
         *
         * Some drivers implicitly enable RO if platform supports it.
         */
        int (*map_mr_sg)(struct ib_mr *mr, struct scatterlist *sg, int sg_nents,
                         unsigned int *sg_offset);

There seems to be an assumption that users will be looking at the CQE.
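
A rough userspace sketch of the mapping concern, under the assumption (the open question above) that the device reads the same mapping that ib_map_mr_sg() rewrites rather than a private copy. All names here (struct toy_fr_mr, toy_remote_access(), toy_stale_access(), the TOY_* addresses) are hypothetical, and the device is modelled as caching the old rkey until the invalidate executes.

```c
#include <stdint.h>

#define TOY_OLD_BUF 0x1000u	/* arbitrary stand-ins for buffer addresses */
#define TOY_NEW_BUF 0x2000u

/* Hypothetical toy MR as seen by the device. */
struct toy_fr_mr {
	uint32_t rkey;	/* key the device currently honours */
	int valid;	/* cleared when the local invalidate executes */
	uint64_t addr;	/* mapping, assumed shared between SW and device */
};

/* Returns the address a remote access would touch, or 0 if it is rejected. */
static uint64_t toy_remote_access(const struct toy_fr_mr *mr, uint32_t rkey)
{
	if (!mr->valid || rkey != mr->rkey)
		return 0;
	return mr->addr;
}

/*
 * A late packet carrying the old rkey arrives after the MR has been
 * remapped. wait_for_cqe selects whether the invalidate was allowed to
 * execute (completion observed) before the remap.
 */
static uint64_t toy_stale_access(int wait_for_cqe)
{
	struct toy_fr_mr mr = { .rkey = 0x1207, .valid = 1, .addr = TOY_OLD_BUF };
	const uint32_t old_rkey = mr.rkey;

	/* the local invalidate has been posted; it has executed only if we waited */
	if (wait_for_cqe)
		mr.valid = 0;

	/* remap takes effect immediately, as ib_map_mr_sg() does */
	mr.addr = TOY_NEW_BUF;

	/* the stray access with the old key lands here */
	return toy_remote_access(&mr, old_rkey);
}
```

In this model toy_stale_access(0) returns TOY_NEW_BUF, so the stray access reaches the new buffer through the old key, while toy_stale_access(1) returns 0 because the MR is already invalid when the mapping changes. Whether real HW behaves like this depends on whether it keeps its own copy of the mapping.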

Bob