Re: [PATCH for-next v11 10/13] RDMA/rxe: Stop lookup of partially built objects

From: Bob Pearson <rpearsonhpe@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: zyjzyj2000@gmail.com, linux-rdma@vger.kernel.org
Subject: Re: [PATCH for-next v11 10/13] RDMA/rxe: Stop lookup of partially built objects
Date: Tue, 15 Mar 2022 22:55:01 -0500	[thread overview]
Message-ID: <8aa69cff-2871-3015-fd2d-7279d8f3ddac@gmail.com> (raw)
In-Reply-To: <20220316001655.GV11336@nvidia.com>

On 3/15/22 19:16, Jason Gunthorpe wrote:
> On Thu, Mar 03, 2022 at 06:08:06PM -0600, Bob Pearson wrote:
>> Currently the rdma_rxe driver has a security weakness due to giving
>> objects which are partially initialized indices allowing external
>> actors to gain access to them by sending packets which refer to
>> their index (e.g. qpn, rkey, etc) causing unpredictable results.
>>
>> This patch adds two new APIs rxe_show(obj) and rxe_hide(obj) which
>> enable or disable looking up pool objects from indices using
>> rxe_pool_get_index(). By default objects are disabled. These APIs
>> are used to enable looking up objects which have indices:
>> AH, SRQ, QP, MR, and MW. They are added in create verbs after the
>> objects are fully initialized and as soon as possible in destroy
>> verbs.
> 
> In other parts of rdma we use the word 'finalize' where you used show
> 
> So rxe_pool_finalize() or something
> 
> I'm not clear on what hide is supposed to be for, if the object is
> being destroyed why do we need a period when it is NULL in the xarray
> before just erasing it?
The problem I am trying to solve is that when a destroy verb is called I want
to stop looking up rdma objects from their numbers (rkey, qpn, etc.) which
happens when a new packet arrives that refers to the object. So we start the
object destroy flow but we may hit a long delay if there are already
references taken on the object. In the next patch we are going to add a complete
wait_for_completion so that we won't return to rdma_core until all the refs
are dropped. While we are waiting for some past event to complete and drop its
reference new packets may arrive and take a reference on the object while looking it
up. Potentially this could happen many times. I just want to stop accepting any
new packets as soon as the destroy verb gets called. Meanwhile we will allow
past packets to complete. That is what hide() does.

Show() deals with the opposite problem. The way the object library worked as soon as
the object was created or 'added to the pool' it becomes searchable. But some of the
verbs have a lot of code to execute after the object gets its number assigned so by
setting the link to NULL in the xa_alloc call we get the index assigned but the
object is not searchable. show turns it on at the end for the create verb call after
all the init code is done.

There are likely better names but these were the ones that came to mind.

> 
>> @@ -221,8 +221,12 @@ static void rxe_elem_release(struct kref *kref)
>>  {
>>  	struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>  	struct rxe_pool *pool = elem->pool;
>> +	struct xarray *xa = &pool->xa;
>> +	unsigned long flags;
>>  
>> -	xa_erase(&pool->xa, elem->index);
>> +	xa_lock_irqsave(xa, flags);
>> +	__xa_erase(&pool->xa, elem->index);
>> +	xa_unlock_irqrestore(xa, flags);
> 
> I guess it has to do with this, but why have the xa_erase in the kref
> release at all?
> 
>>  	if (pool->cleanup)
>>  		pool->cleanup(elem);
>> @@ -242,3 +246,33 @@ int __rxe_put(struct rxe_pool_elem *elem)
>>  {
>>  	return kref_put(&elem->ref_cnt, rxe_elem_release);
>>  }
>> +
>> +int __rxe_show(struct rxe_pool_elem *elem)
>> +{
>> +	struct xarray *xa = &elem->pool->xa;
>> +	unsigned long flags;
>> +	void *ret;
>> +
>> +	xa_lock_irqsave(xa, flags);
>> +	ret = __xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
>> +	xa_unlock_irqrestore(xa, flags);
>> +	if (IS_ERR(ret))
>> +		return PTR_ERR(ret);
> 
> This can't fail due to the xa memory already being allocated. You can
> just WARN_ON here and 'finalize' should not return an error code.
> 
> If you want to be fancy this checks for other mistakes too:
> 
>    tmp = xa_cmpxchg((&elem->pool->xa, elem->index, XA_ZERO_ENTRY,  elem, 0)
>    WARN_ON(tmp != NULL);
> 
>> +int __rxe_hide(struct rxe_pool_elem *elem)
>> +{
>> +	struct xarray *xa = &elem->pool->xa;
>> +	unsigned long flags;
>> +	void *ret;
>> +
>> +	xa_lock_irqsave(xa, flags);
>> +	ret = __xa_store(&elem->pool->xa, elem->index, NULL, GFP_KERNEL);
>> +	xa_unlock_irqrestore(xa, flags);
> 
> IIRC storing NULL is the same as erase, isn't it?  You have to store
> XA_ZERO_ENTRY if you want to keep an allocated NULL
This works just fine. You are correct for normal xarrays but for allocating xarrays
setting the link to NULL is treated differently and it remembers. Here is the comment
in <kernel>/Documentation

Allocating XArrays

If you use DEFINE_XARRAY_ALLOC() to define the XArray, or initialise it by passing XA_FLAGS_ALLOC to xa_init_flags(), the XArray changes to track whether entries are in use or not.

You can call xa_alloc() to store the entry at an unused index in the XArray. If you need to modify the array from interrupt context, you can use xa_alloc_bh() or xa_alloc_irq() to disable interrupts while allocating the ID.

Using xa_store(), xa_cmpxchg() or xa_insert() will also mark the entry as being allocated. Unlike a normal XArray, storing NULL will mark the entry as being in use, like xa_reserve(). To free an entry, use xa_erase() (or xa_release() if you only want to free the entry if it’s NULL).

> 
>> +	if (IS_ERR(ret))
>> +		return PTR_ERR(ret);
>> +	else
>> +		return 0;
>> +}
> 
> Same remark about the error handling
> 
>> @@ -491,6 +497,7 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
>>  	struct rxe_qp *qp = to_rqp(ibqp);
>>  	int ret;
>>  
>> +	rxe_hide(qp);
>>  	ret = rxe_qp_chk_destroy(qp);
>>  	if (ret)
>>  		return ret;
> 
> So we decided not to destroy the QP but wrecked it in the xarray?
> 
> Not convinced about the hide at all..
This particular example is pretty light weight since rxe_qp_chk_destroy only makes one
memory reference. But dereg_mr actually can do a lot of work and in either case
the idea as explained above is not to wreck the object but prevent rxe_pool_get_index(pool, index)
from succeeding and taking a new ref on the object. Once the verbs client has called a destroy
verb I don't see any reason why we should continue to process newly arriving packets forever which
is the only place in the driver where we convert object numbers to objects.

There is another issue which is not being dealt with here and that is partially torn down
objects. After this patch the flow for a destroy verb for an indexed object is

hide() =	"disable new lookups, e.g. xa_store(NULL)"
		"misc tear down code"
rxe_put() = 	"drop a reference, will be last one if the object is quiescent"
		"if necessary wait until last ref dropped"
		"object cleanup code, includes xa_erase()"

If objects are still holding references you have to be careful to make sure that
nothing in misc tear down code matters for an outstanding reference. Currently this all works
but if any change breaks things we have had to defer the change to the cleanup phase.
It doesn't work by design but just debugging in the past.

Bob

> 
> Jason