Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe

From: Ralph Campbell <rcampbell@nvidia.com>
To: Jason Gunthorpe <jgg@mellanox.com>
Cc: "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	Jerome Glisse <jglisse@redhat.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>
Subject: Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
Date: Tue, 17 Dec 2019 13:50:24 -0800
Message-ID: <59d4ea9e-3f6b-11c2-75d1-5baecd5b4ae2@nvidia.com>
In-Reply-To: <20191217205147.GI16762@mellanox.com>

On 12/17/19 12:51 PM, Jason Gunthorpe wrote:
> On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote:
>> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
>> be called safely from inside the invalidate() callback. This is fine for
>> devices with explicit memory region register and unregister calls but it
>> is desirable from a programming model standpoint to not require explicit
>> memory region registration. Regions can be registered based on device
>> address faults but without a mechanism for updating or removing the mmu
>> interval notifiers in response to munmap(), the invalidation callbacks
>> will be for regions that are stale or apply to different mmapped regions.
> 
> What we do in RDMA is drive the removal from a work queue, as we need
> a synchronize_srcu anyhow to serialize everything to do with
> destroying a part of the address space mirror.
> 
> Is it really necessary to have all this stuff just to save doing
> something like a work queue?

Well, the invalidate() callbacks already have to take the driver lock to
synchronize, so handling the range tracking updates semi-synchronously
seems more straightforward to me.

Do you feel strongly that adding a work queue is the right way to handle
this?

> Also, I think we are not taking core kernel APIs like this without an
> in-kernel user?

Right. I was looking for feedback before updating nouveau to use it.

>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 9e6caa8ecd19..55fbefcdc564 100644
>> +++ b/include/linux/mmu_notifier.h
>> @@ -233,11 +233,18 @@ struct mmu_notifier {
>>    * @invalidate: Upon return the caller must stop using any SPTEs within this
>>    *              range. This function can sleep. Return false only if sleeping
>>    *              was required but mmu_notifier_range_blockable(range) is false.
>> + * @release:	This function will be called when the mmu_interval_notifier
>> + *		is removed from the interval tree. Defining this function also
>> + *		allows mmu_interval_notifier_remove() and
>> + *		mmu_interval_notifier_update() to be called from the
>> + *		invalidate() callback function (i.e., they won't block waiting
>> + *		for invalidations to finish).
> 
> Having a function called remove that doesn't block seems like a very
> poor choice of language; we've tended to use put to describe that
> operation.
> 
> The difference is meaningful, as people often create use-after-free
> bugs in drivers when presented with interfaces named 'remove' or
> 'destroy' that don't actually guarantee there will be no
> continued accesses to the memory.

OK. I can rename it put().
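
To make the naming point concrete, here is a rough userspace sketch of
put-style semantics: the call drops a reference and returns without
waiting, and the object may only be freed from release(). Everything
named model_* is hypothetical and for illustration only; this is not
the proposed kernel interface, and the counting callback exists just so
the behavior can be observed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_notifier;

struct model_notifier_ops {
	/* Called exactly once, after the last reference is dropped. */
	void (*release)(struct model_notifier *n);
};

struct model_notifier {
	atomic_uint refs;
	const struct model_notifier_ops *ops;
};

/* Illustration-only callback: counts how often release() has run. */
int model_release_calls;

void model_count_release(struct model_notifier *n)
{
	(void)n;
	model_release_calls++;
}

const struct model_notifier_ops model_counting_ops = {
	.release = model_count_release,
};

/* Take an extra reference, e.g. while a callback is still using n. */
void model_notifier_get(struct model_notifier *n)
{
	atomic_fetch_add(&n->refs, 1);
}

/*
 * Drop one reference without blocking. Returns 1 if this call dropped
 * the last reference (release() has run), 0 if other users remain and
 * the object must still be considered live.
 */
int model_notifier_put(struct model_notifier *n)
{
	unsigned int old = atomic_fetch_sub(&n->refs, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1) {
		n->ops->release(n);
		return 1;
	}
	return 0;
}
```

A caller that sees put() return 0 cannot assume the callbacks have
stopped, which is exactly the guarantee a name like 'remove' would
wrongly suggest.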

>>    */
>>   struct mmu_interval_notifier_ops {
>>   	bool (*invalidate)(struct mmu_interval_notifier *mni,
>>   			   const struct mmu_notifier_range *range,
>>   			   unsigned long cur_seq);
>> +	void (*release)(struct mmu_interval_notifier *mni);
>>   };
>>   
>>   struct mmu_interval_notifier {
>> @@ -246,6 +253,8 @@ struct mmu_interval_notifier {
>>   	struct mm_struct *mm;
>>   	struct hlist_node deferred_item;
>>   	unsigned long invalidate_seq;
>> +	unsigned long deferred_start;
>> +	unsigned long deferred_last;
> 
> I couldn't quite understand how something like this can work, what is
> preventing parallel updates?

It is serialized by the struct mmu_notifier_mm lock.
If there are no tasks walking the interval tree, the update
happens synchronously under the lock. If there are walkers,
the start/last values are stored under the lock and the last caller's
values are used to update the interval tree when the last walker
finishes (under the lock again).
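
A simplified userspace model of that scheme, as a sketch only: the
model_* names, the pthread mutex standing in for the mmu_notifier_mm
lock, and the single-notifier shortcut (the patch keeps a deferred list
via deferred_item) are all assumptions made for illustration, not the
code in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel data structures. */
struct model_mm {
	pthread_mutex_t lock;		/* stands in for the mmu_notifier_mm lock */
	unsigned int active_walkers;	/* tasks currently walking the tree */
};

struct model_mni {
	unsigned long start, last;	/* range currently in the "tree" */
	unsigned long deferred_start;	/* values saved while walkers run */
	unsigned long deferred_last;
	bool update_pending;
};

/* A walker announces itself before iterating the interval tree. */
void model_walk_begin(struct model_mm *mm)
{
	pthread_mutex_lock(&mm->lock);
	mm->active_walkers++;
	pthread_mutex_unlock(&mm->lock);
}

/* Apply immediately if nobody is walking, otherwise record and defer. */
void model_update(struct model_mm *mm, struct model_mni *mni,
		  unsigned long start, unsigned long length)
{
	pthread_mutex_lock(&mm->lock);
	if (mm->active_walkers == 0) {
		mni->start = start;
		mni->last = start + length - 1;
	} else {
		/* A later caller overwrites an earlier one: last writer wins. */
		mni->deferred_start = start;
		mni->deferred_last = start + length - 1;
		mni->update_pending = true;
	}
	pthread_mutex_unlock(&mm->lock);
}

/* The last walker to finish applies any deferred update under the lock. */
void model_walk_end(struct model_mm *mm, struct model_mni *mni)
{
	pthread_mutex_lock(&mm->lock);
	assert(mm->active_walkers > 0);
	if (--mm->active_walkers == 0 && mni->update_pending) {
		mni->start = mni->deferred_start;
		mni->last = mni->deferred_last;
		mni->update_pending = false;
	}
	pthread_mutex_unlock(&mm->lock);
}
```

In this model, two updates issued while walkers are active leave the
visible range unchanged until the final model_walk_end(), at which
point only the second caller's start/last values take effect.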

>> +/**
>> + * mmu_interval_notifier_update - Update interval notifier end
>> + * @mni: Interval notifier to update
>> + * @start: New starting virtual address to monitor
>> + * @length: New length of the range to monitor
>> + *
>> + * This function updates the range being monitored.
>> + * If there is no release() function defined, the call will wait for the
>> + * update to finish before returning.
>> + */
>> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
>> +				 unsigned long start, unsigned long length)
>> +{
> 
> Update should probably be its own patch
> 
> Jason

OK.
Thanks for the review.