From: John Hubbard <jhubbard@nvidia.com>
To: Jason Gunthorpe <jgg@mellanox.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"Jerome Glisse" <jglisse@redhat.com>,
"Ralph Campbell" <rcampbell@nvidia.com>,
"Felix.Kuehling@amd.com" <Felix.Kuehling@amd.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Ben Skeggs" <bskeggs@redhat.com>,
"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
"Christian König" <christian.koenig@amd.com>,
"David Zhou" <David1.Zhou@amd.com>,
"Dennis Dalessandro" <dennis.dalessandro@intel.com>,
"Juergen Gross" <jgross@suse.com>,
"Mike Marciniszyn" <mike.marciniszyn@intel.com>,
"Oleksandr Andrushchenko" <oleksandr_andrushchenko@epam.com>,
"Petr Cvek" <petrcvekcz@gmail.com>,
"Stefano Stabellini" <sstabellini@kernel.org>,
"nouveau@lists.freedesktop.org" <nouveau@lists.freedesktop.org>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
"Christoph Hellwig" <hch@infradead.org>,
"Andrea Arcangeli" <aarcange@redhat.com>,
"Michal Hocko" <mhocko@kernel.org>
Subject: Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
Date: Thu, 7 Nov 2019 12:53:56 -0800
Message-ID: <9dc2b3c7-f945-b645-b3a3-313a21d2fdfc@nvidia.com>
In-Reply-To: <20191107200604.GB21728@mellanox.com>
On 11/7/19 12:06 PM, Jason Gunthorpe wrote:
...
>>
>> Also, it is best moved down to be next to the new MNR structs, so that all the
>> MNR stuff is in one group.
>
> I agree with Jerome, this enum is part of the 'struct
> mmu_notifier_range' (ie the description of the invalidation) and it
> doesn't really matter that only these new notifiers can be called with
> this type, it is still part of the mmu_notifier_range.
>
OK.
> The comment already says it only applies to the mmu_range_notifier
> scheme..
>
>>> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>>> @@ -222,6 +228,26 @@ struct mmu_notifier {
>>> unsigned int users;
>>> };
>>>
>>
>> That should also be moved down, next to the new structs.
>
> Which this?
I was referring to MMU_NOTIFIER_RANGE_BLOCKABLE, above. I was trying
to put all the new range notifier stuff in one place. But maybe not,
if these are really not as separate as I thought.
>
>>> +/**
>>> + * struct mmu_range_notifier_ops
>>> + * @invalidate: Upon return the caller must stop using any SPTEs within this
>>> + * range, this function can sleep. Return false if blocking was
>>> + * required but range is non-blocking
>>> + */
>>
>> How about this (I'm not sure I fully understand the return value, though):
>>
>> /**
>> * struct mmu_range_notifier_ops
>> * @invalidate: Upon return the caller must stop using any SPTEs within this
>> * range.
>> *
>> * This function is permitted to sleep.
>> *
>> * @Return: false if blocking was required, but @range is
>> * non-blocking.
>> *
>> */
>
> Is this kdoc format for function pointers?
heh, I'm sort of winging it; I'm not sure how function pointers are supposed
to be documented in kdoc. Actually, the only key take-away here is to write
"This function can sleep"
as a separate sentence.
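
For what it's worth, kernel-doc handles function-pointer members the same way
as any other struct member, so something like the following (hypothetical
wording, forward declarations added just to make it a self-contained sketch)
would be the conventional shape:

```c
struct mmu_range_notifier;
struct mmu_notifier_range;

/**
 * struct mmu_range_notifier_ops - callbacks for a range notifier
 * @invalidate: Upon return the caller must stop using any SPTEs within
 *	the range. This function can sleep. Returns false if blocking
 *	was required but the range is non-blocking.
 *
 * Function-pointer members use the same @name: lines as data members;
 * kernel-doc has no separate syntax for them.
 */
struct mmu_range_notifier_ops {
	bool (*invalidate)(struct mmu_range_notifier *mrn,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq);
};
```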
...
>> c) Rename new one. Ideas:
>>
>> struct mmu_interval_notifier
>> struct mmu_range_intersection
>> ...other ideas?
>
> This odd duality has already caused some confusion, but names here are
> hard. mmu_interval_notifier is the best alternative I've heard.
>
> Changing this name is a lot of work - are we happy
> 'mmu_interval_notifier' is the right choice?
Yes, it's my favorite too. I'd vote for going with that.
...
>>
>>
>> OK, this either needs more documentation and assertions, or a different
>> approach. Because I see addition, subtraction, AND, OR and booleans
>> all being applied to this field, and it's darn near hopeless to figure
>> out whether or not it really is even or odd at the right times.
>
> This is a standard design for a seqlock scheme and follows the
> existing design of the linux seq lock.
>
> The lower bit indicates the locked state and the upper bits indicate
> the generation of the lock.
>
> The operations on the lock itself are then:
> seq |= 1 # Take the lock
> seq++ # Release an acquired lock
> seq & 1 # True if locked
>
> Which is how this is written
Very nice, would you be open to putting that into (any) one of the comment
headers? That's an unusually clear and concise description:
/*
 * This is a standard design for a seqlock scheme and follows the
 * existing design of the Linux seqlock.
 *
 * The lower bit indicates the locked state and the upper bits indicate
 * the generation of the lock.
 *
 * The operations on the lock itself are then:
 *	seq |= 1	# Take the lock
 *	seq++		# Release an acquired lock
 *	seq & 1		# True if locked
 */
>
>> Different approach: why not just add a mmn_mm->is_invalidating
>> member variable? It's not like you're short of space in that struct.
>
> Splitting it makes a lot of stuff more complex and unnatural.
>
OK, agreed.
> The ops above could be put in inline wrappers, but they occur only
> in functions already called mn_itree_inv_start_range() and
> mn_itree_inv_end() and mn_itree_is_invalidating().
>
> There is the one 'take the lock' outlier in
> __mmu_range_notifier_insert() though
>
>>> +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>>> +{
>>> + struct mmu_range_notifier *mrn;
>>> + struct hlist_node *next;
>>> + bool need_wake = false;
>>> +
>>> + spin_lock(&mmn_mm->lock);
>>> + if (--mmn_mm->active_invalidate_ranges ||
>>> + !mn_itree_is_invalidating(mmn_mm)) {
>>> + spin_unlock(&mmn_mm->lock);
>>> + return;
>>> + }
>>> +
>>> + mmn_mm->invalidate_seq++;
>>
>> Is this the right place for an assertion that this is now an even value?
>
> Yes, but I'm reluctant to add such a runtime check on this fast-ish path.
> How about a comment?
Sure.
>
>>> + need_wake = true;
>>> +
>>> + /*
>>> + * The inv_end incorporates a deferred mechanism like
>>> + * rtnl_lock(). Adds and removes are queued until the final inv_end
>>
>> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
>> But I suppose if one studies that file closely there is more. :)
>
> Let's change that to rtnl_unlock() then
Thanks :)
...
>>> + * mrn->invalidate_seq is always set to an odd value. This ensures
>>
>> This claim just looks wrong the first N times one reads the code, given that
>> there is mmu_range_set_seq() to set it to an arbitrary value! Maybe
>> you mean
>
> mmu_range_set_seq() is NOT to be used to set to an arbitrary value; it
> must only be used to set to the value provided in the invalidate()
> callback, and that value is always odd. Let's make this super clear:
>
> /*
> * mrn->invalidate_seq must always be set to an odd value via
> * mmu_range_set_seq() using the provided cur_seq from
> * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> * will always clear the below sleep in some reasonable time as
> * mmn_mm->invalidate_seq is even in the idle state.
> */
>
OK, that helps a lot.
...
>>> + mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
>>
>> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
>> for seq numbers here? I'm acutely unhappy trying to figure this out.
>> I suspect it's another unfortunate side effect of trying to use the
>> lower bit of the seq number (even/odd) for something else.
>
> No, this is actually done for the seq number itself. We need to
> generate a seq number that is != the current invalidate_seq as this
> new mrn is not invalidating.
>
> The best seq to use is one that the invalidate_seq will not reach for
> a long time, ie 'invalidate_seq + MAX' which is expressed as -1
>
> The even/odd thing just takes care of itself naturally here as
> invalidate_seq is guaranteed even and -1 creates both an odd mrn value
> and a good seq number.
>
> The algorithm would actually work correctly if this was
> 'mrn->invalidate_seq = 1', but occasionally things would block when
> they don't need to block.
>
> Let's add a comment:
>
> /*
> * The starting seq for a mrn not under invalidation should be
> * odd, not equal to the current invalidate_seq and
> * invalidate_seq should not 'wrap' to the new seq any time
> * soon.
> */
Very helpful. How about this additional tweak:
/*
 * The starting seq for a mrn not under invalidation should be
 * odd, not equal to the current invalidate_seq, and
 * invalidate_seq should not 'wrap' to the new seq any time
 * soon. Subtracting 1 from the current (even) value achieves that.
 */
>
>>> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
>>> + unsigned long start, unsigned long length,
>>> + struct mm_struct *mm)
>>> +{
>>> + struct mmu_notifier_mm *mmn_mm;
>>> + int ret;
>>
>> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
>> I'll check on that. It's correct here, though.
>
> Looks OK in my tree?
Nope, that's how I found it. The top of your mmu_notifier branch has this:
int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
	struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
	int ret = 0;

	if (mmn_mm->has_interval) {
		ret = mn_itree_invalidate(mmn_mm, range);
		if (ret)
			return ret;
	}

	if (!hlist_empty(&mmn_mm->list))
		return mn_hlist_invalidate_range_start(mmn_mm, range);

	return 0;
}
>
>>> + might_lock(&mm->mmap_sem);
>>> +
>>> + mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
>>
>> What does the above pair with? Should have a comment that specifies that.
>
> smp_load_acquire() always pairs with smp_store_release() to the same
> memory, there is only one store, is a comment really needed?
>
> Below are the comment updates I made, thanks!
>
> Jason
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 51b92ba013ddce..065c95002e9602 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -302,15 +302,15 @@ void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
> /**
> * mmu_range_set_seq - Save the invalidation sequence
> * @mrn - The mrn passed to invalidate
> - * @cur_seq - The cur_seq passed to invalidate
> + * @cur_seq - The cur_seq passed to the invalidate() callback
> *
> * This must be called unconditionally from the invalidate callback of a
> * struct mmu_range_notifier_ops under the same lock that is used to call
> * mmu_range_read_retry(). It updates the sequence number for later use by
> - * mmu_range_read_retry().
> + * mmu_range_read_retry(). The provided cur_seq will always be odd.
> *
> - * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
> - * then this call is not required.
> + * If the caller does not call mmu_range_read_begin() or
> + * mmu_range_read_retry() then this call is not required.
> */
> static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
> unsigned long cur_seq)
> @@ -348,8 +348,9 @@ static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> * collided with this lock and a future mmu_range_read_retry() will return
> * true.
> *
> - * False is not reliable and only suggests a collision has not happened. It
> - * can be called many times and does not have to hold the user provided lock.
> + * False is not reliable and only suggests a collision may not have
> + * occurred. It can be called many times and does not have to hold the user
> + * provided lock.
> *
> * This call can be used as part of loops and other expensive operations to
> * expedite a retry.
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 2b7485919ecfeb..afe1e2d94183f8 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -51,7 +51,8 @@ struct mmu_notifier_mm {
> * This is a collision-retry read-side/write-side 'lock', a lot like a
> * seqcount, however this allows multiple write-sides to hold it at
> * once. Conceptually the write side is protecting the values of the PTEs in
> - * this mm, such that PTES cannot be read into SPTEs while any writer exists.
> + * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
> + * writer exists.
> *
> * Note that the core mm creates nested invalidate_range_start()/end() regions
> * within the same thread, and runs invalidate_range_start()/end() in parallel
> @@ -64,12 +65,13 @@ struct mmu_notifier_mm {
> *
> * The write side has two states, fully excluded:
> * - mm->active_invalidate_ranges != 0
> - * - mnn->invalidate_seq & 1 == True
> + * - mnn->invalidate_seq & 1 == True (odd)
> * - some range on the mm_struct is being invalidated
> * - the itree is not allowed to change
> *
> * And partially excluded:
> * - mm->active_invalidate_ranges != 0
> + * - mnn->invalidate_seq & 1 == False (even)
> * - some range on the mm_struct is being invalidated
> * - the itree is allowed to change
> *
> @@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> return;
> }
>
> + /* Make invalidate_seq even */
> mmn_mm->invalidate_seq++;
> need_wake = true;
>
> /*
> * The inv_end incorporates a deferred mechanism like
> - * rtnl_lock(). Adds and removes are queued until the final inv_end
> + * rtnl_unlock(). Adds and removes are queued until the final inv_end
> * happens then they are progressed. This arrangement for tree updates
> * is used to avoid using a blocking lock during
> * invalidate_range_start.
> @@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
> spin_unlock(&mmn_mm->lock);
>
> /*
> - * mrn->invalidate_seq is always set to an odd value. This ensures
> - * that if seq does wrap we will always clear the below sleep in some
> - * reasonable time as mmn_mm->invalidate_seq is even in the idle
> - * state.
> + * mrn->invalidate_seq must always be set to an odd value via
> + * mmu_range_set_seq() using the provided cur_seq from
> + * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> + * will always clear the below sleep in some reasonable time as
> + * mmn_mm->invalidate_seq is even in the idle state.
> */
> lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> @@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> mrn->invalidate_seq = mmn_mm->invalidate_seq;
> } else {
> WARN_ON(mn_itree_is_invalidating(mmn_mm));
> + /*
> + * The starting seq for a mrn not under invalidation should be
> + * odd, not equal to the current invalidate_seq and
> + * invalidate_seq should not 'wrap' to the new seq any time
> + * soon.
> + */
> mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> }
>
Looks good. We're just polishing up minor points now, so you can add:
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
thanks,
--
John Hubbard
NVIDIA