From: "Kuehling, Felix" <Felix.Kuehling@amd.com>
To: Jason Gunthorpe <jgg@ziepe.ca>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Jerome Glisse <jglisse@redhat.com>,
Ralph Campbell <rcampbell@nvidia.com>,
John Hubbard <jhubbard@nvidia.com>
Cc: "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
Ben Skeggs <bskeggs@redhat.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"Zhou, David(ChunMing)" <David1.Zhou@amd.com>,
Dennis Dalessandro <dennis.dalessandro@intel.com>,
Juergen Gross <jgross@suse.com>,
Mike Marciniszyn <mike.marciniszyn@intel.com>,
Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>,
Petr Cvek <petrcvekcz@gmail.com>,
Stefano Stabellini <sstabellini@kernel.org>,
"nouveau@lists.freedesktop.org" <nouveau@lists.freedesktop.org>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
Christoph Hellwig <hch@infradead.org>,
Jason Gunthorpe <jgg@mellanox.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
Date: Tue, 29 Oct 2019 22:04:45 +0000
Message-ID: <786ee79d-00a6-9147-f410-d8856da35511@amd.com>
In-Reply-To: <20191028201032.6352-3-jgg@ziepe.ca>
I haven't had enough time yet to fully understand the deferred logic in this
change, but I spotted one problem. See my comments inline.
On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. 4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages(). Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
> - An interval tree tracking VA ranges, with per-range callbacks
> - A read/write locking scheme for the interval tree that avoids
> sleeping in the notifier path (for OOM killer)
> - A sequence counter based collision-retry locking scheme to tell
> device page fault that a VA range is being concurrently invalidated.
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
> invalidates are done, via active_invalidate_ranges count.
> This approach avoids having to intersect the interval tree twice (as
> umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
> via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
> This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
> write side by protecting it with active_invalidate_ranges and a spinlock
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Acked-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
> include/linux/mmu_notifier.h | 98 +++++++
> mm/Kconfig | 1 +
> mm/mmu_notifier.c | 533 +++++++++++++++++++++++++++++++++--
> 3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
> * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
> * can't go away from under us as exit_mmap holds an mm_count pin
> * itself.
> */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> + struct mm_struct *mm)
> {
> struct mmu_notifier *mn;
> int id;
>
> + if (mmn_mm->has_interval)
> + mn_itree_release(mmn_mm, mm);
> +
> + if (hlist_empty(&mmn_mm->list))
> + return;
This seems to duplicate the checks in __mmu_notifier_release. See my
comments below; I think one of the two places is wrong. I suspect it's this
one, because __mmu_notifier_release follows the same pattern as the other
notifier callbacks.
> +
> /*
> * SRCU here will block mmu_notifier_unregister until
> * ->release returns.
> */
> id = srcu_read_lock(&srcu);
> - hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> + hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
> /*
> * If ->release runs before mmu_notifier_unregister it must be
> * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
> if (mn->ops->release)
> mn->ops->release(mn, mm);
>
> - spin_lock(&mm->mmu_notifier_mm->lock);
> - while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> - mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> + spin_lock(&mmn_mm->lock);
> + while (unlikely(!hlist_empty(&mmn_mm->list))) {
> + mn = hlist_entry(mmn_mm->list.first,
> struct mmu_notifier,
> hlist);
> /*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
> */
> hlist_del_init_rcu(&mn->hlist);
> }
> - spin_unlock(&mm->mmu_notifier_mm->lock);
> + spin_unlock(&mmn_mm->lock);
> srcu_read_unlock(&srcu, id);
>
> /*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
> synchronize_srcu(&srcu);
> }
>
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> + if (mmn_mm->has_interval)
> + mn_itree_release(mmn_mm, mm);
If mmn_mm->list is not empty, mn_itree_release will be called twice,
because mn_hlist_release duplicates this call.
> +
> + if (!hlist_empty(&mmn_mm->list))
> + mn_hlist_release(mmn_mm, mm);
mn_hlist_release already checks the same condition itself.
> +}
> +
> /*
> * If no young bitflag is supported by the hardware, ->clear_flush_young can
> * unmap the address and return 1 or 0 depending if the mapping previously
[snip]