From: Peter Xu <peterx@redhat.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: rcampbell@nvidia.com, willy@infradead.org,
	linux-doc@vger.kernel.org, nouveau@lists.freedesktop.org,
	hughd@google.com, linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org, hch@infradead.org,
	linux-mm@kvack.org, shakeelb@google.com, bskeggs@redhat.com,
	jgg@nvidia.com, akpm@linux-foundation.org,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access
Date: Fri, 11 Jun 2021 11:01:42 -0400	[thread overview]
Message-ID: <YMN61r0wdg88OM8r@t490s> (raw)
In-Reply-To: <2683185.ETRjo6vMkr@nvdebian>

On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have a similar effect to an
> > > > > > explicit call to split_huge_pmd_address(), afaict, since both of them use
> > > > > > __split_huge_pmd() internally, which will generate that unwanted CLEAR notify.
> > > > >
> > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > >
> > > > >       if (likely(!pmd_trans_huge(pmdval)))
> > > > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > >
> > > > > So I don't think we have a problem here.
> > > >
> > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > > > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > >
> > > That seems correct - if the thp is not mapped with a pmd we won't split and we
> > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> > > is fine - we will retry, but the retry won't CLEAR because the pmd has
> > > already been split.
> > 
> > Aha!
> > 
> > >
> > > The issue with doing it unconditionally in make_device_exclusive() is that
> > > you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> > > understanding, please let me know if it doesn't make sense.
> > 
> > Exactly.  But you see what I meant here: even if it can work like this, it
> > still sounds fragile, doesn't it?  I just feel something is slightly off there..
> > 
> > IMHO split_huge_pmd() checks the pmd before calling __split_huge_pmd() for
> > performance, afaict: if it's not a thp even without locking, then it never
> > will be, so a further __split_huge_pmd() is unnecessary.
> > 
> > IOW, it would be perfectly legal if someday we'd like to let split_huge_pmd()
> > call __split_huge_pmd() directly; then AFAIU the device exclusive API will be
> > the first one broken by that seemingly-irrelevant change, I'm afraid..
> 
> Well, I would argue the performance of mmu notifiers is becoming increasingly
> important, and a change that causes them to be called unnecessarily is
> therefore not very legal. Likely the correct fix here is to optimise
> __split_huge_pmd() to only call the notifier if it's actually going to split a
> pmd. As you said though, that's a completely different story which I think
> would best be done as a separate series.
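
(For reference, one shape such an optimisation could take - purely a sketch
on my side, not tested - is an opportunistic pre-check in __split_huge_pmd()
before the CLEAR notify is set up, with the authoritative check still redone
under the pmd lock:

	/* sketch: at the top of __split_huge_pmd(), before
	 * mmu_notifier_invalidate_range_start() */
	if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) &&
	    !pmd_devmap(*pmd))
		return;		/* nothing to split; skip the CLEAR notify */

The unlocked read is racy, but as discussed above a pmd that is not huge at
this point won't become one under us, which is the same reasoning the
split_huge_pmd() wrapper already relies on.)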

Right, maybe I can look a bit more into that later; but my whole point was
that one piece of functionality shouldn't depend on such a trivial
implementation detail of another module (thp split in this case).
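
(For concreteness, the detail in question is just the inline check in the
split_huge_pmd() wrapper - quoted roughly from include/linux/huge_mm.h as of
this series, so it may differ slightly in other trees:

	#define split_huge_pmd(__vma, __pmd, __address)			\
		do {							\
			pmd_t *____pmd = (__pmd);			\
			if (is_swap_pmd(*____pmd) ||			\
			    pmd_trans_huge(*____pmd) ||			\
			    pmd_devmap(*____pmd))			\
				__split_huge_pmd(__vma, ____pmd,	\
						 __address, false, NULL);\
		} while (0)

Dropping that pre-check would be a small, locally-reasonable change, which is
exactly why relying on it from the device exclusive path feels fragile.)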

> 
> > This lets me go back a step to think about why we need this notifier at all
> > to cover this whole range of the make_device_exclusive() procedure..
> > 
> > What I am thinking is, we're afraid some CPU will access this page, so the pte
> > gets quickly restored while the device atomic operation is going on.  Then with
> > this notifier we'll be able to cancel it.  Makes perfect sense.
> > 
> > However, do we really need to register this notifier so early?  The thing is
> > that the GPU driver still holds all the page locks, so even if there's a race
> > to restore the ptes, it will block on taking the page lock until the driver
> > releases it.
> > 
> > IOW, I'm wondering whether the "non-fragile" way to do this is to not do
> > mmu_interval_notifier_insert() that early: what if we register that notifier
> > after make_device_exclusive_range() returns but before unlock_page() somehow?
> > So before unlock_page(), the race is fully protected by the lock itself; after
> > that, it's done by the mmu notifier.  Then maybe we don't need to worry about
> > all these notifications during marking exclusive (which we shouldn't have to)?
> 
> The notifier is needed to protect against races with pte changes. Once a page
> has been marked for exclusive access the driver will update its page tables to
> allow atomic access to the page. However, in the meantime the page could become
> unmapped entirely or write protected.
> 
> As I understand things the page lock won't protect against these kinds of pte
> changes, hence the need for mmu_interval_read_begin/retry, which allows the
> driver to hold a mutex protecting against invalidations by blocking the
> notifier until the device page tables have been updated.

Indeed, I suppose you mean change_pte_range() and zap_pte_range()
respectively.
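
(To make sure we're talking about the same thing - my understanding of the
driver side as a minimal sketch, not the actual nouveau code; "driver_lock"
and the surrounding setup are made up here:

	/* assumes: struct mmu_interval_notifier mni already registered
	 * over [start, end) with mmu_interval_notifier_insert() */
	unsigned long seq;
	int npages;

again:
	seq = mmu_interval_read_begin(&mni);
	npages = make_device_exclusive_range(mm, start, end, pages, owner);
	/* ... prepare the device ptes for atomic access ... */
	mutex_lock(&driver_lock);
	if (mmu_interval_read_retry(&mni, seq)) {
		/* an invalidation raced in; undo and try again */
		mutex_unlock(&driver_lock);
		goto again;
	}
	/* commit the device ptes */
	mutex_unlock(&driver_lock);

where the invalidate callback takes the same driver_lock and bumps the
sequence, so any CLEAR landing in the window forces a retry here.)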

Do you think we can restore the pte right before the wr-protect or zap?  Then
everything serializes with the page lock (btw: it's already an insane
userspace that either unmaps a page or wr-protects a page while it knows the
device is using it!).  If these are the only two cases, it still sounds like a
cleaner approach to me than the current one.
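
(As a purely hypothetical sketch of what that could look like - the helper
name restore_device_exclusive_pte() is invented here, and the entry check
uses this series' naming as I read it:

	/* in change_pte_range() / zap_pte_range(), for a swap pte */
	swp_entry_t entry = pte_to_swp_entry(oldpte);

	if (is_device_exclusive_entry(entry)) {
		/*
		 * Take the page lock to serialize against the driver,
		 * restore the original pte, then continue with the
		 * wr-protect or zap as usual.
		 */
		restore_device_exclusive_pte(vma, addr, pte, entry);
	}

That way the page lock alone would cover both cases.)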

This also reminded me that right now the cpu pgtable recovery is lazy - it
happens either at fork() or at a cpu page fault.  Even after the device has
finished using the page, the swap ptes stay there.

What if the device tries to do an atomic op on the same page twice?  I am not
sure whether that means we may also want to teach both GUP (mainly
follow_page_pte() for now, before pmd support) and the
page_make_device_exclusive() process to understand device exclusive entries
too?  Another option seems to be restoring the pte after the device finishes
using it, as long as the device knows when that is.
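
(E.g., an illustrative sketch of the follow_page_pte() side - untested, and
the label is hypothetical, just to show where the knowledge would go:

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/*
		 * The page behind a device exclusive entry is still
		 * resident; we could restore the pte here under the
		 * ptl instead of failing the GUP or faulting.
		 */
		if (is_device_exclusive_entry(entry))
			goto restore_exclusive;	/* hypothetical label */
	}

The other option, restoring after the device is done, would avoid touching
GUP at all.)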

-- 
Peter Xu


