nouveau.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Alistair Popple <apopple@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: rcampbell@nvidia.com, willy@infradead.org,
	linux-doc@vger.kernel.org, nouveau@lists.freedesktop.org,
	hughd@google.com, linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org, hch@infradead.org,
	linux-mm@kvack.org, shakeelb@google.com, bskeggs@redhat.com,
	jgg@nvidia.com, akpm@linux-foundation.org,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access
Date: Wed, 16 Jun 2021 12:47:36 +1000	[thread overview]
Message-ID: <3687765.3f4LTt0jjm@nvdebian> (raw)
In-Reply-To: <YMjUZX8Sy0PuPt6j@t490s>

On Wednesday, 16 June 2021 2:25:09 AM AEST Peter Xu wrote:
> On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> > On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:

[...]

> > > Do you think we can restore pte right before wr-protect or zap?  Then all
> > > things serializes with page lock (btw: it's already an insane userspace to
> > > either unmap a page or wr-protect a page if it knows the device is using it!).
> > > If these are the only two cases, it still sounds a cleaner approach to me than
> > > the current approach.
> >
> > Perhaps we could but it would make {zap|change}_pte_range() much more complex as
> > we can't sleep taking the page lock whilst holding the ptl, so we'd have to
> > implement a retry scheme similar to copy_pte_range() in both those functions as
> > well.
> 
> Yes, but shouldn't be hard to do so, imho. E.g., see when __tlb_remove_page()
> returns true in zap_pte_range(), so we already did something like that.  IMHO
> it's not uncommon to have such facilities as we do have requirements to sleep
> during a spinlock critical section for a lot of places in mm, so we release
> them when needed and retake.

Agreed, it's not hard to do and it's a common enough pattern. However we decided
that for such a specific application this (trying to take the lock or drop locks
and retry) was too complex for copy_pte_range() so it seems like the same should
apply here.

Admittedly copy_pte_range() already had several other retry paths so perhaps
it was adding yet another that made it relatively more complex. Overall I have
been trying to minimise the impact on core mm code for this feature, and adding
this pattern to zap_pte_range(), etc. would make it more complex for any future
addition that requires locks to be dropped and retried so I guess in that sense
it is no different.

> > Given mmu_interval_read_begin/retry was IMHO added to solve this type of
> > problem (freezing pte's to safely program device pte's) it seems like the
> > better option rather than adding more complex code to generic mm paths.
> >
> > It's also worth noting i915 seems to use mmu_interval_read_begin/retry() with
> > gup to sync mappings so this isn't an entirely new concept. I'm not an expert
> > in that driver but I imagine changing gup to generate unconditional mmu notifier
> > invalidates would also cause issues there. So I think overall this is the
> > cleanest solution as it reduces the amount of code (particularly in generic mm
> > paths).
> 
> I could be wrong somewhere, but to me depending on mmu notifiers being
> "accurate" in general is fragile..
> 
> Take an example of change_pte_range(), which will generate PROTECTION_VMA
> notifies.  Let's imaging an userspace calls mprotect() e.g. twice or even more
> times with the same PROT_* and upon the same region, we know very possibly the
> 2nd,3rd,... calls will generate those notifies with totally no change to the
> pgtable at all as they're all done on the 1st shot.  However we'll generate mmu
> notifies anyways for the 2nd,3rd,... calls.  It means mmu notifiers should
> really be tolerant of false positives as it does happen, and such thing can be
> triggered even from userspace system calls very easily like this.  That's why I
> think any kernel facility that depends on mmu notifiers being accurate is
> probably not the right approach..

Argh, thanks. I was focused on the specifics of this series but I think I
understand your point better now - that as a more general principle we can't
assume notifiers are accurate.

> But yeah as you said I think it's working as is with the series (I think the
> follow_pmd_mask() checking pmd_trans_huge before calling split_huge_pmd is a
> double safety-net for it, so even if the GUP split_huge_pmd got replaced with
> __split_huge_pmd it should still work with the one-retry logic), not sure
> whether it matters a lot, as it's not common mm path; I think I'll step back so
> Andrew could still pick it up as wish, I'm just still not fully convinced it's
> the best solution to have for a long term to depend on that..

Ok, thanks. I guess you have somewhat convinced me - depending on it for the
long term might be a bit fragile. However as you say the current implementation
does work and I am starting to look at support for PMD and file backed pages
which require changes here anyway. So I am hoping Andrew can still take this
(once rebased) as it would be easier for me to do those changes if the basic
support and clean ups were already in place.

> > > This also reminded me that right now the cpu pgtable recovery is lazy - it
> > > happens either from fork() or a cpu page fault.  Even after device finished
> > > using it, swap ptes keep there.
> > >
> > > What if the device tries to do atomic op on the same page twice?  I am not sure
> > > whether it means we may also want to teach both GUP (majorly follow_page_pte()
> > > for now before pmd support) and process of page_make_device_exclusive() with
> > > understanding the device exclusive entries too?  Another option seems to be
> > > restoring pte after device finish using it, as long as the device knows when.
> >
> > I don't think we need to complicate follow_page_pte() with knowledge of
> > exclusive entries. GUP will just restore the original pte via the normal
> > fault path - follow_page_pte() will return NULL for an exclusive entry,
> > resulting in handle_mm_path() getting called via faultin_page(). Therefore
> > a driver calling make_device_exclusive() twice on the same page won't cause an
> > issue. Also the device shouldn't fault on subsequent accesses if the exclusive
> > entry is still in place anyway.
> 
> Right, looks good then.
> 
> >
> > We can't restore the pte when the device is finished with it because there is
> > no way of knowing when a device is done using an exclusive entry - device
> > pte's work much the same as cpu pte's in that regard.
> 
> I see, I feel like I understand how it works slightly better now, thanks.

Feel free to ask if there are any more details you want, but there's nothing too
magical going on here.

> One last pure question: I see nouveau_atomic_range_fault() will call the other
> nvif_object_ioctl() which seems to do the device pgtable mapping, am I right?

Correct - that installs the page table mapping on the GPU.

> Then I see the notifier is quickly removed before nouveau_atomic_range_fault()
> returns.  What happens if CPU access happens after mmu notifier removed?  Or is
> it not possible to happen?

So there are two notifiers registered - this one and a process wide notifier
(see nouveau_mn_ops). In this case the process wide notifier will get called
to invalidate the access when the CPU fault removes the device exclusive
entries.

 - Alistair

> --
> Peter Xu
> 



_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

  reply	other threads:[~2021-06-16  2:47 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-07  7:58 [Nouveau] [PATCH v10 00/10] Add support for SVM atomics in Nouveau Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 01/10] mm: Remove special swap entry functions Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 02/10] mm/swapops: Rework swap entry manipulation code Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 03/10] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 04/10] mm/rmap: Split migration into its own function Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 05/10] mm: Rename migrate_pgmap_owner Alistair Popple
2021-06-08 15:16   ` Peter Xu
2021-06-07  7:58 ` [Nouveau] [PATCH v10 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte() Alistair Popple
2021-06-08 15:19   ` Peter Xu
2021-06-07  7:58 ` [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access Alistair Popple
2021-06-08 18:33   ` Peter Xu
2021-06-09  9:38     ` Alistair Popple
2021-06-09 16:05       ` Peter Xu
2021-06-10  0:18         ` Alistair Popple
2021-06-10 18:04           ` Peter Xu
2021-06-10 14:21             ` Alistair Popple
2021-06-10 23:04               ` Peter Xu
2021-06-10 23:17                 ` Alistair Popple
2021-06-11  1:00                   ` Peter Xu
2021-06-11  3:43                     ` Alistair Popple
2021-06-11 15:01                       ` Peter Xu
2021-06-15  3:08                         ` Alistair Popple
2021-06-15 16:25                           ` Peter Xu
2021-06-16  2:47                             ` Alistair Popple [this message]
2021-06-07  7:58 ` [Nouveau] [PATCH v10 08/10] mm: Selftests for exclusive device memory Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 09/10] nouveau/svm: Refactor nouveau_range_fault Alistair Popple
2021-06-07  7:58 ` [Nouveau] [PATCH v10 10/10] nouveau/svm: Implement atomic SVM access Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3687765.3f4LTt0jjm@nvdebian \
    --to=apopple@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=bskeggs@redhat.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jgg@nvidia.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nouveau@lists.freedesktop.org \
    --cc=peterx@redhat.com \
    --cc=rcampbell@nvidia.com \
    --cc=shakeelb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).