From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@suse.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>, Christoph Hellwig <hch@lst.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jann Horn <jannh@google.com>
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
Date: Wed, 23 Sep 2020 09:10:43 -0400
Message-ID: <20200923131043.GA59978@xz-x1>
In-Reply-To: <20200923002735.GN19098@xz-x1>

On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f3521be80ca..6591f3f33299 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >                  * Because we'll need to release the locks before doing cow,
> > >                  * pass this work to upper layer.
> > >                  */
> > > -               if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > -                   page_maybe_dma_pinned(page)) {
> > > +               if (wp && page_maybe_dma_pinned(page) &&
> > > +                   READ_ONCE(src_mm->has_pinned)) {
> > >                         /* We've got the page already; we're safe */
> > >                         data->cow_old_page = page;
> > >                         data->cow_oldpte = *src_pte;
> > > 
> > > I can also add some more comments to emphasize this.
> > 
> > It is not just that, but the ptep_set_wrprotect() has to be done
> > earlier.
> 
> Now I understand your point, I think..  So I guess it's not only about
> has_pinned; it's a race between fast-gup and the fork() code, and it exists
> even if has_pinned is always set.
> 
> > 
> > Otherwise it races like:
> > 
> >    pin_user_pages_fast()                   fork()
> >     atomic_set(has_pinned, 1);
> >     [..]
> >                                            atomic_read(page->_refcount) //false
> >                                            // skipped atomic_read(has_pinned)
> >     atomic_add(page->_refcount)
> >     ordered check write protect()
> >                                            ordered set write protect()
> > 
> > And now we have a write protect on a DMA pinned page, which breaks the
> > invariant we are trying to create.
> > 
> > The best algorithm I've thought of is something like:
> > 
> >  pte_map_lock()
> >   if (page) {
> >       if (wp) {
> > 	  ptep_set_wrprotect()
> > 	  /* Order with try_grab_compound_head(), either we see
> > 	   * page_maybe_dma_pinned(), or they see the wrprotect */
> > 	  get_page();
> 
> Must this get_page() be explicitly ordered after ptep_set_wrprotect()?  IIUC
> what we need is to order ptep_set_wrprotect() against page_maybe_dma_pinned()
> here.  E.g., would an "mb()" work?
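> 
> Something like below is what I'm thinking of (a sketch only; the smp_mb()
> placement is my own assumption, not verified):
> 
> 	ptep_set_wrprotect(src_mm, addr, src_pte);
> 	/*
> 	 * Pair with the refcount update in try_grab_compound_head():
> 	 * either we observe the elevated refcount below, or fast-gup
> 	 * observes the wrprotect when it re-checks *ptep.
> 	 */
> 	smp_mb();
> 	if (page_maybe_dma_pinned(page) && READ_ONCE(src_mm->has_pinned)) {
> 		/* undo the wrprotect and do the early copy instead */
> 	}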
> 
> Another thing is, do we need a similar thing for e.g. gup_pte_range(), so as
> to guarantee the ordering of try_grab_compound_head() against the pte change
> check?
> 
> > 
> > 	  if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> > 	       put_page();
> > 	       ptep_clear_wrprotect()
> > 
> > 	       // do copy
> > 	       return
> > 	  }
> >       } else {
> > 	  get_page();
> >       }
> >       page_dup_rmap()
> >  pte_unmap_lock()
> > 
> > Then the do_wp_page() path would have to detect that the page is not
> > write protected under the pte lock inside the fault handler and just
> > do nothing.
> 
> Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
> this already, as below:
> 
> 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> 	spin_lock(vmf->ptl);
>         ...
> 	if (vmf->flags & FAULT_FLAG_WRITE) {
> 		if (!pte_write(entry))
> 			return do_wp_page(vmf);
> 		entry = pte_mkdirty(entry);
> 	}
> 
> So when spin_lock() returns:
> 
>   - When it's a real cow (not a pinned page: we write-protected it and it
>     stays write-protected), we should do the cow here as usual.
> 
>   - When it's a fake cow (a pinned page), the write bit should have been
>     restored before the page table lock is released, so we'll skip
>     do_wp_page() and retry the page fault immediately.
> 
> > Ie the set/clear could be visible to the CPU and trigger a
> > spurious fault, but never trigger a COW.
> > 
> > Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.
> 
> Another question is, how about read fast-gup for pinning?  We can't use the
> write-protect mechanism to block a read gup.  I remember we've discussed
> similar things and iirc your point was "pinned pages should always be pinned
> with WRITE".  However I still doubt it...  I feel like read gup is still
> legitimate (as I mentioned previously - when the device purely writes to the
> page and the processor only reads from it).
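> 
> Such a user would do something like (illustrative only):
> 
> 	/* No FOLL_WRITE: the CPU side only ever reads this buffer */
> 	pinned = pin_user_pages_fast(addr, nr_pages, FOLL_LONGTERM, pages);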
> 
> > 
> > Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> > 
> > Consider the above algorithm beside the gup_fast() algorithm:
> > 
> > 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> > 			goto pte_unmap;
> >                 [..]
> > 		head = try_grab_compound_head(page, 1, flags);
> > 		if (!head)
> > 			goto pte_unmap;
> > 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > 			put_compound_head(head, 1, flags);
> > 			goto pte_unmap;
> > 
> > That last *ptep recheck verifies that the WP bit was not set after
> > page_maybe_dma_pinned() became true.
> > 
> > It still looks reasonable: the extra work is just the additional atomic in
> > page_maybe_dma_pinned(), but everything else has to be very carefully
> > sequenced due to the unlocked page table accessors.
> 
> Tricky!  I'm still thinking about some easier way, but with no real clue so
> far.  Hopefully we'll figure out something solid soon.

Hmm, how about something like below?  Would this be acceptable?

------8<--------
diff --git a/mm/gup.c b/mm/gup.c
index 2d9019bf1773..698bc2b520ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2136,6 +2136,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
        struct dev_pagemap *pgmap = NULL;
        int nr_start = *nr, ret = 0;
        pte_t *ptep, *ptem;
+       spinlock_t *ptl = NULL;
+
+       /*
+        * Be more strict with FOLL_PIN, otherwise it could race with fork():
+        * the page table lock guarantees that fork() will capture all the
+        * pinned pages during dup_mm() and do a proper early copy on them.
+        */
+       if (flags & FOLL_PIN) {
+               ptl = pte_lockptr(mm, pmd);
+               if (!spin_trylock(ptl))
+                       return 0;
+       }
 
        ptem = ptep = pte_offset_map(&pmd, addr);
        do {
@@ -2200,6 +2212,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
        ret = 1;
 
 pte_unmap:
+       if (ptl)
+               spin_unlock(ptl);
        if (pgmap)
                put_dev_pagemap(pgmap);
        pte_unmap(ptem);
------8<--------
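
To justify it (my own reasoning, mirroring your race diagram above): with the
pte lock held across gup_pte_range(), the race collapses into one of two
benign orders:

   pin_user_pages_fast()                   fork()
    spin_lock(ptl)
    atomic_add(page->_refcount)
    recheck pte_val(*ptep)    // unchanged, pin succeeds
    spin_unlock(ptl)
                                           spin_lock(ptl)
                                           page_maybe_dma_pinned()  // true
                                           // do early copy
                                           spin_unlock(ptl)

or the mirror image, where fork() write-protects and copies first, so the
*ptep recheck in fast-gup fails and the pin backs off.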

Both of the solutions would fail some fast-gups that might have succeeded in
the past.  The latter solution might fail even more (because the pmd lock is
definitely coarser than a single pte wrprotect), however afaict it's still a
very, very rare corner case since it needs fast-gup+FOLL_PIN+lockfail (and
not to mention fast-gup should be allowed to fail anyway).

To confirm it can fail, I also checked that we have only one caller of
pin_user_pages_fast_only(), which is i915_gem_userptr_get_pages().  It looks
like:

	if (mm == current->mm) {
		pvec = kvmalloc_array(num_pages, sizeof(struct page *),
				      GFP_KERNEL |
				      __GFP_NORETRY |
				      __GFP_NOWARN);
		if (pvec) {
			/* defer to worker if malloc fails */
			if (!i915_gem_object_is_readonly(obj))
				gup_flags |= FOLL_WRITE;
			pinned = pin_user_pages_fast_only(obj->userptr.ptr,
							  num_pages, gup_flags,
							  pvec);
		}
	}

So it looks like it can fall back to a slow path even when it's purely
unlucky.  So either solution above looks safe so far.
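
(For reference, the general pattern I'd expect - a sketch only, not the i915
code - is for callers to fall back to the slow pin when the fast-only variant
fails:)

	pinned = pin_user_pages_fast_only(start, nr_pages, gup_flags, pages);
	if (pinned != nr_pages) {
		/* Fast path failed (e.g. the trylock above); drop any
		 * partial pins and retry from a context that can sleep
		 * taking mmap_lock. */
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		pinned = pin_user_pages(start, nr_pages, gup_flags,
					pages, NULL);
	}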

-- 
Peter Xu

