All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@suse.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>, Christoph Hellwig <hch@lst.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jann Horn <jannh@google.com>
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
Date: Tue, 22 Sep 2020 20:27:35 -0400	[thread overview]
Message-ID: <20200923002735.GN19098@xz-x1> (raw)
In-Reply-To: <20200922191116.GK8409@ziepe.ca>

On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f3521be80ca..6591f3f33299 100644
> > +++ b/mm/memory.c
> > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >                  * Because we'll need to release the locks before doing cow,
> >                  * pass this work to upper layer.
> >                  */
> > -               if (READ_ONCE(src_mm->has_pinned) && wp &&
> > -                   page_maybe_dma_pinned(page)) {
> > +               if (wp && page_maybe_dma_pinned(page) &&
> > +                   READ_ONCE(src_mm->has_pinned)) {
> >                         /* We've got the page already; we're safe */
> >                         data->cow_old_page = page;
> >                         data->cow_oldpte = *src_pte;
> > 
> > I can also add some more comment to emphasize this.
> 
> It is not just that, but the ptep_set_wrprotect() has to be done
> earlier.

Now I understand your point, I think..  So I guess it's not only about
has_pinned, but it should be a race between the fast-gup and the fork() code,
even if has_pinned is always set.

> 
> Otherwise it races like:
> 
>    pin_user_pages_fast()                   fork()
>     atomic_set(has_pinned, 1);
>     [..]
>                                            atomic_read(page->_refcount) //false
>                                            // skipped atomic_read(has_pinned)
>     atomic_add(page->_refcount)
>     ordered check write protect()
>                                            ordered set write protect()
> 
> And now have a write protect on a DMA pinned page, which is the
> invarient we are trying to create.
> 
> The best algorithm I've thought of is something like:
> 
>  pte_map_lock()
>   if (page) {
>       if (wp) {
> 	  ptep_set_wrprotect()
> 	  /* Order with try_grab_compound_head(), either we see
> 	   * page_maybe_dma_pinned(), or they see the wrprotect */
> 	  get_page();

Is this get_page() a must to be after ptep_set_wrprotect() explicitly?  IIUC
what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
E.g., would a "mb()" work?

Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
to guarantee ordering of try_grab_compound_head() and the pte change check?

> 
> 	  if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> 	       put_page();
> 	       ptep_clear_wrprotect()
> 
> 	       // do copy
> 	       return
> 	  }
>       } else {
> 	  get_page();
>       }
>       page_dup_rmap()
>  pte_unmap_lock()
> 
> Then the do_wp_page() path would have to detect that the page is not
> write protected under the pte lock inside the fault handler and just
> do nothing.

Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
this already, as below:

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
        ...
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}

So when spin_lock() returns:

  - When it's a real cow (not pinned pages; we write-protected it and it keeps
    write-protected), we should do cow here as usual.

  - When it's a fake cow (pinned pages), the write bit should have been
    recovered before the page table lock released, and we'll skip do_wp_page()
    and retry the page fault immediately.

> Ie the set/clear could be visible to the CPU and trigger a
> spurious fault, but never trigger a COW.
> 
> Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.

Another question is, how about read fast-gup for pinning?  Because we can't use
the write-protect mechanism to block a read gup.  I remember we've discussed
similar things and iirc your point is "pinned pages should always be with
WRITE".  However now I still doubt it...  Because I feel like read gup is still
legal (as I mentioned previously - when device purely writes to the page and
the processor only reads from it).

> 
> Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> 
> Consider the above algorithm beside the gup_fast() algorithm:
> 
> 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> 			goto pte_unmap;
>                 [..]
> 		head = try_grab_compound_head(page, 1, flags);
> 		if (!head)
> 			goto pte_unmap;
> 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> 			put_compound_head(head, 1, flags);
> 			goto pte_unmap;
> 
> That last *ptep will check that the WP is not set after making
> page_maybe_dma_pinned() true.
> 
> It still looks reasonable, the extra work is still just the additional
> atomic in page_maybe_dma_pinned(), just everything else has to be very
> carefully sequenced due to unlocked page table accessors.

Tricky!  I'm still thinking about some easier way but no much clue so far.
Hopefully we'll figure out something solid soon.

Thanks,

-- 
Peter Xu


  reply	other threads:[~2020-09-23  0:27 UTC|newest]

Thread overview: 133+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-21 21:17 [PATCH 0/5] mm: Break COW for pinned pages during fork() Peter Xu
2020-09-21 21:17 ` [PATCH 1/5] mm: Introduce mm_struct.has_pinned Peter Xu
2020-09-21 21:43   ` Jann Horn
2020-09-21 21:43     ` Jann Horn
2020-09-21 22:30     ` Peter Xu
2020-09-21 22:47       ` Jann Horn
2020-09-21 22:47         ` Jann Horn
2020-09-22 11:54         ` Jason Gunthorpe
2020-09-22 14:28           ` Peter Xu
2020-09-22 15:56             ` Jason Gunthorpe
2020-09-22 16:25               ` Linus Torvalds
2020-09-22 16:25                 ` Linus Torvalds
2020-09-21 23:53   ` John Hubbard
2020-09-22  0:01     ` John Hubbard
2020-09-22 15:17     ` Peter Xu
2020-09-22 16:10       ` Jason Gunthorpe
2020-09-22 17:54         ` Peter Xu
2020-09-22 19:11           ` Jason Gunthorpe
2020-09-23  0:27             ` Peter Xu [this message]
2020-09-23 13:10               ` Peter Xu
2020-09-23 14:20                 ` Jan Kara
2020-09-23 17:12                   ` Jason Gunthorpe
2020-09-24  7:44                     ` Jan Kara
2020-09-24 14:02                       ` Jason Gunthorpe
2020-09-24 14:45                         ` Jan Kara
2020-09-23 17:07               ` Jason Gunthorpe
2020-09-24 14:35                 ` Peter Xu
2020-09-24 16:51                   ` Jason Gunthorpe
2020-09-24 17:55                     ` Peter Xu
2020-09-24 18:15                       ` Jason Gunthorpe
2020-09-24 18:34                         ` Peter Xu
2020-09-24 18:39                           ` Jason Gunthorpe
2020-09-24 21:30                             ` Peter Xu
2020-09-25 19:56                               ` Linus Torvalds
2020-09-25 19:56                                 ` Linus Torvalds
2020-09-25 21:06                                 ` Linus Torvalds
2020-09-25 21:06                                   ` Linus Torvalds
2020-09-26  0:41                                   ` Jason Gunthorpe
2020-09-26  1:15                                     ` Linus Torvalds
2020-09-26  1:15                                       ` Linus Torvalds
2020-09-26 22:28                                       ` Linus Torvalds
2020-09-26 22:28                                         ` Linus Torvalds
2020-09-27  6:23                                         ` Leon Romanovsky
2020-09-27 18:16                                           ` Linus Torvalds
2020-09-27 18:16                                             ` Linus Torvalds
2020-09-27 18:45                                             ` Linus Torvalds
2020-09-27 18:45                                               ` Linus Torvalds
2020-09-28 12:49                                               ` Jason Gunthorpe
2020-09-28 16:17                                                 ` Linus Torvalds
2020-09-28 16:17                                                   ` Linus Torvalds
2020-09-28 17:22                                                   ` Peter Xu
2020-09-28 17:54                                                     ` Linus Torvalds
2020-09-28 17:54                                                       ` Linus Torvalds
2020-09-28 18:39                                                       ` Jason Gunthorpe
2020-09-28 19:29                                                         ` Linus Torvalds
2020-09-28 19:29                                                           ` Linus Torvalds
2020-09-28 23:57                                                           ` Jason Gunthorpe
2020-09-29  0:18                                                             ` John Hubbard
2020-09-28 19:36                                                         ` Linus Torvalds
2020-09-28 19:36                                                           ` Linus Torvalds
2020-09-28 19:50                                                           ` Linus Torvalds
2020-09-28 19:50                                                             ` Linus Torvalds
2020-09-28 22:51                                                             ` Jason Gunthorpe
2020-09-29  0:30                                                               ` Peter Xu
2020-10-08  5:49                                                             ` Leon Romanovsky
2020-09-28 17:13                                             ` Peter Xu
2020-09-25 21:13                                 ` Peter Xu
2020-09-25 22:08                                   ` Linus Torvalds
2020-09-25 22:08                                     ` Linus Torvalds
2020-09-22 18:02       ` John Hubbard
2020-09-22 18:15         ` Peter Xu
2020-09-22 19:11       ` John Hubbard
2020-09-27  0:41   ` [mm] 698ac7610f: will-it-scale.per_thread_ops 8.2% improvement kernel test robot
2020-09-27  0:41     ` kernel test robot
2020-09-21 21:17 ` [PATCH 2/5] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
2020-09-21 21:17 ` [PATCH 3/5] mm: Rework return value for copy_one_pte() Peter Xu
2020-09-22  7:11   ` John Hubbard
2020-09-22 15:29     ` Peter Xu
2020-09-22 10:08   ` Oleg Nesterov
2020-09-22 10:18     ` Oleg Nesterov
2020-09-22 15:36       ` Peter Xu
2020-09-22 15:48         ` Oleg Nesterov
2020-09-22 16:03           ` Peter Xu
2020-09-22 16:53             ` Oleg Nesterov
2020-09-22 18:13               ` Peter Xu
2020-09-22 18:23                 ` Oleg Nesterov
2020-09-22 18:49                   ` Peter Xu
2020-09-23  6:52                     ` Oleg Nesterov
2020-09-23 17:16   ` Linus Torvalds
2020-09-23 17:16     ` Linus Torvalds
2020-09-23 21:24     ` Linus Torvalds
2020-09-23 21:24       ` Linus Torvalds
2020-09-21 21:20 ` [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
2020-09-21 21:55   ` Jann Horn
2020-09-21 21:55     ` Jann Horn
2020-09-21 22:18     ` John Hubbard
2020-09-21 22:27       ` Jann Horn
2020-09-21 22:27         ` Jann Horn
2020-09-22  0:08         ` John Hubbard
2020-09-21 22:27     ` Peter Xu
2020-09-22 11:48   ` Oleg Nesterov
2020-09-22 12:40     ` Oleg Nesterov
2020-09-22 15:58       ` Peter Xu
2020-09-22 16:52         ` Oleg Nesterov
2020-09-22 18:34           ` Peter Xu
2020-09-22 18:44             ` Oleg Nesterov
2020-09-23  1:03               ` Peter Xu
2020-09-23 20:25                 ` Linus Torvalds
2020-09-23 20:25                   ` Linus Torvalds
2020-09-24 15:08                   ` Peter Xu
2020-09-24 11:48   ` Kirill Tkhai
2020-09-24 15:16     ` Peter Xu
2020-09-21 21:20 ` [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
2020-09-22  6:41   ` John Hubbard
2020-09-22 10:33     ` Jan Kara
2020-09-22 20:01       ` John Hubbard
2020-09-23  9:22         ` Jan Kara
2020-09-23 13:50           ` Peter Xu
2020-09-23 14:01             ` Jan Kara
2020-09-23 15:44               ` Peter Xu
2020-09-23 20:19                 ` John Hubbard
2020-09-24 18:49                   ` Peter Xu
2020-09-23 16:06     ` Peter Xu
2020-09-22 12:05   ` Jason Gunthorpe
2020-09-23 15:24     ` Peter Xu
2020-09-23 16:07       ` Yang Shi
2020-09-23 16:07         ` Yang Shi
2020-09-24 15:47         ` Peter Xu
2020-09-24 17:29           ` Yang Shi
2020-09-24 17:29             ` Yang Shi
2020-09-23 17:17       ` Jason Gunthorpe
2020-09-23 10:21 ` [PATCH 0/5] mm: Break COW for pinned pages during fork() Leon Romanovsky
2020-09-23 15:37   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200923002735.GN19098@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.