All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@suse.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>, Christoph Hellwig <hch@lst.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jann Horn <jannh@google.com>
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
Date: Thu, 24 Sep 2020 10:35:17 -0400	[thread overview]
Message-ID: <20200924143517.GD79898@xz-x1> (raw)
In-Reply-To: <20200923170759.GA9916@ziepe.ca>

On Wed, Sep 23, 2020 at 02:07:59PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8f3521be80ca..6591f3f33299 100644
> > > > +++ b/mm/memory.c
> > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >                  * Because we'll need to release the locks before doing cow,
> > > >                  * pass this work to upper layer.
> > > >                  */
> > > > -               if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > > -                   page_maybe_dma_pinned(page)) {
> > > > +               if (wp && page_maybe_dma_pinned(page) &&
> > > > +                   READ_ONCE(src_mm->has_pinned)) {
> > > >                         /* We've got the page already; we're safe */
> > > >                         data->cow_old_page = page;
> > > >                         data->cow_oldpte = *src_pte;
> > > > 
> > > > I can also add some more comment to emphasize this.
> > > 
> > > It is not just that, but the ptep_set_wrprotect() has to be done
> > > earlier.
> > 
> > Now I understand your point, I think..  So I guess it's not only about
> > has_pinned, but it should be a race between the fast-gup and the fork() code,
> > even if has_pinned is always set.
> 
> Yes
> 
> > > The best algorithm I've thought of is something like:
> > > 
> > >  pte_map_lock()
> > >   if (page) {
> > >       if (wp) {
> > > 	  ptep_set_wrprotect()
> > > 	  /* Order with try_grab_compound_head(), either we see
> > > 	   * page_maybe_dma_pinned(), or they see the wrprotect */
> > > 	  get_page();
> > 
> > Is this get_page() a must to be after ptep_set_wrprotect()
> > explicitly?  
> 
> No, just before page_maybe_dma_pinned()
> 
> > IIUC what we need is to order ptep_set_wrprotect() and
> > page_maybe_dma_pinned() here.  E.g., would a "mb()" work?
> 
> mb() is not needed because page_maybe_dma_pinned() has an atomic
> barrier too. I like to see get_page() followed immediately by
> page_maybe_dma_pinned() since they are accessing the same atomic and
> could be fused together someday

If so, I'd hope you won't disagree that I still move the get_page() out of the
"if (wp)".  Not only it's a shared operation no matter whether "if (wp)" or
not, but I'm afraid it would confuse future readers on a special ordering on
the get_page() and the wrprotect(), especially with the comment above.

> 
> > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> > to guarantee ordering of try_grab_compound_head() and the pte change
> > check?
> 
> gup_pte_range() is as I quoted? The gup slow path ends up in
> follow_page_pte() which uses the pte lock so is OK.
> > 
> > Another question is, how about read fast-gup for pinning?  Because we can't use
> > the write-protect mechanism to block a read gup.  I remember we've discussed
> > similar things and iirc your point is "pinned pages should always be with
> > WRITE".  However now I still doubt it...  Because I feel like read gup is still
> > legal (as I mentioned previously - when device purely writes to the page and
> > the processor only reads from it).
> 
> We need a definition for what FOLL_PIN means. After this work on fork
> I propose that FOLL_PIN means:
> 
>   The page is in-use for DMA and the CPU PTE should not be changed
>   without explicit involvement of the application (eg via mmap/munmap)
> 
> If GUP encounters a read-only page during FOLL_PIN the behavior should
> depend on what the fault handler would do. If the fault handler would
> trigger COW and replace the PTE then it violates the above. GUP should
> do the COW before pinning.
> 
> If the fault handler would SIGSEGV then GUP can keep the read-only
> page and allow !FOLL_WRITE access. The PTE should not be replaced for
> other reasons (though I think there is work there too).
> 
> For COW related issues the idea is the mm_struct doing the pin will
> never trigger a COW. When other processes hit the COW they copy the
> page into their mm and don't touch the source MM's PTE.
> 
> Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users,
> but a more nuanced version and documentation would be much clearer.
> 
> Unfortunately just doing simple read GUP potentially exposes things to
> various COW related data corruption races.
> 
> This is a discussion beyond this series though..

Yes.  It's kind of related here on whether we can still use wrprotect() to
guard against fast-gup, though.  So my understanding is that we should still at
least need the other patch [1] that I proposed in the other thread to force
break-cow for read-only gups (that patch is not only for fast-gup, of course).

But I agree that should be another bigger topic.  I hope we don't need to pick
that patch up someday by another dma report on read-only pinned pages...

Regarding the solution here, I think we can also cover read-only fast-gup too
in the future - IIUC what we need to do is to make it pte_protnone() instead of
pte_wrprotect(), then in the fault handler we should identify this special
pte_protnone() against numa balancing (change_prot_numa()).  I think it should
work fine too, iiuc, because I don't think we should migrate a page at all if
it's pinned for any reason...

So I think I'll focus on the wrprotect() solution for now.  Thanks!

[1] https://lore.kernel.org/lkml/20200915151746.GB2949@xz-x1/

-- 
Peter Xu


  reply	other threads:[~2020-09-24 14:35 UTC|newest]

Thread overview: 133+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-21 21:17 [PATCH 0/5] mm: Break COW for pinned pages during fork() Peter Xu
2020-09-21 21:17 ` [PATCH 1/5] mm: Introduce mm_struct.has_pinned Peter Xu
2020-09-21 21:43   ` Jann Horn
2020-09-21 21:43     ` Jann Horn
2020-09-21 22:30     ` Peter Xu
2020-09-21 22:47       ` Jann Horn
2020-09-21 22:47         ` Jann Horn
2020-09-22 11:54         ` Jason Gunthorpe
2020-09-22 14:28           ` Peter Xu
2020-09-22 15:56             ` Jason Gunthorpe
2020-09-22 16:25               ` Linus Torvalds
2020-09-22 16:25                 ` Linus Torvalds
2020-09-21 23:53   ` John Hubbard
2020-09-22  0:01     ` John Hubbard
2020-09-22 15:17     ` Peter Xu
2020-09-22 16:10       ` Jason Gunthorpe
2020-09-22 17:54         ` Peter Xu
2020-09-22 19:11           ` Jason Gunthorpe
2020-09-23  0:27             ` Peter Xu
2020-09-23 13:10               ` Peter Xu
2020-09-23 14:20                 ` Jan Kara
2020-09-23 17:12                   ` Jason Gunthorpe
2020-09-24  7:44                     ` Jan Kara
2020-09-24 14:02                       ` Jason Gunthorpe
2020-09-24 14:45                         ` Jan Kara
2020-09-23 17:07               ` Jason Gunthorpe
2020-09-24 14:35                 ` Peter Xu [this message]
2020-09-24 16:51                   ` Jason Gunthorpe
2020-09-24 17:55                     ` Peter Xu
2020-09-24 18:15                       ` Jason Gunthorpe
2020-09-24 18:34                         ` Peter Xu
2020-09-24 18:39                           ` Jason Gunthorpe
2020-09-24 21:30                             ` Peter Xu
2020-09-25 19:56                               ` Linus Torvalds
2020-09-25 19:56                                 ` Linus Torvalds
2020-09-25 21:06                                 ` Linus Torvalds
2020-09-25 21:06                                   ` Linus Torvalds
2020-09-26  0:41                                   ` Jason Gunthorpe
2020-09-26  1:15                                     ` Linus Torvalds
2020-09-26  1:15                                       ` Linus Torvalds
2020-09-26 22:28                                       ` Linus Torvalds
2020-09-26 22:28                                         ` Linus Torvalds
2020-09-27  6:23                                         ` Leon Romanovsky
2020-09-27 18:16                                           ` Linus Torvalds
2020-09-27 18:16                                             ` Linus Torvalds
2020-09-27 18:45                                             ` Linus Torvalds
2020-09-27 18:45                                               ` Linus Torvalds
2020-09-28 12:49                                               ` Jason Gunthorpe
2020-09-28 16:17                                                 ` Linus Torvalds
2020-09-28 16:17                                                   ` Linus Torvalds
2020-09-28 17:22                                                   ` Peter Xu
2020-09-28 17:54                                                     ` Linus Torvalds
2020-09-28 17:54                                                       ` Linus Torvalds
2020-09-28 18:39                                                       ` Jason Gunthorpe
2020-09-28 19:29                                                         ` Linus Torvalds
2020-09-28 19:29                                                           ` Linus Torvalds
2020-09-28 23:57                                                           ` Jason Gunthorpe
2020-09-29  0:18                                                             ` John Hubbard
2020-09-28 19:36                                                         ` Linus Torvalds
2020-09-28 19:36                                                           ` Linus Torvalds
2020-09-28 19:50                                                           ` Linus Torvalds
2020-09-28 19:50                                                             ` Linus Torvalds
2020-09-28 22:51                                                             ` Jason Gunthorpe
2020-09-29  0:30                                                               ` Peter Xu
2020-10-08  5:49                                                             ` Leon Romanovsky
2020-09-28 17:13                                             ` Peter Xu
2020-09-25 21:13                                 ` Peter Xu
2020-09-25 22:08                                   ` Linus Torvalds
2020-09-25 22:08                                     ` Linus Torvalds
2020-09-22 18:02       ` John Hubbard
2020-09-22 18:15         ` Peter Xu
2020-09-22 19:11       ` John Hubbard
2020-09-27  0:41   ` [mm] 698ac7610f: will-it-scale.per_thread_ops 8.2% improvement kernel test robot
2020-09-27  0:41     ` kernel test robot
2020-09-21 21:17 ` [PATCH 2/5] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
2020-09-21 21:17 ` [PATCH 3/5] mm: Rework return value for copy_one_pte() Peter Xu
2020-09-22  7:11   ` John Hubbard
2020-09-22 15:29     ` Peter Xu
2020-09-22 10:08   ` Oleg Nesterov
2020-09-22 10:18     ` Oleg Nesterov
2020-09-22 15:36       ` Peter Xu
2020-09-22 15:48         ` Oleg Nesterov
2020-09-22 16:03           ` Peter Xu
2020-09-22 16:53             ` Oleg Nesterov
2020-09-22 18:13               ` Peter Xu
2020-09-22 18:23                 ` Oleg Nesterov
2020-09-22 18:49                   ` Peter Xu
2020-09-23  6:52                     ` Oleg Nesterov
2020-09-23 17:16   ` Linus Torvalds
2020-09-23 17:16     ` Linus Torvalds
2020-09-23 21:24     ` Linus Torvalds
2020-09-23 21:24       ` Linus Torvalds
2020-09-21 21:20 ` [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
2020-09-21 21:55   ` Jann Horn
2020-09-21 21:55     ` Jann Horn
2020-09-21 22:18     ` John Hubbard
2020-09-21 22:27       ` Jann Horn
2020-09-21 22:27         ` Jann Horn
2020-09-22  0:08         ` John Hubbard
2020-09-21 22:27     ` Peter Xu
2020-09-22 11:48   ` Oleg Nesterov
2020-09-22 12:40     ` Oleg Nesterov
2020-09-22 15:58       ` Peter Xu
2020-09-22 16:52         ` Oleg Nesterov
2020-09-22 18:34           ` Peter Xu
2020-09-22 18:44             ` Oleg Nesterov
2020-09-23  1:03               ` Peter Xu
2020-09-23 20:25                 ` Linus Torvalds
2020-09-23 20:25                   ` Linus Torvalds
2020-09-24 15:08                   ` Peter Xu
2020-09-24 11:48   ` Kirill Tkhai
2020-09-24 15:16     ` Peter Xu
2020-09-21 21:20 ` [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
2020-09-22  6:41   ` John Hubbard
2020-09-22 10:33     ` Jan Kara
2020-09-22 20:01       ` John Hubbard
2020-09-23  9:22         ` Jan Kara
2020-09-23 13:50           ` Peter Xu
2020-09-23 14:01             ` Jan Kara
2020-09-23 15:44               ` Peter Xu
2020-09-23 20:19                 ` John Hubbard
2020-09-24 18:49                   ` Peter Xu
2020-09-23 16:06     ` Peter Xu
2020-09-22 12:05   ` Jason Gunthorpe
2020-09-23 15:24     ` Peter Xu
2020-09-23 16:07       ` Yang Shi
2020-09-23 16:07         ` Yang Shi
2020-09-24 15:47         ` Peter Xu
2020-09-24 17:29           ` Yang Shi
2020-09-24 17:29             ` Yang Shi
2020-09-23 17:17       ` Jason Gunthorpe
2020-09-23 10:21 ` [PATCH 0/5] mm: Break COW for pinned pages during fork() Leon Romanovsky
2020-09-23 15:37   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200924143517.GD79898@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.