All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Maya B . Gokhale" <gokhale2@llnl.gov>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Marty Mcfadden <mcfadden8@llnl.gov>,
	Kirill Shutemov <kirill@shutemov.name>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Jan Kara <jack@suse.cz>, Kirill Tkhai <ktkhai@virtuozzo.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
Date: Tue, 15 Sep 2020 17:33:30 -0400	[thread overview]
Message-ID: <20200915213330.GE2949@xz-x1> (raw)
In-Reply-To: <20200915193838.GN1221970@ziepe.ca>

On Tue, Sep 15, 2020 at 04:38:38PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 03:13:46PM -0400, Peter Xu wrote:
> > On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > > > doesn't establish a COW that would break the DMA by moving the
> > > > > > physical page over to the fork. DMA should stay with the process that
> > > > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > > > > 
> > > > > ... Now I'm more confused on what has happened.
> > > > 
> > > > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > > > done by userspace properly, more later.
> > > 
> > > It turns out the test is broken and does not call MADV_DONTFORK when
> > > doing forks - it is an opt-in it didn't do.
> > > 
> > > It looks to me like this patch makes it much more likely that the COW
> > > break after page pinning will end up moving the pinned physical page
> > > to the fork while before it was not very common. Does that make sense?
> > 
> > My understanding is that the fix should not matter much with current failing
> > test case, as long as it's with FOLL_FORCE & FOLL_WRITE.  However what I'm not
> > sure is what if the RDMA/DMA buffers are designed for pure read from userspace.
> 
> No, they are write. Always FOLL_WRITE.
> 
> > E.g. for vfio I'm looking at vaddr_get_pfn() where I believe such pure read
> > buffers will be a GUP with FOLL_PIN and !FOLL_WRITE which will finally pass to
> > pin_user_pages_remote().  So what I'm worrying is something like this:
> 
> I think the !(prot & IOMMU_WRITE) case is probably very rare for
> VFIO. I'm also not sure it will work reliably, in RDMA we had this as
> a more common case and long ago found bugs. The COW had to be broken
> for the pin anyhow.

If I'm not wrong.. QEMU/KVM (assuming there's one vIOMMU in the guest) will try
to do VFIO maps in this read-only way if the IOVA mapped in the guest points to
read only buffers (say, allocated with PCI_DMA_FROMDEVICE).

> 
> >   1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.
> > 
> >   2. Proc A fork()s and gives birth to proc B, page X will now have
> >      mapcount==refcount==2, write-protected.  proc B quits.  Page X goes back
> >      to mapcount==refcount==1 (note! without WRITE bits set in the PTE).
> 
> >   3. pin_user_pages(write=false) for page X.  Since it's with !FORCE & !WRITE,
> >      no COW needed.  Refcount==2 after that.
> > 
> >   4. Pass these pages to device.  We either setup IOMMU page table or just use
> >      the PFNs, which is not important imho - the most important thing is the
> >      device will DMA into page X no matter what.
> > 
> >   5. Some thread of proc A writes to page X, trigger COW since it's
> >      write-protected with mapcount==1 && refcount==2.  The HVA that pointing to
> >      page X will be changed to point to another page Y after the COW.
> > 
> >   6. Device DMA happens, data resides on X.  Proc A can never get the data,
> >      though, because it's looking at page Y now.
> 
> RDMA doesn't ever use !WRITE
> 
> I'm guessing #5 is the issue, just with a different ordering. If the
> #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?

Right, but only if without MADV_DONTFORK?  When without MADV_DONTFORK I'll
probably still see that as an userspace bug instead of a kernel one when the
userspace decided to fork() after step #3.

> 
> > If this is a problem, we may still need the fix patch (maybe not as urgent as
> > before at least).  But I'd like to double confirm, just in case I miss some
> > obvious facts above.
> 
> I'm worred that the sudden need to have MAD_DONTFORK is going to be a
> turn into a huge regression. It already blew up our first level of
> synthetic test cases. I'm worried what we will see when the
> application suite is run in a few months :\

For my own preference I'll consider changing kernel behavior if the impact is
still under control (the performance report of 30%+ boost is also attractive
after the simplify-cow patch).  The other way is to maintain the old reuse
logic forever, that'll be another kind of burden.  Seems no easy way on either
side...

> 
> > > Given that the tests are wrong it seems like broken userspace,
> > > however, it also worked reliably for a fairly long time.
> > 
> > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> > previously just as-is even after the fork without MADV_DONTFORK and after the
> > child quits.
> 
> That would match the results we see.. So this patch changes things so
> it is not re-used as-is, but replaced with Y?

Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3.  Then
with MADV_DONTFORK, reuse should not happen any more.

Thanks,

-- 
Peter Xu


  reply	other threads:[~2020-09-15 21:38 UTC|newest]

Thread overview: 114+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-21 23:49 [PATCH 0/4] mm: Simplfy cow handling Peter Xu
2020-08-21 23:49 ` [PATCH 1/4] mm: Trial do_wp_page() simplification Peter Xu
2020-08-24  8:36   ` Kirill Tkhai
2020-08-24 14:30     ` Jan Kara
2020-08-24 15:37       ` Kirill Tkhai
2020-08-24 18:22         ` Linus Torvalds
2020-08-24 18:22           ` Linus Torvalds
2020-09-01  7:01           ` Hugh Dickins
2020-09-01  7:01             ` Hugh Dickins
2020-09-14 14:38   ` Jason Gunthorpe
2020-09-14 17:32     ` Linus Torvalds
2020-09-14 17:32       ` Linus Torvalds
2020-09-14 18:34       ` Peter Xu
2020-09-14 21:15         ` Peter Xu
2020-09-14 22:55           ` Jason Gunthorpe
2020-09-14 22:59             ` Linus Torvalds
2020-09-14 22:59               ` Linus Torvalds
2020-09-14 23:28               ` Jason Gunthorpe
2020-09-15  0:19                 ` Linus Torvalds
2020-09-15  0:19                   ` Linus Torvalds
2020-09-15 14:50                 ` Peter Xu
2020-09-15 15:17                   ` Peter Xu
2020-09-15 16:05                   ` Jason Gunthorpe
2020-09-15 18:29                     ` Jason Gunthorpe
2020-09-15 19:13                       ` Peter Xu
2020-09-15 19:38                         ` Jason Gunthorpe
2020-09-15 21:33                           ` Peter Xu [this message]
2020-09-15 23:22                             ` Jason Gunthorpe
2020-09-16  1:50                               ` John Hubbard
2020-09-16 17:48                                 ` Jason Gunthorpe
2020-09-16 18:46                                   ` Peter Xu
2020-09-17 11:25                                     ` Jason Gunthorpe
2020-09-17 18:11                                       ` Linus Torvalds
2020-09-17 18:11                                         ` Linus Torvalds
2020-09-17 19:38                                         ` Jason Gunthorpe
2020-09-17 19:51                                           ` Linus Torvalds
2020-09-17 19:51                                             ` Linus Torvalds
2020-09-18 16:40                                             ` Peter Xu
2020-09-18 17:16                                               ` Linus Torvalds
2020-09-18 17:16                                                 ` Linus Torvalds
2020-09-18 19:57                                                 ` Peter Xu
2020-09-18 17:32                                               ` Jason Gunthorpe
2020-09-18 20:40                                                 ` Peter Xu
2020-09-18 20:59                                                   ` Linus Torvalds
2020-09-18 20:59                                                     ` Linus Torvalds
2020-09-19  0:28                                                     ` Jason Gunthorpe
2020-09-18 21:06                                                   ` John Hubbard
2020-09-19  0:01                                                     ` Jason Gunthorpe
2020-09-21  8:35                                                       ` Jan Kara
2020-09-21 12:03                                                         ` Jason Gunthorpe
2022-02-16 16:59                                                           ` Oded Gabbay
2022-02-16 17:24                                                             ` Oded Gabbay
2022-02-16 19:04                                                             ` Linus Torvalds
2022-02-16 19:20                                                               ` Oded Gabbay
2022-02-16 19:24                                                               ` David Hildenbrand
2020-09-21 13:42                                               ` Michal Hocko
2020-09-21 14:18                                                 ` Peter Xu
2020-09-21 14:28                                                   ` Michal Hocko
2020-09-21 14:38                                                     ` Tejun Heo
2020-09-21 14:43                                                       ` Christian Brauner
2020-09-21 14:55                                                         ` Michal Hocko
2020-09-21 15:04                                                           ` Christian Brauner
2020-09-21 16:06                                                             ` Michal Hocko
2020-09-23  7:53                                                               ` Michal Hocko
2020-09-21 14:41                                                 ` Christian Brauner
2020-09-21 14:57                                                   ` Michal Hocko
2020-09-21 16:31                                                     ` Peter Xu
2020-09-17 18:14                                       ` Peter Xu
2020-09-17 18:26                                         ` Linus Torvalds
2020-09-17 18:26                                           ` Linus Torvalds
2020-09-17 19:03                                           ` Peter Xu
2020-09-17 19:42                                             ` Linus Torvalds
2020-09-17 19:42                                               ` Linus Torvalds
2020-09-17 19:55                                               ` John Hubbard
2020-09-17 20:06                                               ` Jason Gunthorpe
2020-09-17 20:19                                                 ` John Hubbard
2020-09-17 20:25                                                   ` Jason Gunthorpe
2020-09-17 20:35                                                 ` Linus Torvalds
2020-09-17 20:35                                                   ` Linus Torvalds
2020-09-17 21:40                                                   ` Peter Xu
2020-09-17 22:09                                                     ` Jason Gunthorpe
2020-09-17 22:25                                                       ` Linus Torvalds
2020-09-17 22:25                                                         ` Linus Torvalds
2020-09-17 22:48                                                       ` Ira Weiny
2020-09-18  9:36                                                         ` Jan Kara
2020-09-18  9:44                                                       ` Jan Kara
2020-09-18 16:19                                             ` Jason Gunthorpe
2020-09-15 10:23           ` Leon Romanovsky
2020-09-15 15:56           ` Jason Gunthorpe
2020-09-15 15:03   ` Oleg Nesterov
2020-09-15 16:18     ` Peter Xu
2020-08-21 23:49 ` [PATCH 2/4] mm/ksm: Remove reuse_ksm_page() Peter Xu
2020-08-21 23:49 ` [PATCH 3/4] mm/gup: Remove enfornced COW mechanism Peter Xu
2020-09-14 14:27   ` Oleg Nesterov
2020-09-14 17:59     ` Peter Xu
2020-09-14 19:03       ` Linus Torvalds
2020-09-14 19:03         ` Linus Torvalds
2020-08-21 23:49 ` [PATCH 4/4] mm: Add PGREUSE counter Peter Xu
2020-08-22 16:14   ` Linus Torvalds
2020-08-22 16:14     ` Linus Torvalds
2020-08-24  0:24     ` Peter Xu
2020-08-22 16:05 ` [PATCH 0/4] mm: Simplfy cow handling Linus Torvalds
2020-08-22 16:05   ` Linus Torvalds
2020-08-23 23:58   ` Peter Xu
2020-08-24  8:38 ` Kirill Tkhai
2020-08-27 14:15 ` Peter Xu
2021-02-02 14:40 [PATCH 1/4] mm: Trial do_wp_page() simplification Gal Pressman
2021-02-02 16:31 ` Peter Xu
2021-02-02 16:44   ` Jason Gunthorpe
2021-02-02 17:05     ` Peter Xu
2021-02-02 17:13       ` Jason Gunthorpe
2021-02-03 12:43         ` Gal Pressman
2021-02-03 14:00           ` Jason Gunthorpe
2021-02-03 14:47             ` Gal Pressman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200915213330.GE2949@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=gokhale2@llnl.gov \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcfadden8@llnl.gov \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.