All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Jann Horn <jannh@google.com>
Cc: Linux-MM <linux-mm@kvack.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Marty Mcfadden <mcfadden8@llnl.gov>,
	"Maya B . Gokhale" <gokhale2@llnl.gov>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
	Kirill Shutemov <kirill@shutemov.name>, Jan Kara <jack@suse.cz>
Subject: Re: [PATCH v3] mm/gup: Allow real explicit breaking of COW
Date: Tue, 11 Aug 2020 17:23:45 -0400	[thread overview]
Message-ID: <20200811212345.GD6353@xz-x1> (raw)
In-Reply-To: <CAG48ez1w5HZxUNPGEsDfLGFCoA-Knpr4jvAAtZMDyEsts6Hyng@mail.gmail.com>

On Tue, Aug 11, 2020 at 10:22:00PM +0200, Jann Horn wrote:
> On Tue, Aug 11, 2020 at 10:03 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, Aug 11, 2020 at 09:07:17PM +0200, Jann Horn wrote:
> > > On Tue, Aug 11, 2020 at 8:39 PM Peter Xu <peterx@redhat.com> wrote:
> > > > Starting from commit 17839856fd58 ("gup: document and work around "COW can
> > > > break either way" issue", 2020-06-02), explicit copy-on-write behavior is
> > > > enforced for private gup pages even if it's a read-only.  It is achieved by
> > > > always passing FOLL_WRITE to emulate a write.
> > > >
> > > > That should fix the COW issue that we were facing, however above commit could
> > > > also break userfaultfd-wp and applications like umapsort [1,2].
> > > >
> > > > One general routine of umap-like program is: userspace library will manage page
> > > > allocations, and it will evict the least recently used pages from memory to
> > > > external storages (e.g., file systems).  Below are the general steps to evict
> > > > an in-memory page in the uffd service thread when the page pool is full:
> > > >
> > > >   (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
> > > >       further writes to page P will block (keep page P clean)
> > > >   (2) Copy page P to external storage (e.g. file system)
> > > >   (3) MADV_DONTNEED to evict page P
> > > >
> > > > Here step (1) makes sure that the page to dump will always be up-to-date, so
> > > > that the page snapshot in the file system is consistent with the one that was
> > > > in the memory.  However with commit 17839856fd58, step (2) can potentially hang
> > > > itself because e.g. if we use write() to a file system fd to dump the page
> > > > data, that will be a translated read gup request in the file system driver to
> > > > read the page content, then the read gup will be translated to a write gup due
> > > > to the new enforced COW behavior.  This write gup will further trigger
> > > > handle_userfault() and hang the uffd service thread itself.
> > > >
> > > > I think the problem will go away too if we replace the write() to the file
> > > > system into a memory write to a mmaped region in the userspace library, because
> > > > normal page faults will not enforce COW, only gup is affected.  However we
> > > > cannot forbid users to use write() or any form of kernel level read gup.
> > > >
> > > > One solution is actually already mentioned in commit 17839856fd58, which is to
> > > > provide an explicit BREAK_COW scemantics for enforced COW.  Then we can still
> > > > use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an
> > > > "enfornced COW (read) request".
> > > >
> > > > With the enforced COW, we also need to inherit UFFD_WP bit during COW because
> > > > now COW can happen with UFFD_WP enabled (previously, it cannot).
> [...]
> > > > @@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > > >                         }
> > > >                         if (is_vm_hugetlb_page(vma)) {
> > > >                                 if (should_force_cow_break(vma, foll_flags))
> > > > -                                       foll_flags |= FOLL_WRITE;
> > > > +                                       foll_flags |= FOLL_BREAK_COW;
> > >
> > > How does this interact with the FOLL_WRITE check in follow_page_pte()?
> > > If we want the COW to be broken, follow_page_pte() would have to treat
> > > FOLL_BREAK_COW similarly to FOLL_WRITE, right?
> >
> > Good point...  I did checked follow_page_mask() that FOLL_COW will still be set
> > correctly after applying the patch, though I forgot the FOLL_WRITE part.
> >
> > Does below look good to you?
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 9d1f44b01165..f4f2a69c6fe7 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -439,7 +439,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> >         }
> >         if ((flags & FOLL_NUMA) && pte_protnone(pte))
> >                 goto no_page;
> > -       if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
> > +       if ((flags & (FOLL_WRITE | FOLL_BREAK_COW)) &&
> > +           !can_follow_write_pte(pte, flags)) {
> >                 pte_unmap_unlock(ptep, ptl);
> >                 return NULL;
> >         }
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 4f192efef37c..edbd42c9d576 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1340,7 +1340,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> >
> >         assert_spin_locked(pmd_lockptr(mm, pmd));
> >
> > -       if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
> > +       if (flags & (FOLL_WRITE | FOLL_BREAK_COW) &&
> > +           !can_follow_write_pmd(*pmd, flags))
> >                 goto out;
> >
> >         /* Avoid dumping huge zero page */
> 
> Well, I don't see anything immediately wrong with it, at least. Not
> that that means much...
> 
> Although in addition to the normal-page path and the transhuge path,
> you'll probably also have to make the same change in the hugetlb path.
> I guess you may have to grep through all the uses of FOLL_WRITE, as
> Linus suggested, to see if there are any other missing spots.

Right.  Moreover, I feel like the enforced cow is not completely done in
hugetlbfs code even without this patch.  E.g., it lacks the proper return of
VM_FAULT_WRITE in hugetlb_cow(), and also the further convertion logics to
translate that into FOLL_COW (which, iiuc, should probably be done in
follow_hugetlb_page()).

I quickly went over the other FOLL_WRITE references and I didn't see any other
suspicious spots besides hugetlb (I only looked at the places that can be
called by __get_user_pages() though; that's the only place we set
FOLL_BREAK_COW after all).  At the meantime we ignored the fast gups for strict
breaking of cow always, so those ones seem ok too.

Thanks,

-- 
Peter Xu


  reply	other threads:[~2020-08-11 21:23 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-11 18:39 [PATCH v3] mm/gup: Allow real explicit breaking of COW Peter Xu
2020-08-11 19:07 ` Jann Horn
2020-08-11 19:07   ` Jann Horn
2020-08-11 20:02   ` Peter Xu
2020-08-11 20:22     ` Jann Horn
2020-08-11 20:22       ` Jann Horn
2020-08-11 21:23       ` Peter Xu [this message]
2020-08-11 19:24 ` Linus Torvalds
2020-08-11 19:24   ` Linus Torvalds
2020-08-11 20:06   ` Linus Torvalds
2020-08-11 20:06     ` Linus Torvalds
2020-08-11 20:46     ` Linus Torvalds
2020-08-11 20:46       ` Linus Torvalds
2020-08-11 21:42       ` Peter Xu
2020-08-11 23:10         ` Linus Torvalds
2020-08-11 23:10           ` Linus Torvalds
2020-08-20 21:54           ` Peter Xu
2020-08-20 22:01             ` Linus Torvalds
2020-08-20 22:01               ` Linus Torvalds
2020-08-21  2:34               ` Peter Xu
2020-08-21 10:13               ` Jan Kara
2020-08-21 12:27                 ` Linus Torvalds
2020-08-21 12:27                   ` Linus Torvalds
2020-08-21 15:47                   ` Jan Kara
2020-08-21 17:00                     ` Linus Torvalds
2020-08-21 17:00                       ` Linus Torvalds
2020-08-21 18:08                       ` Peter Xu
2020-08-21 18:23                         ` Linus Torvalds
2020-08-21 18:23                           ` Linus Torvalds
2020-08-21 19:05                           ` Linus Torvalds
2020-08-21 19:05                             ` Linus Torvalds
2020-08-21 19:06                             ` Linus Torvalds
2020-08-21 19:06                               ` Linus Torvalds
2020-08-21 19:31                           ` Peter Xu
2020-08-21 19:42                             ` Linus Torvalds
2020-08-21 19:42                               ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200811212345.GD6353@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=gokhale2@llnl.gov \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcfadden8@llnl.gov \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.