From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Date: Mon, 2 May 2016 19:00:42 +0300
Message-ID: <20160502160042.GC24419@node.shutemov.name>
In-Reply-To: <20160502152307.GA12310@redhat.com>
On Mon, May 02, 2016 at 05:23:07PM +0200, Andrea Arcangeli wrote:
> On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > I don't think this would work correctly. Let's check one of the callers:
> >
> > static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >                 unsigned long address, pte_t *page_table, pmd_t *pmd,
> >                 spinlock_t *ptl, pte_t orig_pte)
> >         __releases(ptl)
> > {
> > ...
> >         if (reuse_swap_page(old_page)) {
> >                 /*
> >                  * The page is all ours. Move it to our anon_vma so
> >                  * the rmap code will not search our parent or siblings.
> >                  * Protected against the rmap code by the page lock.
> >                  */
> >                 page_move_anon_rmap(old_page, vma, address);
> >                 unlock_page(old_page);
> >                 return wp_page_reuse(mm, vma, address, page_table, ptl,
> >                                      orig_pte, old_page, 0, 0);
> >         }
> >
> > The first thing to notice is that old_page can be a tail page here,
> > so page_move_anon_rmap() should be able to handle this after your
> > patch, which it doesn't.
>
> Agreed, that's an implementation error and easy to fix.
>
> > But I think there's a bigger problem.
> >
> > Consider the following situation: after split_huge_pmd() we have
> > pte-mapped THP, fork() comes and now the page is shared between two
> > processes. Child process munmap()s one half of the THP page, parent
> > munmap()s the other half.
> >
> > IIUC, after that page_trans_huge_mapcount() would give us 1 as all 4k
> > subpages have mapcount exactly one. Fault in the child would trigger
> > do_wp_page() and reuse_swap_page() returns true, which would lead to
> > page_move_anon_rmap() transferring the whole compound page to child's
> > anon_vma. That's not correct.
> >
> > We should at least avoid page_move_anon_rmap() for compound pages there.
>
> So (compound_head() missing aside) the calculation I was doing is
> correct with regard to taking over the page and marking the pagetable
> read-write instead of triggering a COW and breaking the pinning, but
> it's not right only in terms of calling page_move_anon_rmap? The child
> or parent would then lose visibility on its ptes if the compound page
> is moved to the local vma->anon_vma.
>
> The fix should be just to change page_trans_huge_mapcount() to return
> two refcounts, one "hard" for the pinning, and one "soft" for the rmap
> which will be the same as total_mapcount. The runtime cost will remain
> the same, so a fix can be easy for this one too.

Sounds correct, but code is going to be ugly :-/

> > The other thing I would like to discuss is whether there's a problem on
> > the vfio side. To me it looks like vfio expects a guarantee from
> > get_user_pages() that it doesn't provide: obtaining a pin on the page
> > doesn't guarantee that the page is going to remain mapped into userspace
> > until the pin is gone.
> >
> > Even with the THP COW regression fixed, vfio would stay fragile: any
> > MADV_DONTNEED/fork()/mremap()/whatever would break vfio's expectation.
>
> vfio must run as root, it will take care of not doing such things, it
> just needs a way to prevent the page from being moved so it can DMA into
> it, and mlock is not enough. This clearly has to be caused by a
> get_user_pages(write=0) or by a serialized fork/exec() while a
> longstanding page pin is being held (and to be safe fork/exec had to
> be serialized in a way that the parent process wouldn't write to the
> pinned page until after exec has run in the child, or it's already
> racy no matter what kernel).
>
> I agree it's somewhat fragile, the problem here is that the THP
> refcounting change made it even weaker than it already was.

I didn't say we shouldn't fix the problem on the THP side. But the attitude
"get_user_pages() would magically freeze page tables" worries me.

> Ideally the MMU notifier invalidate should be used instead of pinning
> the page, that would make it 100% robust and it wouldn't even pin the
> page at all.
>
> However we can't send an MMU notifier invalidate to an IOMMU because
> next time the IOMMU non-present physical address is used it would kill
> the app. Some new IOMMUs can raise an exception synchronously that we
> could use to implement an IOMMU secondary MMU page fault to make the
> MMU notifier model work with IOMMUs too, but that's not feasible with
> most IOMMUs out there, which raise an unrecoverable asynchronous
> exception instead and can't implement a proper "IOMMU page
> fault". Furthermore the speed of the invalidate may not be optimal
> with IOMMUs which would then be an added cost to pay for swapping and
> memory migration.
>
> This is anyway a regression of the previous guarantees a pin would
> provide, if we want to bring back the old semantics of a page pin, I
> think fixing both places like I attempted to do (modulo two
> implementation bugs) is better than fixing only the THP case.

Agreed. I just didn't see the two-refcounts solution.

> If we instead leave things as is and weaken the semantics of a page
> pin, the alternative, dealing with the weakened semantics inside the
> vfio code, is to use get_user_pages with write=1 forced and then
> it'll probably work also with current upstream (unless it's fork/exec,
> but I don't think it is, MADV_DONTFORK would be recommended anyway for
> usages like this with vfio if fork can ever run and there are threads
> in the parent, even O_DIRECT generates data corruption without
> MADV_DONTFORK in such conditions for similar reasons).
--
Kirill A. Shutemov