From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Date: Mon, 2 May 2016 19:00:42 +0300
Message-ID: <20160502160042.GC24419@node.shutemov.name>
In-Reply-To: <20160502152307.GA12310@redhat.com>

On Mon, May 02, 2016 at 05:23:07PM +0200, Andrea Arcangeli wrote:
> On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > I don't think this would work correctly. Let's check one of callers:
> > 
> > static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> > 		spinlock_t *ptl, pte_t orig_pte)
> > 	__releases(ptl)
> > {
> > ...
> > 		if (reuse_swap_page(old_page)) {
> > 			/*
> > 			 * The page is all ours.  Move it to our anon_vma so
> > 			 * the rmap code will not search our parent or siblings.
> > 			 * Protected against the rmap code by the page lock.
> > 			 */
> > 			page_move_anon_rmap(old_page, vma, address);
> > 			unlock_page(old_page);
> > 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> > 					     orig_pte, old_page, 0, 0);
> > 		}
> > 
> > The first thing to notice is that old_page can be a tail page here,
> > therefore page_move_anon_rmap() should be able to handle this after your
> > patch, which it doesn't.
> 
> Agreed, that's an implementation error and easy to fix.
> 
> > But I think there's a bigger problem.
> > 
> > Consider the following situation: after split_huge_pmd() we have a
> > pte-mapped THP, fork() comes and now the page is shared between two
> > processes. The child process munmap()s one half of the THP, the parent
> > munmap()s the other half.
> > 
> > IIUC, after that page_trans_huge_mapcount() would give us 1, as all 4k
> > subpages have mapcount exactly one. A fault in the child would trigger
> > do_wp_page() and reuse_swap_page() returns true, which would lead to
> > page_move_anon_rmap() transferring the whole compound page to the child's
> > anon_vma. That's not correct.
> > 
> > We should at least avoid page_move_anon_rmap() for compound pages there.
> 
> So (compound_head() missing aside) the calculation I was doing is
> correct with regard to taking over the page and marking the pagetable
> read-write instead of triggering a COW and breaking the pinning, but
> it's not right only in terms of calling page_move_anon_rmap? The child
> or parent would then lose visibility on its ptes if the compound page
> is moved to the local vma->anon_vma.
> 
> The fix should be just to change page_trans_huge_mapcount() to return
> two refcounts, one "hard" for the pinning, and one "soft" for the rmap
> which will be the same as total_mapcount. The runtime cost will remain
> the same, so a fix can be easy for this one too.

Sounds correct, but code is going to be ugly :-/
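
To make the fork()/munmap() scenario and the two-count idea concrete, here is a toy userspace model, not kernel code. It assumes the "hard" count behaves like the maximum per-subpage mapcount and the "soft" count like total_mapcount (the sum over subpages), which is how the discussion above reads. All `model_*` names and the 512-subpage size are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 2MB THP made of 4k subpages, as on x86-64. */
#define NR_SUBPAGES 512

/* Toy stand-in for a pte-mapped THP: one PTE mapcount per 4k subpage. */
struct thp_model {
	int subpage_mapcount[NR_SUBPAGES];
};

/* State right after split_huge_pmd(): one process maps every subpage. */
void model_init_pte_mapped(struct thp_model *p)
{
	for (size_t i = 0; i < NR_SUBPAGES; i++)
		p->subpage_mapcount[i] = 1;
}

/* fork(): the child copies every present pte, adding one mapping each. */
void model_fork(struct thp_model *p)
{
	for (size_t i = 0; i < NR_SUBPAGES; i++)
		if (p->subpage_mapcount[i] > 0)
			p->subpage_mapcount[i]++;
}

/* munmap() of subpages [first, first + n) by one process. */
void model_munmap(struct thp_model *p, size_t first, size_t n)
{
	assert(first + n <= NR_SUBPAGES);
	for (size_t i = first; i < first + n; i++)
		if (p->subpage_mapcount[i] > 0)
			p->subpage_mapcount[i]--;
}

/* "Hard" count (assumed): the largest mapcount of any single subpage. */
int model_hard_mapcount(const struct thp_model *p)
{
	int max = 0;

	for (size_t i = 0; i < NR_SUBPAGES; i++)
		if (p->subpage_mapcount[i] > max)
			max = p->subpage_mapcount[i];
	return max;
}

/* "Soft" count (assumed): total number of pte mappings of the whole THP. */
int model_total_mapcount(const struct thp_model *p)
{
	int total = 0;

	for (size_t i = 0; i < NR_SUBPAGES; i++)
		total += p->subpage_mapcount[i];
	return total;
}

/* Reusing the page without COW only needs the hard count to be 1. */
int model_can_reuse(const struct thp_model *p)
{
	return model_hard_mapcount(p) == 1;
}

/* Moving the page to a local anon_vma needs a single mapping overall. */
int model_can_move_anon_rmap(const struct thp_model *p)
{
	return model_total_mapcount(p) == 1;
}
```

Running the scenario (init, fork, child unmaps the first half, parent unmaps the second half) leaves every subpage with mapcount one, so the hard count allows reuse while the soft count still reports mappings from two processes and forbids the anon_vma move.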

> > Another thing I would like to discuss is whether there's a problem on the
> > vfio side. To me it looks like vfio expects a guarantee from
> > get_user_pages() which it doesn't provide: obtaining a pin on the page
> > doesn't guarantee that the page is going to remain mapped into userspace
> > until the pin is gone.
> > 
> > Even with the THP COW regression fixed, vfio would stay fragile: any
> > MADV_DONTNEED/fork()/mremap()/whatever would break vfio's expectation.
> 
> vfio must run as root, it will take care of not doing such things, it
> just needs a way to prevent the page from being moved so it can DMA into it
> and mlock is not enough. This clearly has to be caused by a
> get_user_pages(write=0) or by a serialized fork/exec() while a
> longstanding page pin is being held (and to be safe fork/exec had to
> be serialized in a way that the parent process wouldn't write to the
> pinned page until after exec has run in the child, or it's already
> racy no matter what kernel).
> 
> I agree it's somewhat fragile, the problem here is that the THP
> refcounting change made it even weaker than it already was.

I didn't say we shouldn't fix the problem on the THP side. But the attitude
"get_user_pages() would magically freeze page tables" worries me.

> Ideally the MMU notifier invalidate should be used instead of pinning
> the page, that would make it 100% robust and it wouldn't even pin the
> page at all.
> 
> However we can't send an MMU notifier invalidate to an IOMMU because
> the next time the IOMMU non-present physical address is used it would kill
> the app. Some new IOMMUs can raise an exception synchronously that we
> could use to implement an IOMMU secondary MMU page fault to make the
> MMU notifier model work with IOMMUs too, but that's not feasible with
> most IOMMUs out there, which raise an unrecoverable asynchronous
> exception instead and can't implement a proper "IOMMU page
> fault". Furthermore the speed of the invalidate may not be optimal
> with IOMMUs, which would then be an added cost to pay for swapping and
> memory migration.
> 
> This is anyway a regression of the previous guarantees a pin would
> provide, if we want to bring back the old semantics of a page pin, I
> think fixing both places like I attempted to do (modulo two
> implementation bugs) is better than fixing only the THP case.

Agreed. I just didn't see the two-refcounts solution.

> If instead we leave things as is and weaken the semantics of a page
> pin, the alternative, to deal with the even weaker semantics inside
> the vfio code, is to use get_user_pages with write=1 forced and then
> it'll probably work also with current upstream (unless it's fork/exec,
> but I don't think it is, MADV_DONTFORK would be recommended anyway for
> usages like this with vfio if fork can ever run and there are threads
> in the parent, even O_DIRECT generates data corruption without
> MADV_DONTFORK in such conditions for similar reasons).
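
For the MADV_DONTFORK recommendation above, a minimal userspace sketch of what that could look like for a buffer that will later be pinned for DMA. The helper name `alloc_nofork_buffer` is hypothetical; mmap(2) and madvise(2) with MADV_DONTFORK are the standard Linux calls. This only keeps the range out of a forked child; it says nothing about how vfio itself maps or pins the memory.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate an anonymous private buffer and mark it
 * MADV_DONTFORK so that a later fork() does not share it with the child.
 * Without a shared mapping in the child, a write in the parent has no
 * reason to COW the page away from a long-standing pin.
 * Returns NULL on failure.
 */
void *alloc_nofork_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* Exclude the range from any child created by fork(). */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

The caller would release the buffer with munmap(buf, len) once the device no longer uses it.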

-- 
 Kirill A. Shutemov
