Re: [BUG] vfio device assignment regression with THP ref counting redesign

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Date: Mon, 2 May 2016 19:00:42 +0300	[thread overview]
Message-ID: <20160502160042.GC24419@node.shutemov.name> (raw)
In-Reply-To: <20160502152307.GA12310@redhat.com>

On Mon, May 02, 2016 at 05:23:07PM +0200, Andrea Arcangeli wrote:
> On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > I don't think this would work correctly. Let's check one of callers:
> > 
> > static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> > 		spinlock_t *ptl, pte_t orig_pte)
> > 	__releases(ptl)
> > {
> > ...
> > 		if (reuse_swap_page(old_page)) {
> > 			/*
> > 			 * The page is all ours.  Move it to our anon_vma so
> > 			 * the rmap code will not search our parent or siblings.
> > 			 * Protected against the rmap code by the page lock.
> > 			 */
> > 			page_move_anon_rmap(old_page, vma, address);
> > 			unlock_page(old_page);
> > 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> > 					     orig_pte, old_page, 0, 0);
> > 		}
> > 
> > The first thing to notice is that old_page can be a tail page here
> > therefore page_move_anon_rmap() should be able to handle this after you
> > patch, which it doesn't.
> 
> Agreed, that's an implementation error and easy to fix.
> 
> > But I think there's a bigger problem.
> > 
> > Consider the following situation: after split_huge_pmd() we have
> > pte-mapped THP, fork() comes and now the pages is shared between two
> > processes. Child process munmap()s one half of the THP page, parent
> > munmap()s the other half.
> > 
> > IIUC, afther that page_trans_huge_mapcount() would give us 1 as all 4k
> > subpages have mapcount exactly one. Fault in the child would trigger
> > do_wp_page() and reuse_swap_page() returns true, which would lead to
> > page_move_anon_rmap() tranferring the whole compound page to child's
> > anon_vma. That's not correct.
> > 
> > We should at least avoid page_move_anon_rmap() for compound pages there.
> 
> So (compound_head() missing aside) the calculation I was doing is
> correct with regard to taking over the page and marking the pagetable
> read-write instead of triggering a COW and breaking the pinning, but
> it's not right only in terms of calling page_move_anon_rmap? The child
> or parent would then lose visibility on its ptes if the compound page
> is moved to the local vma->anon_vma.
> 
> The fix should be just to change page_trans_huge_mapcount() to return
> two refcounts, one "hard" for the pinning, and one "soft" for the rmap
> which will be the same as total_mapcount. The runtime cost will remain
> the same, so a fix can be easy for this one too.

Sounds correct, but code is going to be ugly :-/

> > Other thing I would like to discuss is if there's a problem on vfio side.
> > To me it looks like vfio expects guarantee from get_user_pages() which it
> > doesn't provide: obtaining pin on the page doesn't guarantee that the page
> > is going to remain mapped into userspace until the pin is gone.
> > 
> > Even with THP COW regressing fixed, vfio would stay fragile: any
> > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> > broken.
> 
> vfio must run as root, it will take care of not doing such things, it
> just needs a way to prevent the page to be moved so it can DMA into it
> and mlock is not enough. This clearly has to be caused by a
> get_user_pages(write=0) or by a serialized fork/exec() while a
> longstanding page pin is being held (and to be safe fork/exec had to
> be serialized in a way that the parent process wouldn't write to the
> pinned page until after exec has run in the child, or it's already
> racy no matter what kernel).
> 
> I agree it's somewhat fragile, the problem here is that the THP
> refcounting change made it even weaker than it already was.

I didn't say we shouldn't fix the problem on THP side. But the attitude
"get_user_pages() would magically freeze page tables" worries me.

> Ideally the MMU notifier invalidate should be used instead of pinning
> the page, that would make it 100% robust and it wouldn't even pin the
> page at all.
> 
> However we can't send an MMU notifier invalidate to an IOMMU because
> next time the IOMMU non-present physical address is used it would kill
> the app. Some new IOMMU can raise an exception synchronously that we
> could use to implement a IOMMU secondary MMU page fault to make the
> MMU notifier model work with IOMMUs too, but that's not feasible with
> most IOMMU out there that raises an unrecoverable asynchronous
> exception instead and can't implement a proper "IOMMU page
> fault". Furthermore the speed of the invalidate may not be optimal
> with IOMMUs which would then be an added cost to pay for swapping and
> memory migration.
> 
> This is anyway a regression of the previous guarantees a pin would
> provide, if we want to bring back the old semantics of a page pin, I
> think fixing both places like I attempted to do (modulo two
> implementation bugs) is better than fixing only the THP case.

Agreed. I just didn't see the two-refcounts solution.

> If instead leave things as is, and we weaken the semantics of a page
> pin, the alternative to deal with the even weakened semantics inside
> the vfio code, is to use get_user_pages with write=1 forced and then
> it'll probably work also with current upstream (unless it's fork/exec,
> but I don't think it is, MADV_DONTFORK would be recommended anyway for
> usages like this with vfio if fork can ever run and there are threads
> in the parent, even O_DIRECT generates data corruption without
> MADV_DONTFORK in such conditions for similar reasons).

-- 
 Kirill A. Shutemov