From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754182AbcEBPX1 (ORCPT <rfc822;w@1wt.eu>);
	Mon, 2 May 2016 11:23:27 -0400
Received: from mx1.redhat.com ([209.132.183.28]:41419 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751623AbcEBPXK (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 2 May 2016 11:23:10 -0400
Date: Mon, 2 May 2016 17:23:07 +0200
From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Alex Williamson <alex.williamson@redhat.com>,
        kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
        "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting
 redesign
Message-ID: <20160502152307.GA12310@redhat.com>
References: <20160428102051.17d1c728@t450s.home>
 <20160428181726.GA2847@node.shutemov.name>
 <20160428125808.29ad59e5@t450s.home>
 <20160428232127.GL11700@redhat.com>
 <20160429005106.GB2847@node.shutemov.name>
 <20160428204542.5f2053f7@ul30vt.home>
 <20160429070611.GA4990@node.shutemov.name>
 <20160429163444.GM11700@redhat.com>
 <20160502104119.GA23305@node.shutemov.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160502104119.GA23305@node.shutemov.name>
User-Agent: Mutt/1.6.0 (2016-04-01)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Mon, 02 May 2016 15:23:09 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> I don't think this would work correctly. Let's check one of callers:
> 
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> 		spinlock_t *ptl, pte_t orig_pte)
> 	__releases(ptl)
> {
> ...
> 		if (reuse_swap_page(old_page)) {
> 			/*
> 			 * The page is all ours.  Move it to our anon_vma so
> 			 * the rmap code will not search our parent or siblings.
> 			 * Protected against the rmap code by the page lock.
> 			 */
> 			page_move_anon_rmap(old_page, vma, address);
> 			unlock_page(old_page);
> 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> 					     orig_pte, old_page, 0, 0);
> 		}
> 
> The first thing to notice is that old_page can be a tail page here,
> therefore page_move_anon_rmap() should be able to handle this after your
> patch, which it doesn't.

Agreed, that's an implementation error and easy to fix.

> But I think there's a bigger problem.
> 
> Consider the following situation: after split_huge_pmd() we have
> pte-mapped THP, fork() comes and now the page is shared between two
> processes. Child process munmap()s one half of the THP page, parent
> munmap()s the other half.
> 
> IIUC, after that page_trans_huge_mapcount() would give us 1 as all 4k
> subpages have mapcount exactly one. Fault in the child would trigger
> do_wp_page() and reuse_swap_page() returns true, which would lead to
> page_move_anon_rmap() transferring the whole compound page to child's
> anon_vma. That's not correct.
> 
> We should at least avoid page_move_anon_rmap() for compound pages there.

So (compound_head() missing aside) the calculation I was doing is
correct with regard to taking over the page and marking the pagetable
read-write instead of triggering a COW and breaking the pinning, but
it's not right only in terms of calling page_move_anon_rmap? The child
or parent would then lose visibility on its ptes if the compound page
is moved to the local vma->anon_vma.

The fix should be just to change page_trans_huge_mapcount() to return
two refcounts, one "hard" for the pinning, and one "soft" for the rmap
which will be the same as total_mapcount. The runtime cost will remain
the same, so a fix can be easy for this one too.
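
To make the two-count idea concrete, here is a userspace toy model of
the scenario described above. It is not kernel code and not the actual
patch: every name in it is hypothetical, the 512-subpage size assumes
a 2M THP with 4k pages, and taking "hard" as the highest per-subpage
mapcount and gating page_move_anon_rmap() on "soft" == 1 is my reading
of the proposal, not something this thread has settled.

```c
/* Userspace toy model, NOT kernel code. Every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define MODEL_SUBPAGES 512 /* 4k subpages in a 2M THP (assumption) */

struct model_counts {
	int hard; /* highest mapcount of any one subpage: reuse vs COW */
	int soft; /* sum over all subpages, standing in for total_mapcount */
};

/* One pass over the subpages yields both counts at the same cost. */
struct model_counts model_trans_huge_mapcount(const int *mapcount)
{
	struct model_counts c = { 0, 0 };

	for (size_t i = 0; i < MODEL_SUBPAGES; i++) {
		if (mapcount[i] > c.hard)
			c.hard = mapcount[i];
		c.soft += mapcount[i];
	}
	return c;
}

/* pte-mapped THP, then fork(), then each process unmaps one half. */
void model_fork_then_split_unmap(int *mapcount)
{
	size_t i;

	for (i = 0; i < MODEL_SUBPAGES; i++)
		mapcount[i] = 2;	/* mapped by parent and child */
	for (i = 0; i < MODEL_SUBPAGES / 2; i++)
		mapcount[i]--;		/* child unmaps the lower half */
	for (i = MODEL_SUBPAGES / 2; i < MODEL_SUBPAGES; i++)
		mapcount[i]--;		/* parent unmaps the upper half */
}

/* "hard" decides whether the pte can be made writable without a COW. */
int model_can_reuse(struct model_counts c)
{
	return c.hard == 1;
}

/* "soft" decides whether the page may move to the local anon_vma. */
int model_can_move_anon_rmap(struct model_counts c)
{
	return c.soft == 1;
}
```

In this model the scenario ends with hard == 1 and soft == 512, so
the write fault can reuse the page and keep the pin intact, while the
anon_vma move is skipped because another process still maps part of
the compound page.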

> Other thing I would like to discuss is if there's a problem on vfio side.
> To me it looks like vfio expects guarantee from get_user_pages() which it
> doesn't provide: obtaining pin on the page doesn't guarantee that the page
> is going to remain mapped into userspace until the pin is gone.
> 
> Even with the THP COW regression fixed, vfio would stay fragile: any
> MADV_DONTNEED/fork()/mremap()/whatever would break vfio's
> expectation.

vfio must run as root and will take care not to do such things. It
just needs a way to prevent the page from being moved so it can DMA
into it, and mlock is not enough. The regression clearly has to be
caused by a get_user_pages(write=0) or by a serialized fork/exec()
while a long-standing page pin is being held (and to be safe,
fork/exec would have to be serialized so that the parent process
doesn't write to the pinned page until after exec has run in the
child, or it's already racy no matter what kernel).

I agree it's somewhat fragile; the problem here is that the THP
refcounting change made it even weaker than it already was.

Ideally the MMU notifier invalidate should be used instead of pinning
the page; that would make it 100% robust and it wouldn't pin the page
at all.

However we can't send an MMU notifier invalidate to an IOMMU, because
the next time the non-present IOMMU physical address is used it would
kill the app. Some new IOMMUs can raise an exception synchronously,
which we could use to implement an IOMMU secondary MMU page fault and
make the MMU notifier model work with IOMMUs too, but that's not
feasible with most IOMMUs out there, which raise an unrecoverable
asynchronous exception instead and can't implement a proper "IOMMU
page fault". Furthermore the speed of the invalidate may not be
optimal with IOMMUs, which would then be an added cost to pay for
swapping and memory migration.

This is anyway a regression from the guarantees a pin previously
provided. If we want to bring back the old semantics of a page pin, I
think fixing both places like I attempted to do (modulo two
implementation bugs) is better than fixing only the THP case.

If instead we leave things as is and weaken the semantics of a page
pin, the alternative is to deal with the even weaker semantics inside
the vfio code by using get_user_pages with write=1 forced; then it'll
probably work with current upstream too (unless it's fork/exec, but I
don't think it is). MADV_DONTFORK would be recommended anyway for
usages like this with vfio if fork can ever run and there are threads
in the parent; even O_DIRECT generates data corruption without
MADV_DONTFORK in such conditions, for similar reasons.
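
For reference, a minimal userspace sketch of the MADV_DONTFORK
recommendation, assuming Linux and glibc. The helper names are
hypothetical and nothing here touches vfio or pins a page; it only
shows that a range marked with madvise(MADV_DONTFORK) is not copied
into a forked child, so fork() can never leave that range
COW-shared between parent and child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: anonymous buffer that fork() will not copy. */
void *alloc_dontfork(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/*
 * Hypothetical check: fork and ask the child whether the first page of
 * buf is mapped there. mincore() fails with ENOMEM on an unmapped range.
 * Returns 1 if the child lacks the mapping, 0 if it has it, -1 on error.
 */
int child_lacks_mapping(void *buf)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char vec[1];
	int status;
	pid_t pid;

	if (page <= 0)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		int r = mincore(buf, (size_t)page, vec);

		/* Exit status 0 means "not mapped in the child". */
		_exit((r == -1 && errno == ENOMEM) ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

A caller would allocate the DMA buffer with alloc_dontfork() before
handing it to the driver. The madvise() call has to happen before any
fork() can run, since it only affects forks that come after it.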