Date: Mon, 2 May 2016 13:15:13 +0200
From: Jerome Glisse
To: "Kirill A. Shutemov"
Cc: Andrea Arcangeli, Alex Williamson, kirill.shutemov@linux.intel.com,
 linux-kernel@vger.kernel.org, "linux-mm@kvack.org"
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Message-ID: <20160502111513.GA4079@gmail.com>
References: <20160428102051.17d1c728@t450s.home>
 <20160428181726.GA2847@node.shutemov.name>
 <20160428125808.29ad59e5@t450s.home>
 <20160428232127.GL11700@redhat.com>
 <20160429005106.GB2847@node.shutemov.name>
 <20160428204542.5f2053f7@ul30vt.home>
 <20160429070611.GA4990@node.shutemov.name>
 <20160429163444.GM11700@redhat.com>
 <20160502104119.GA23305@node.shutemov.name>
In-Reply-To: <20160502104119.GA23305@node.shutemov.name>

On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 29, 2016 at 06:34:44PM +0200, Andrea Arcangeli wrote:
> > On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> > > Hm. I just woke up and haven't got any coffee yet, but I don't see
> > > why my approach would be worse for performance. Both have the same
> > > algorithmic complexity.
> >
> > Even before looking at the overall performance, I'm not sure your
> > patch is really fixing it all: you didn't touch reuse_swap_page,
> > which is used by do_wp_page to know if it can call do_wp_page_reuse.
> > Your patch would still trigger a COW instead of calling
> > do_wp_page_reuse, but it would only happen if the page was pinned
> > after the pmd split, which is probably not what the testcase is
> > triggering. My patch instead fixed that too.
> >
> > total_mapcount returns the wrong value for reuse_swap_page, which is
> > probably why you didn't try to use it there.
> >
> > The main issue of my patch is its performance downside:
> > page_mapcount becomes expensive for all other usages. That is still
> > better than breaking vfio, but I couldn't use total_mapcount again
> > because it counts things wrong in reuse_swap_page.
> >
> > Like I said there's room for optimizations, so today I tried to
> > optimize more stuff...
> >
> > From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
> > From: Andrea Arcangeli
> > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
> >  during WP faults
> >
> > This will provide full accuracy to the mapcount calculation in the
> > write protect faults, so page pinning will not get broken by false
> > positive copy-on-writes.
> >
> > total_mapcount() isn't the right calculation needed in
> > reuse_swap_page, so this introduces a page_trans_huge_mapcount()
> > that is effectively the fully accurate return value for
> > page_mapcount() when dealing with Transparent Hugepages. However we
> > only use page_trans_huge_mapcount() during COW faults, where it is
> > strictly needed, due to its higher runtime cost.
> >
> > Signed-off-by: Andrea Arcangeli
> > ---
> >  include/linux/mm.h   |  5 +++++
> >  include/linux/swap.h |  3 +--
> >  mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
> >  mm/swapfile.c        |  5 +----
> >  4 files changed, 43 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8fb3604..c2026a1 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
> >
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  int total_mapcount(struct page *page);
> > +int page_trans_huge_mapcount(struct page *page);
> >  #else
> >  static inline int total_mapcount(struct page *page)
> >  {
> >          return page_mapcount(page);
> >  }
> > +static inline int page_trans_huge_mapcount(struct page *page)
> > +{
> > +        return page_mapcount(page);
> > +}
> >  #endif
> >
> >  static inline struct page *virt_to_head_page(const void *x)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2f6478f..905bf8e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
> >          return 0;
> >  }
> >
> > -#define reuse_swap_page(page) \
> > -        (!PageTransCompound(page) && page_mapcount(page) == 1)
> > +#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
> >
> >  static inline int try_to_free_swap(struct page *page)
> >  {
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 06bce0f..6a6d9c0 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >          VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> >          /*
> >           * We can only reuse the page if nobody else maps the huge page or it's
> > -         * part. We can do it by checking page_mapcount() on each sub-page, but
> > -         * it's expensive.
> > -         * The cheaper way is to check page_count() to be equal 1: every
> > -         * mapcount takes page reference reference, so this way we can
> > -         * guarantee, that the PMD is the only mapping.
> > -         * This can give false negative if somebody pinned the page, but that's
> > -         * fine.
> > +         * part.
> >           */
> > -        if (page_mapcount(page) == 1 && page_count(page) == 1) {
> > +        if (page_trans_huge_mapcount(page) == 1) {
> >                  pmd_t entry;
> >                  entry = pmd_mkyoung(orig_pmd);
> >                  entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> > @@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
> >  }
> >
> >  /*
> > + * This calculates accurately how many mappings a transparent hugepage
> > + * has (unlike page_mapcount() which isn't fully accurate). This full
> > + * accuracy is primarily needed to know if copy-on-write faults can
> > + * take over the page and change the mapping to read-write instead of
> > + * copying it. This is different from total_mapcount() too: we must
> > + * not count all mappings on the subpages individually, but instead
> > + * we must check the highest mapcount any one of the subpages has.
> > + *
> > + * It would be entirely safe and even more correct to replace
> > + * page_mapcount() with page_trans_huge_mapcount(), however we only
> > + * use page_trans_huge_mapcount() in the copy-on-write faults where
> > + * we need full accuracy to avoid breaking page pinning.
> > + */
> > +int page_trans_huge_mapcount(struct page *page)
> > +{
> > +        int i, ret;
> > +
> > +        VM_BUG_ON_PAGE(PageTail(page), page);
> > +
> > +        if (likely(!PageCompound(page)))
> > +                return atomic_read(&page->_mapcount) + 1;
> > +
> > +        ret = 0;
> > +        if (likely(!PageHuge(page))) {
> > +                for (i = 0; i < HPAGE_PMD_NR; i++)
> > +                        ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
> > +                if (PageDoubleMap(page))
> > +                        ret -= 1;
> > +        }
> > +        ret += compound_mapcount(page);
> > +        return ret;
> > +}
> > +
> > +/*
> >   * This function splits huge page into normal pages. @page can point to any
> >   * subpage of huge page to split. Split doesn't change the position of @page.
> >   *
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 83874ec..984470a 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
> >          VM_BUG_ON_PAGE(!PageLocked(page), page);
> >          if (unlikely(PageKsm(page)))
> >                  return 0;
> > -        /* The page is part of THP and cannot be reused */
> > -        if (PageTransCompound(page))
> > -                return 0;
> > -        count = page_mapcount(page);
> > +        count = page_trans_huge_mapcount(page);
> >          if (count <= 1 && PageSwapCache(page)) {
> >                  count += page_swapcount(page);
> >                  if (count == 1 && !PageWriteback(page)) {
>
> I don't think this would work correctly. Let's check one of the callers:
>
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                 unsigned long address, pte_t *page_table, pmd_t *pmd,
>                 spinlock_t *ptl, pte_t orig_pte)
>         __releases(ptl)
> {
> ...
>         if (reuse_swap_page(old_page)) {
>                 /*
>                  * The page is all ours. Move it to our anon_vma so
>                  * the rmap code will not search our parent or siblings.
>                  * Protected against the rmap code by the page lock.
>                  */
>                 page_move_anon_rmap(old_page, vma, address);
>                 unlock_page(old_page);
>                 return wp_page_reuse(mm, vma, address, page_table, ptl,
>                                      orig_pte, old_page, 0, 0);
>         }
>
> The first thing to notice is that old_page can be a tail page here,
> therefore page_move_anon_rmap() would have to be able to handle that
> after your patch, which it doesn't.
>
> But I think there's a bigger problem.
>
> Consider the following situation: after split_huge_pmd() we have a
> pte-mapped THP, fork() comes along, and now the page is shared between
> two processes. The child process munmap()s one half of the THP page,
> and the parent munmap()s the other half.
>
> IIUC, after that page_trans_huge_mapcount() would give us 1, as all 4k
> subpages have a mapcount of exactly one. A fault in the child would
> trigger do_wp_page(), reuse_swap_page() would return true, and that
> would lead to page_move_anon_rmap() transferring the whole compound
> page to the child's anon_vma. That's not correct.
>
> We should at least avoid page_move_anon_rmap() for compound pages there.
>
> The other thing I would like to discuss is whether there's a problem on
> the vfio side. To me it looks like vfio expects a guarantee from
> get_user_pages() which it doesn't provide: obtaining a pin on the page
> doesn't guarantee that the page is going to remain mapped into
> userspace until the pin is gone.
>
> Even with the THP COW regression fixed, vfio would stay fragile: any
> MADV_DONTNEED/fork()/mremap()/whatever would break vfio's expectation.
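(For concreteness, here is a small userspace sketch of the arithmetic in
the fork()/munmap() scenario described above. It is not from the
original thread: the 512-subpage figure and the variable names are
illustrative, and the PageDoubleMap() adjustment is ignored for
simplicity.)

#include <stdio.h>

#define HPAGE_PMD_NR 512        /* subpages of a 2MB THP, illustrative */

int main(void)
{
        int mapcount[HPAGE_PMD_NR];
        int compound_mapcount = 0;      /* PMD mapping gone after the split */
        int i, max = 0;

        /* After split_huge_pmd() and fork(): every subpage mapped by both. */
        for (i = 0; i < HPAGE_PMD_NR; i++)
                mapcount[i] = 2;

        /* Child munmap()s the first half, parent munmap()s the second. */
        for (i = 0; i < HPAGE_PMD_NR / 2; i++)
                mapcount[i]--;          /* now mapped by the parent only */
        for (i = HPAGE_PMD_NR / 2; i < HPAGE_PMD_NR; i++)
                mapcount[i]--;          /* now mapped by the child only */

        /* page_trans_huge_mapcount() takes the highest subpage mapcount... */
        for (i = 0; i < HPAGE_PMD_NR; i++)
                if (mapcount[i] > max)
                        max = mapcount[i];

        /* ...plus compound_mapcount(): prints 1 even though two processes
         * still map parts of the compound page. */
        printf("page_trans_huge_mapcount() would return %d\n",
               max + compound_mapcount);
        return 0;
}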
Well, I don't think that is a fair/accurate assessment of
get_user_pages(): a page must remain mapped at the same virtual address
until the pin is gone. I am ignoring mremap() because it is a deliberate
decision from userspace, and while the virtual address changes in that
case, the pinned page behind it should move with the mapping. The same
goes for MADV_DONTNEED. I agree that get_user_pages() is broken after
fork(), but that has been the case since the dawn of time, so it is
expected. If not vfio, then direct-io has been expecting this kind of
behavior for a long time, so I see it as part of the get_user_pages()
guarantee.

Concerning vfio, not providing this guarantee would break countless
workloads. Things like qemu/kvm allocate anonymous memory and hand it
over to the guest kernel, which presents it as guest memory. A device
driver inside the guest then needs a bus mapping for a given (guest)
page, which from the host's point of view means a mapping from an
anonymous page to a bus mapping; but for the guest to keep accessing
the same page, the anonymous mapping (i.e. a specific virtual address
on the host side) must keep pointing to the same page. This has been
the case with get_user_pages() until now, so whether we like it or not
we must keep that guarantee.

Such workloads know that they can't do mremap()/fork()/... and still
keep that guarantee, but they do expect the existing guarantee, and I
don't think we can break that.

Cheers,
Jérôme
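(As a concrete reference for the pinning pattern described above, here
is a minimal kernel-side sketch. It is not the actual vfio code: the
helper names are made up, and it assumes the get_user_pages_fast()
signature of kernels from that era.)

#include <linux/mm.h>
#include <linux/err.h>

/* Illustrative only: pin the page backing one userspace virtual address,
 * the way vfio-like users of get_user_pages() do. The expectation is
 * that, while this reference is held, a write fault on @vaddr must not
 * silently replace the page with a COW copy.
 */
static struct page *pin_one_user_page(unsigned long vaddr)
{
        struct page *page;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ERR_PTR(ret) : ERR_PTR(-EFAULT);

        /* The caller would now build its bus/DMA mapping on this page. */
        return page;
}

static void unpin_one_user_page(struct page *page)
{
        /* Drop the pin once the bus/DMA mapping has been torn down. */
        put_page(page);
}

The struct page obtained here is what the guest-visible bus mapping is
built on, which is why the host virtual address must keep resolving to
that same page for as long as the pin is held.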