From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752182AbcD2Qes (ORCPT <rfc822;w@1wt.eu>);
	Fri, 29 Apr 2016 12:34:48 -0400
Received: from mx1.redhat.com ([209.132.183.28]:59832 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751303AbcD2Qer (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 29 Apr 2016 12:34:47 -0400
Date: Fri, 29 Apr 2016 18:34:44 +0200
From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Alex Williamson <alex.williamson@redhat.com>,
        kirill.shutemov@linux.intel.com, linux-kernel@vger.kernel.org,
        "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting
 redesign
Message-ID: <20160429163444.GM11700@redhat.com>
References: <20160428102051.17d1c728@t450s.home>
 <20160428181726.GA2847@node.shutemov.name>
 <20160428125808.29ad59e5@t450s.home>
 <20160428232127.GL11700@redhat.com>
 <20160429005106.GB2847@node.shutemov.name>
 <20160428204542.5f2053f7@ul30vt.home>
 <20160429070611.GA4990@node.shutemov.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160429070611.GA4990@node.shutemov.name>
User-Agent: Mutt/1.6.0 (2016-04-01)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Fri, 29 Apr 2016 16:34:46 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> Hm. I just woke up and haven't got any coffee yet, but I don't why my
> approach would be worse for performance. Both have the same algorithmic
> complexity.

Even before looking at the overall performance, I'm not sure your
patch is really fixing it all: you didn't touch reuse_swap_page which
is used by do_wp_page to know if it can call do_wp_page_reuse. Your
patch would still trigger a COW instead of calling do_wp_page_reuse,
but it would only happen if the page was pinned after the pmd split,
which is probably not what the testcase is triggering. My patch
instead fixed that too.

total_mapcount returns the wrong value for reuse_swap_page, which is
probably why you didn't try to use it there.

The main issue of my patch is that it has a performance downside that
is page_mapcount becomes expensive for all other usages, which is
better than breaking vfio but I couldn't use total_mapcount again
because it counts things wrong in reuse_swap_page.

Like I said there's room for optimizations so today I tried to
optimize more stuff...

>>From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults

This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page, so this introduces a page_trans_huge_mapcount() that
is effectively the full accurate return value for page_mapcount() if
dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h   |  5 +++++
 include/linux/swap.h |  3 +--
 mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
 mm/swapfile.c        |  5 +----
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fb3604..c2026a1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
+int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
+static inline int page_trans_huge_mapcount(struct page *page)
+{
+	return page_mapcount(page);
+}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2f6478f..905bf8e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(!PageTransCompound(page) && page_mapcount(page) == 1)
+#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 06bce0f..6a6d9c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
 	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part. We can do it by checking page_mapcount() on each sub-page, but
-	 * it's expensive.
-	 * The cheaper way is to check page_count() to be equal 1: every
-	 * mapcount takes page reference reference, so this way we can
-	 * guarantee, that the PMD is the only mapping.
-	 * This can give false negative if somebody pinned the page, but that's
-	 * fine.
+	 * part.
 	 */
-	if (page_mapcount(page) == 1 && page_count(page) == 1) {
+	if (page_trans_huge_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
 }
 
 /*
+ * This calculates accurately how many mappings a transparent hugepage
+ * has (unlike page_mapcount() which isn't fully accurate). This full
+ * accuracy is primarily needed to know if copy-on-write faults can
+ * takeover the page and change the mapping to read-write instead of
+ * copying them. This is different from total_mapcount() too: we must
+ * not count all mappings on the subpages individually, but instead we
+ * must check the highest mapcount any one of the subpages has.
+ *
+ * It would be entirely safe and even more correct to replace
+ * page_mapcount() with page_trans_huge_mapcount(), however we only
+ * use page_trans_huge_mapcount() in the copy-on-write faults where we
+ * need full accuracy to avoid breaking page pinning.
+ */
+int page_trans_huge_mapcount(struct page *page)
+{
+	int i, ret;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	ret = 0;
+	if (likely(!PageHuge(page))) {
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
+		if (PageDoubleMap(page))
+			ret -= 1;
+	}
+	ret += compound_mapcount(page);
+	return ret;
+}
+
+/*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..984470a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
-	/* The page is part of THP and cannot be reused */
-	if (PageTransCompound(page))
-		return 0;
-	count = page_mapcount(page);
+	count = page_trans_huge_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
 		if (count == 1 && !PageWriteback(page)) {