* [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2
@ 2013-11-15 17:47 Andrea Arcangeli
  2013-11-15 17:47 ` [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-15 17:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

Hi,

1/3 is a bugfix, so it should be applied more urgently. 1/3 is not as
fast as the current upstream code in the extreme hugetlbfs + directio
8GB/sec benchmark (but 3/3 should fill the gap later). The code is
identical to the one I posted in v1, just rebased on upstream, and it
was developed in collaboration with Khalid, who already tested it.

2/3 and 3/3 have had very little testing yet, and they're incremental
optimizations. 2/3 is minor and most certainly worth applying later.

3/3 instead complicates things a bit and adds more branches to the THP
fast paths, so it should only be applied if the benchmarks of
hugetlbfs + directio show that it is very worthwhile (that has not
been verified yet). If it's not worthwhile, 3/3 should be dropped (and
the gap should be filled in some other way if the gap is not caused by
the _mapcount mangling as I guessed). Ideally this should bring even
more performance than the current upstream code: the current upstream
code still increments the _mapcount in gup_fast by mistake, while this
also eliminates the locked op on the tail page cacheline in gup_fast
(which is required for correctness too).

As a side note: the _mapcount refcounting on tail pages is only needed
for THP, as it is fundamental information required by
split_huge_page_refcount to be able to distribute the head refcounts
during the split. And it is done on _mapcount instead of _count
because _count would be screwed up badly by the get_page_unless_zero
speculative pagecache accesses.
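
For readers unfamiliar with the idiom, the speculative accesses
mentioned above boil down to this helper (shown for context, as in
the 3.12-era include/linux/mm.h; quoted from memory, so treat it as a
sketch):

	static inline int get_page_unless_zero(struct page *page)
	{
		/*
		 * Lockless pagecache lookups grab a reference on
		 * whatever page they find and only afterwards recheck
		 * that it is still the page they wanted.
		 */
		return atomic_inc_not_zero(&page->_count);
	}

Since such a lookup can transiently elevate the _count of any page it
races with, a tail pin stored in _count could not be told apart from a
speculative reference when split_huge_page_refcount distributes the
head refcounts; the otherwise-unused _mapcount of a tail page avoids
that ambiguity.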

Andrea Arcangeli (3):
  mm: hugetlbfs: fix hugetlbfs optimization
  mm: hugetlb: use get_page_foll in follow_hugetlb_page
  mm: tail page refcounting optimization for slab and hugetlbfs

 include/linux/mm.h |  30 ++++++++-
 mm/hugetlb.c       |  19 +++++-
 mm/internal.h      |   3 +-
 mm/swap.c          | 187 ++++++++++++++++++++++++++++++++++-------------------
 4 files changed, 170 insertions(+), 69 deletions(-)



* [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-15 17:47 [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Andrea Arcangeli
@ 2013-11-15 17:47 ` Andrea Arcangeli
  2013-11-19 23:11   ` Andrew Morton
  2013-11-15 17:47 ` [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page Andrea Arcangeli
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-15 17:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

Commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 can cause a
dereference of a dangling pointer if split_huge_page runs while
PageHuge() executes and there are updates to the tail_page->private
field.

It also repeats compound_head twice for hugetlbfs, and runs
compound_head+compound_trans_head for THP, when a single call is
needed in both cases.
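
For reference, PageHuge() already walks to the head page internally
(abridged sketch for illustration, not part of this patch), yet the
hunk removed from put_compound_page() below called compound_head()
once more right after it:

	int PageHuge(struct page *page)
	{
		if (!PageCompound(page))
			return 0;

		/* the first compound_head, done inside PageHuge itself */
		page = compound_head(page);
		return get_compound_page_dtor(page) == free_huge_page;
	}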

Unlike the previous code, the new code within the PageSlab() check
doesn't need to assume that the THP page size is never bigger than the
smallest hugetlbfs page size in order to avoid memory corruption.

A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_lock label, which is
relevant for the compound_lock path too).

By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the following problem:

echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

BUG: Bad page state in process bash  pfn:59a01
page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
page flags: 0x1c00000000008000(tail)
Modules linked in:
CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000009 ffff880079cb5cc8 ffffffff81640e8b 0000000000000006
 ffffea000139b038 ffff880079cb5ce8 ffffffff8115bb15 00000000000002c1
 ffffea000139b038 ffff880079cb5d48 ffffffff8115bd83 ffff880079cb5de8
Call Trace:
 [<ffffffff81640e8b>] dump_stack+0x55/0x76
 [<ffffffff8115bb15>] bad_page+0xd5/0x130
 [<ffffffff8115bd83>] free_pages_prepare+0x213/0x280
 [<ffffffff8115df16>] __free_pages+0x36/0x80
 [<ffffffff8119b011>] update_and_free_page+0xc1/0xd0
 [<ffffffff8119b512>] free_pool_huge_page+0xc2/0xe0
 [<ffffffff8119b8cc>] set_max_huge_pages.part.58+0x14c/0x220
 [<ffffffff81308a8c>] ? _kstrtoull+0x2c/0x90
 [<ffffffff8119ba70>] nr_hugepages_store_common.isra.60+0xd0/0xf0
 [<ffffffff8119bac3>] nr_hugepages_store+0x13/0x20
 [<ffffffff812f763f>] kobj_attr_store+0xf/0x20
 [<ffffffff812354e9>] sysfs_write_file+0x189/0x1e0
 [<ffffffff811baff5>] vfs_write+0xc5/0x1f0
 [<ffffffff811bb505>] SyS_write+0x55/0xb0
 [<ffffffff81651712>] system_call_fastpath+0x16/0x1b

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/hugetlb.h |   6 ++
 mm/hugetlb.c            |  17 ++++++
 mm/swap.c               | 143 ++++++++++++++++++++++++++++--------------------
 3 files changed, 106 insertions(+), 60 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index acd2010..d4f3dbf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,6 +31,7 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
+int PageHeadHuge(struct page *page_head);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -104,6 +105,11 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
+static inline int PageHeadHuge(struct page *page_head)
+{
+	return 0;
+}
+
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7d57af2..14737f8e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -736,6 +736,23 @@ int PageHuge(struct page *page)
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
+/*
+ * PageHeadHuge() returns true only for a hugetlbfs head page, not for
+ * normal or transparent huge pages.
+ */
+int PageHeadHuge(struct page *page_head)
+{
+	compound_page_dtor *dtor;
+
+	if (!PageHead(page_head))
+		return 0;
+
+	dtor = get_compound_page_dtor(page_head);
+
+	return dtor == free_huge_page;
+}
+EXPORT_SYMBOL_GPL(PageHeadHuge);
+
 pgoff_t __basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);
diff --git a/mm/swap.c b/mm/swap.c
index 7a9f80d..84b26aa 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -82,19 +82,6 @@ static void __put_compound_page(struct page *page)
 
 static void put_compound_page(struct page *page)
 {
-	/*
-	 * hugetlbfs pages cannot be split from under us.  If this is a
-	 * hugetlbfs page, check refcount on head page and release the page if
-	 * the refcount becomes zero.
-	 */
-	if (PageHuge(page)) {
-		page = compound_head(page);
-		if (put_page_testzero(page))
-			__put_compound_page(page);
-
-		return;
-	}
-
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = compound_trans_head(page);
@@ -111,14 +98,31 @@ static void put_compound_page(struct page *page)
 			 * still hot on arches that do not support
 			 * this_cpu_cmpxchg_double().
 			 */
-			if (PageSlab(page_head)) {
-				if (PageTail(page)) {
+			if (PageSlab(page_head) || PageHeadHuge(page_head)) {
+				if (likely(PageTail(page))) {
+					/*
+					 * __split_huge_page_refcount
+					 * cannot race here.
+					 */
+					VM_BUG_ON(!PageHead(page_head));
+					atomic_dec(&page->_mapcount);
 					if (put_page_testzero(page_head))
 						VM_BUG_ON(1);
-
-					atomic_dec(&page->_mapcount);
-					goto skip_lock_tail;
+					if (put_page_testzero(page_head))
+						__put_compound_page(page_head);
+					return;
 				} else
+					/*
+					 * __split_huge_page_refcount
+					 * run before us, "page" was a
+					 * THP tail. The split
+					 * page_head has been freed
+					 * and reallocated as slab or
+					 * hugetlbfs page of smaller
+					 * order (only possible if
+					 * reallocated as slab on
+					 * x86).
+					 */
 					goto skip_lock;
 			}
 			/*
@@ -132,8 +136,27 @@ static void put_compound_page(struct page *page)
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 skip_lock:
-				if (put_page_testzero(page_head))
-					__put_single_page(page_head);
+				if (put_page_testzero(page_head)) {
+					/*
+					 * The head page may have been
+					 * freed and reallocated as a
+					 * compound page of smaller
+					 * order and then freed again.
+					 * All we know is that it
+					 * cannot have become: a THP
+					 * page, a compound page of
+					 * higher order, a tail page.
+					 * That is because we still
+					 * hold the refcount of the
+					 * split THP tail and
+					 * page_head was the THP head
+					 * before the split.
+					 */
+					if (PageHead(page_head))
+						__put_compound_page(page_head);
+					else
+						__put_single_page(page_head);
+				}
 out_put_single:
 				if (put_page_testzero(page))
 					__put_single_page(page);
@@ -155,7 +178,6 @@ out_put_single:
 			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 
-skip_lock_tail:
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
 					__put_compound_page(page_head);
@@ -198,51 +220,52 @@ bool __get_page_tail(struct page *page)
 	 * proper PT lock that already serializes against
 	 * split_huge_page().
 	 */
+	unsigned long flags;
 	bool got = false;
-	struct page *page_head;
-
-	/*
-	 * If this is a hugetlbfs page it cannot be split under us.  Simply
-	 * increment refcount for the head page.
-	 */
-	if (PageHuge(page)) {
-		page_head = compound_head(page);
-		atomic_inc(&page_head->_count);
-		got = true;
-	} else {
-		unsigned long flags;
+	struct page *page_head = compound_trans_head(page);
 
-		page_head = compound_trans_head(page);
-		if (likely(page != page_head &&
-					get_page_unless_zero(page_head))) {
-
-			/* Ref to put_compound_page() comment. */
-			if (PageSlab(page_head)) {
-				if (likely(PageTail(page))) {
-					__get_page_tail_foll(page, false);
-					return true;
-				} else {
-					put_page(page_head);
-					return false;
-				}
-			}
-
-			/*
-			 * page_head wasn't a dangling pointer but it
-			 * may not be a head page anymore by the time
-			 * we obtain the lock. That is ok as long as it
-			 * can't be freed from under us.
-			 */
-			flags = compound_lock_irqsave(page_head);
-			/* here __split_huge_page_refcount won't run anymore */
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/* Ref to put_compound_page() comment. */
+		if (PageSlab(page_head) || PageHeadHuge(page_head)) {
 			if (likely(PageTail(page))) {
+				/*
+				 * This is a hugetlbfs page or a slab
+				 * page. __split_huge_page_refcount
+				 * cannot race here.
+				 */
+				VM_BUG_ON(!PageHead(page_head));
 				__get_page_tail_foll(page, false);
-				got = true;
-			}
-			compound_unlock_irqrestore(page_head, flags);
-			if (unlikely(!got))
+				return true;
+			} else {
+				/*
+				 * __split_huge_page_refcount run
+				 * before us, "page" was a THP
+				 * tail. The split page_head has been
+				 * freed and reallocated as slab or
+				 * hugetlbfs page of smaller order
+				 * (only possible if reallocated as
+				 * slab on x86).
+				 */
 				put_page(page_head);
+				return false;
+			}
+		}
+
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = true;
 		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
 	}
 	return got;
 }


* [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page
  2013-11-15 17:47 [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Andrea Arcangeli
  2013-11-15 17:47 ` [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
@ 2013-11-15 17:47 ` Andrea Arcangeli
  2013-11-19 21:27   ` Khalid Aziz
  2013-11-15 17:47 ` [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs Andrea Arcangeli
  2013-11-18 18:04 ` [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Khalid Aziz
  3 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-15 17:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

get_page_foll is more efficient and is always safe to use under the PT
lock. Even more so for hugetlbfs, as there's no risk of race conditions
with split_huge_page regardless of the PT lock.
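
For context, get_page_foll looks roughly like this in the 3.12-era
mm/internal.h (reproduced from memory, so treat it as a sketch rather
than the authoritative source):

	static inline void get_page_foll(struct page *page)
	{
		if (unlikely(PageTail(page)))
			/*
			 * This is safe only because
			 * __split_huge_page_refcount() can't run
			 * under get_page_foll() while we hold the
			 * proper PT lock.
			 */
			__get_page_tail_foll(page, true);
		else {
			/*
			 * Getting a normal page or the head of a
			 * compound page requires an already elevated
			 * page->_count.
			 */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_inc(&page->_count);
		}
	}

In particular it never needs compound_trans_head() or the
compound_lock, which is where the saving over a plain get_page() on a
tail page comes from.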

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 14737f8e..f03e068 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3113,7 +3113,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			get_page_foll(pages[i]);
 		}
 
 		if (vmas)


* [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs
  2013-11-15 17:47 [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Andrea Arcangeli
  2013-11-15 17:47 ` [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
  2013-11-15 17:47 ` [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page Andrea Arcangeli
@ 2013-11-15 17:47 ` Andrea Arcangeli
  2013-11-19 21:27   ` Khalid Aziz
  2013-11-19 23:14   ` Andrew Morton
  2013-11-18 18:04 ` [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Khalid Aziz
  3 siblings, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-15 17:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

This skips the _mapcount mangling for slab and hugetlbfs pages.

The main trouble in doing this is to guarantee that PageSlab and
PageHeadHuge remain constant for all get_page/put_page calls run on
the tail of slab or hugetlbfs compound pages. Otherwise, if they're
set during get_page but not during put_page, the _mapcount of the
tail page would underflow.

PageHeadHuge will remain true until the compound page is released and
enters the buddy allocator, so there's no risk of it changing even if
the tail pin is the last reference left on the page.

PG_slab instead is cleared before the slab frees the head page with
put_page, so if the tail pin were released after the slab freed the
page, we would have a problem. But in the slab case the tail pin
cannot be the last reference left on the page. This is because the
slab code is free to reuse the compound page after a
kfree/kmem_cache_free without having to check if there's any tail pin
left. In turn, all tail pins must always be released while the head is
still pinned by the slab code, and so we know PG_slab will still be
set too.
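
To distill the underflow argument into code (a hypothetical sketch,
not part of the patch: get_tail_pin/put_tail_pin are made-up names,
while compound_tail_refcounted is the helper this patch adds; the
real callers are __get_page_tail_foll and put_compound_page):

	static void get_tail_pin(struct page *tail)
	{
		struct page *head = compound_head(tail);

		atomic_inc(&head->_count);
		/* whether to tail-refcount is decided at get time */
		if (compound_tail_refcounted(head))
			atomic_inc(&tail->_mapcount);
	}

	static void put_tail_pin(struct page *tail)
	{
		struct page *head = compound_head(tail);

		/* this decision must match the one made at get time */
		if (compound_tail_refcounted(head))
			atomic_dec(&tail->_mapcount);
		put_page(head);
	}

If PageSlab or PageHeadHuge could change between the two calls,
compound_tail_refcounted() would disagree with itself and the tail's
_mapcount would underflow; the rules above guarantee it cannot.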

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/hugetlb.h |  6 ------
 include/linux/mm.h      | 30 +++++++++++++++++++++++++++++-
 mm/internal.h           |  3 ++-
 mm/swap.c               | 48 ++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d4f3dbf..acd2010 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,7 +31,6 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
-int PageHeadHuge(struct page *page_head);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -105,11 +104,6 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
-static inline int PageHeadHuge(struct page *page_head)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0548eb2..c40bb53 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -414,15 +414,43 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+extern int PageHeadHuge(struct page *page_head);
+#else /* CONFIG_HUGETLB_PAGE */
+static inline int PageHeadHuge(struct page *page_head)
+{
+	return 0;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
+/*
+ * This takes a head page as parameter and tells if the
+ * tail page reference counting can be skipped.
+ *
+ * For this to be safe, PageSlab and PageHeadHuge must remain true on
+ * any given page where they return true here, until all tail pins
+ * have been released.
+ */
+static inline bool compound_tail_refcounted(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	if (PageSlab(page) || PageHeadHuge(page))
+		return false;
+	else
+		return true;
+}
+
 static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
+	 * In turn no need of compound_trans_head here.
 	 */
 	VM_BUG_ON(page_mapcount(page) < 0);
 	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
+	if (compound_tail_refcounted(compound_head(page)))
+		atomic_inc(&page->_mapcount);
 }
 
 extern bool __get_page_tail(struct page *page);
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa..a85a3ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -51,7 +51,8 @@ static inline void __get_page_tail_foll(struct page *page,
 	VM_BUG_ON(page_mapcount(page) < 0);
 	if (get_page_head)
 		atomic_inc(&page->first_page->_count);
-	atomic_inc(&page->_mapcount);
+	if (compound_tail_refcounted(page->first_page))
+		atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
index 84b26aa..51bae1d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -91,12 +91,15 @@ static void put_compound_page(struct page *page)
 			unsigned long flags;
 
 			/*
-			 * THP can not break up slab pages so avoid taking
-			 * compound_lock().  Slab performs non-atomic bit ops
-			 * on page->flags for better performance.  In particular
-			 * slab_unlock() in slub used to be a hot path.  It is
-			 * still hot on arches that do not support
-			 * this_cpu_cmpxchg_double().
+			 * THP can not break up slab pages or
+			 * hugetlbfs pages so avoid taking
+			 * compound_lock() and skip the tail page
+			 * refcounting (in _mapcount) too. Slab
+			 * performs non-atomic bit ops on page->flags
+			 * for better performance. In particular
+			 * slab_unlock() in slub used to be a hot
+			 * path. It is still hot on arches that do not
+			 * support this_cpu_cmpxchg_double().
 			 */
 			if (PageSlab(page_head) || PageHeadHuge(page_head)) {
 				if (likely(PageTail(page))) {
@@ -105,11 +108,40 @@ static void put_compound_page(struct page *page)
 					 * cannot race here.
 					 */
 					VM_BUG_ON(!PageHead(page_head));
-					atomic_dec(&page->_mapcount);
+					VM_BUG_ON(atomic_read(&page->_mapcount)
+						  != -1);
 					if (put_page_testzero(page_head))
 						VM_BUG_ON(1);
-					if (put_page_testzero(page_head))
+					if (put_page_testzero(page_head)) {
+						/*
+						 * If this is the tail
+						 * of a slab
+						 * compound page, the
+						 * tail pin must not
+						 * be the last
+						 * reference held on
+						 * the page, because
+						 * the PG_slab cannot
+						 * be cleared before
+						 * all tail pins
+						 * (which skips the
+						 * _mapcount tail
+						 * refcounting) have
+						 * been released. For
+						 * hugetlbfs the tail
+						 * pin may be the last
+						 * reference on the
+						 * page instead,
+						 * because
+						 * PageHeadHuge will
+						 * not go away until
+						 * the compound page
+						 * enters the buddy
+						 * allocator.
+						 */
+						VM_BUG_ON(PageSlab(page_head));
 						__put_compound_page(page_head);
+					}
 					return;
 				} else
 					/*


* Re: [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2
  2013-11-15 17:47 [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2013-11-15 17:47 ` [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs Andrea Arcangeli
@ 2013-11-18 18:04 ` Khalid Aziz
  2013-11-19 20:27   ` Khalid Aziz
  3 siblings, 1 reply; 13+ messages in thread
From: Khalid Aziz @ 2013-11-18 18:04 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Pravin Shelar, Greg Kroah-Hartman,
	Ben Hutchings, Christoph Lameter, Johannes Weiner, Mel Gorman,
	Rik van Riel, Andi Kleen, Minchan Kim, Linus Torvalds

On 11/15/2013 10:47 AM, Andrea Arcangeli wrote:
> Hi,
>
> 1/3 is a bugfix, so it should be applied more urgently. 1/3 is not as
> fast as the current upstream code in the extreme hugetlbfs + directio
> 8GB/sec benchmark (but 3/3 should fill the gap later). The code is
> identical to the one I posted in v1, just rebased on upstream, and it
> was developed in collaboration with Khalid, who already tested it.
>
> 2/3 and 3/3 have had very little testing yet, and they're incremental
> optimizations. 2/3 is minor and most certainly worth applying later.
>
> 3/3 instead complicates things a bit and adds more branches to the THP
> fast paths, so it should only be applied if the benchmarks of
> hugetlbfs + directio show that it is very worthwhile (that has not
> been verified yet). If it's not worthwhile, 3/3 should be dropped (and
> the gap should be filled in some other way if the gap is not caused by
> the _mapcount mangling as I guessed). Ideally this should bring even
> more performance than the current upstream code: the current upstream
> code still increments the _mapcount in gup_fast by mistake, while this
> also eliminates the locked op on the tail page cacheline in gup_fast
> (which is required for correctness too).

Hi Andrea,

I ran directio benchmark and here are the performance numbers (MBytes/sec):

Block size        3.12         3.12+patch 1      3.12+patch 1,2,3
----------        ----         ------------      ----------------
1M                8467           8114              7648
64K               4049           4043              4175

Performance numbers with 64K reads look good but there is further 
deterioration with 1M reads.

--
Khalid


* Re: [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2
  2013-11-18 18:04 ` [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Khalid Aziz
@ 2013-11-19 20:27   ` Khalid Aziz
  2013-11-19 22:52     ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Khalid Aziz @ 2013-11-19 20:27 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Pravin Shelar, Greg Kroah-Hartman,
	Ben Hutchings, Christoph Lameter, Johannes Weiner, Mel Gorman,
	Rik van Riel, Andi Kleen, Minchan Kim, Linus Torvalds

On 11/18/2013 11:04 AM, Khalid Aziz wrote:
> On 11/15/2013 10:47 AM, Andrea Arcangeli wrote:
>> Hi,
>>
>> 1/3 is a bugfix, so it should be applied more urgently. 1/3 is not as
>> fast as the current upstream code in the extreme hugetlbfs + directio
>> 8GB/sec benchmark (but 3/3 should fill the gap later). The code is
>> identical to the one I posted in v1, just rebased on upstream, and it
>> was developed in collaboration with Khalid, who already tested it.
>>
>> 2/3 and 3/3 have had very little testing yet, and they're incremental
>> optimizations. 2/3 is minor and most certainly worth applying later.
>>
>> 3/3 instead complicates things a bit and adds more branches to the THP
>> fast paths, so it should only be applied if the benchmarks of
>> hugetlbfs + directio show that it is very worthwhile (that has not
>> been verified yet). If it's not worthwhile, 3/3 should be dropped (and
>> the gap should be filled in some other way if the gap is not caused by
>> the _mapcount mangling as I guessed). Ideally this should bring even
>> more performance than the current upstream code: the current upstream
>> code still increments the _mapcount in gup_fast by mistake, while this
>> also eliminates the locked op on the tail page cacheline in gup_fast
>> (which is required for correctness too).
>
> Hi Andrea,
>
> I ran directio benchmark and here are the performance numbers (MBytes/sec):
>
> Block size        3.12         3.12+patch 1      3.12+patch 1,2,3
> ----------        ----         ------------      ----------------
> 1M                8467           8114              7648
> 64K               4049           4043              4175
>
> Performance numbers with 64K reads look good but there is further
> deterioration with 1M reads.
>
> --
> Khalid

Hi Andrea,

I found that a background task running on my test server had influenced
the performance numbers for 1M reads. I cleaned that problem up and
re-ran the test. I am seeing 8456 MB/sec with all three patches applied,
so the 1M number is looking good as well.

--
Khalid


* Re: [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page
  2013-11-15 17:47 ` [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page Andrea Arcangeli
@ 2013-11-19 21:27   ` Khalid Aziz
  0 siblings, 0 replies; 13+ messages in thread
From: Khalid Aziz @ 2013-11-19 21:27 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Pravin Shelar, Greg Kroah-Hartman,
	Ben Hutchings, Christoph Lameter, Johannes Weiner, Mel Gorman,
	Rik van Riel, Andi Kleen, Minchan Kim, Linus Torvalds

On 11/15/2013 10:47 AM, Andrea Arcangeli wrote:
> get_page_foll is more efficient and is always safe to use under the PT
> lock. Even more so for hugetlbfs, as there's no risk of race conditions
> with split_huge_page regardless of the PT lock.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>

--
Khalid



* Re: [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs
  2013-11-15 17:47 ` [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs Andrea Arcangeli
@ 2013-11-19 21:27   ` Khalid Aziz
  2013-11-19 23:14   ` Andrew Morton
  1 sibling, 0 replies; 13+ messages in thread
From: Khalid Aziz @ 2013-11-19 21:27 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Pravin Shelar, Greg Kroah-Hartman,
	Ben Hutchings, Christoph Lameter, Johannes Weiner, Mel Gorman,
	Rik van Riel, Andi Kleen, Minchan Kim, Linus Torvalds

On 11/15/2013 10:47 AM, Andrea Arcangeli wrote:
> This skips the _mapcount mangling for slab and hugetlbfs pages.
>
> The main trouble in doing this is to guarantee that PageSlab and
> PageHeadHuge remain constant for all get_page/put_page calls run on
> the tail of slab or hugetlbfs compound pages. Otherwise, if they're
> set during get_page but not during put_page, the _mapcount of the
> tail page would underflow.
>
> PageHeadHuge will remain true until the compound page is released and
> enters the buddy allocator, so there's no risk of it changing even if
> the tail pin is the last reference left on the page.
>
> PG_slab instead is cleared before the slab frees the head page with
> put_page, so if the tail pin were released after the slab freed the
> page, we would have a problem. But in the slab case the tail pin
> cannot be the last reference left on the page. This is because the
> slab code is free to reuse the compound page after a
> kfree/kmem_cache_free without having to check if there's any tail pin
> left. In turn, all tail pins must always be released while the head is
> still pinned by the slab code, and so we know PG_slab will still be
> set too.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>

--
Khalid


* Re: [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2
  2013-11-19 20:27   ` Khalid Aziz
@ 2013-11-19 22:52     ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-19 22:52 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, linux-mm, linux-kernel, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

Hi Khalid,

On Tue, Nov 19, 2013 at 01:27:22PM -0700, Khalid Aziz wrote:
> > Block size        3.12         3.12+patch 1      3.12+patch 1,2,3
> > ----------        ----         ------------      ----------------
> > 1M                8467           8114              7648
> > 64K               4049           4043              4175
> >
> > Performance numbers with 64K reads look good but there is further
> > deterioration with 1M reads.
> >
> > --
> > Khalid
> 
> Hi Andrea,
> 
> I found that a background task running on my test server had influenced 
> the performance numbers for 1M reads. I cleaned that problem up and 
> re-ran the test. I am seeing 8456 MB/sec with all three patches applied, 
> so the 1M number is looking good as well.

Good news thanks!

1/3 should go in -mm I think as it fixes many problems.

The rest can be applied with less priority and is not as urgent.

I've also tried to optimize it further in the meantime, as I thought it
wasn't fully ok yet, so I could send another patchset. I haven't
changed 1/3 and I don't plan to change it. And I kept 3/3 at the end,
as it's the one with a bit more complexity than the rest.

I basically removed a few more atomic ops from each put_page/get_page
for both hugetlbfs and slab, and the important thing is they're
zero-cost changes for the non-hugetlbfs/slab fast paths, so they're
probably worth it.


* Re: [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-15 17:47 ` [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
@ 2013-11-19 23:11   ` Andrew Morton
  2013-11-20  0:26     ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2013-11-19 23:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

On Fri, 15 Nov 2013 18:47:46 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:

> Commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 can cause a
> dereference of a dangling pointer if split_huge_page runs while
> PageHuge() executes and there are updates to the tail_page->private
> field.
> 
> It also repeats compound_head twice for hugetlbfs, and runs
> compound_head+compound_trans_head for THP, when a single call is
> needed in both cases.
> 
> Unlike the previous code, the new code within the PageSlab() check
> doesn't need to assume that the THP page size is never bigger than the
> smallest hugetlbfs page size in order to avoid memory corruption.
> 
> A longstanding theoretical race condition was found while fixing the
> above (see the change right after the skip_lock label, which is
> relevant for the compound_lock path too).
> 
> By re-establishing the _mapcount tail refcounting for all compound
> pages, this also fixes the following problem:
> 
> echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 
> ...
>
> +/*
> + * PageHeadHuge() returns true only for a hugetlbfs head page, not for
> + * normal or transparent huge pages.
> + */
> +int PageHeadHuge(struct page *page_head)
> +{
> +	compound_page_dtor *dtor;
> +
> +	if (!PageHead(page_head))
> +		return 0;
> +
> +	dtor = get_compound_page_dtor(page_head);
> +
> +	return dtor == free_huge_page;
> +}
> +EXPORT_SYMBOL_GPL(PageHeadHuge);

This is all rather verbose.  How about we do this?

--- a/mm/hugetlb.c~mm-hugetlbc-simplify-pageheadhuge-and-pagehuge
+++ a/mm/hugetlb.c
@@ -690,15 +690,11 @@ static void prep_compound_gigantic_page(
  */
 int PageHuge(struct page *page)
 {
-	compound_page_dtor *dtor;
-
 	if (!PageCompound(page))
 		return 0;
 
 	page = compound_head(page);
-	dtor = get_compound_page_dtor(page);
-
-	return dtor == free_huge_page;
+	return get_compound_page_dtor(page) == free_huge_page;
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
@@ -708,14 +704,10 @@ EXPORT_SYMBOL_GPL(PageHuge);
  */
 int PageHeadHuge(struct page *page_head)
 {
-	compound_page_dtor *dtor;
-
 	if (!PageHead(page_head))
 		return 0;
 
-	dtor = get_compound_page_dtor(page_head);
-
-	return dtor == free_huge_page;
+	return get_compound_page_dtor(page_head) == free_huge_page;
 }
 EXPORT_SYMBOL_GPL(PageHeadHuge);
 
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -82,19 +82,6 @@ static void __put_compound_page(struct page *page)
>  
>  static void put_compound_page(struct page *page)

This function has become quite crazy.  I sat down to refamiliarize but
immediately failed.

: static void put_compound_page(struct page *page)
: {
: 	if (unlikely(PageTail(page))) {
:	...
: 	} else if (put_page_testzero(page)) {
: 		if (PageHead(page))

How can a page be both PageTail() and PageHead()?

: 			__put_compound_page(page);
: 		else
: 			__put_single_page(page);
: 	}
: }
: 
: 


* Re: [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs
  2013-11-15 17:47 ` [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs Andrea Arcangeli
  2013-11-19 21:27   ` Khalid Aziz
@ 2013-11-19 23:14   ` Andrew Morton
  2013-11-20  0:20     ` Andrea Arcangeli
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2013-11-19 23:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

On Fri, 15 Nov 2013 18:47:48 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:

> This skips the _mapcount mangling for slab and hugetlbfs pages.
> 
> The main trouble in doing this is to guarantee that PageSlab and
> PageHeadHuge remain constant for all get_page/put_page calls run on
> the tail of slab or hugetlbfs compound pages. Otherwise, if they're
> set during get_page but not during put_page, the _mapcount of the
> tail page would underflow.
>
> PageHeadHuge will remain true until the compound page is released and
> enters the buddy allocator, so there's no risk of it changing even if
> the tail pin is the last reference left on the page.
>
> PG_slab instead is cleared before the slab frees the head page with
> put_page, so if the tail pin were released after the slab freed the
> page, we would have a problem. But in the slab case the tail pin
> cannot be the last reference left on the page. This is because the
> slab code is free to reuse the compound page after a
> kfree/kmem_cache_free without having to check if there's any tail pin
> left. In turn, all tail pins must always be released while the head is
> still pinned by the slab code, and so we know PG_slab will still be
> set too.
> 
> ...
>
> +					if (put_page_testzero(page_head)) {
> +						/*
> +						 * If this is the tail
> +						 * of a slab
> +						 * compound page, the
> +						 * tail pin must not
> +						 * be the last
> +						 * reference held on
> +						 * the page, because
> +						 * the PG_slab cannot
> +						 * be cleared before
> +						 * all tail pins
> +						 * (which skips the
> +						 * _mapcount tail
> +						 * refcounting) have
> +						 * been released. For
> +						 * hugetlbfs the tail
> +						 * pin may be the last
> +						 * reference on the
> +						 * page instead,
> +						 * because
> +						 * PageHeadHuge will
> +						 * not go away until
> +						 * the compound page
> +						 * enters the buddy
> +						 * allocator.
> +						 */
> +						VM_BUG_ON(PageSlab(page_head));

This looks like it was attacked by Lindent.  How's this look?


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/swap.c: reorganize put_compound_page()

Tweak it to save a tab stop and make the code layout slightly less nutty.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |  238 +++++++++++++++++++++++-----------------------------
 1 file changed, 107 insertions(+), 131 deletions(-)

diff -puN mm/swap.c~mm-swapc-reorganize-put_compound_page mm/swap.c
--- a/mm/swap.c~mm-swapc-reorganize-put_compound_page
+++ a/mm/swap.c
@@ -82,150 +82,126 @@ static void __put_compound_page(struct p
 
 static void put_compound_page(struct page *page)
 {
-	if (unlikely(PageTail(page))) {
-		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = compound_trans_head(page);
-
-		if (likely(page != page_head &&
-			   get_page_unless_zero(page_head))) {
-			unsigned long flags;
-
-			/*
-			 * THP can not break up slab pages or
-			 * hugetlbfs pages so avoid taking
-			 * compound_lock() and skip the tail page
-			 * refcounting (in _mapcount) too. Slab
-			 * performs non-atomic bit ops on page->flags
-			 * for better performance. In particular
-			 * slab_unlock() in slub used to be a hot
-			 * path. It is still hot on arches that do not
-			 * support this_cpu_cmpxchg_double().
-			 */
-			if (PageSlab(page_head) || PageHeadHuge(page_head)) {
-				if (likely(PageTail(page))) {
-					/*
-					 * __split_huge_page_refcount
-					 * cannot race here.
-					 */
-					VM_BUG_ON(!PageHead(page_head));
-					VM_BUG_ON(atomic_read(&page->_mapcount)
-						  != -1);
-					if (put_page_testzero(page_head))
-						VM_BUG_ON(1);
-					if (put_page_testzero(page_head)) {
-						/*
-						 * If this is the tail
-					 * of a slab
-						 * compound page, the
-						 * tail pin must not
-						 * be the last
-						 * reference held on
-						 * the page, because
-						 * the PG_slab cannot
-						 * be cleared before
-						 * all tail pins
-						 * (which skips the
-						 * _mapcount tail
-						 * refcounting) have
-						 * been released. For
-						 * hugetlbfs the tail
-						 * pin may be the last
-						 * reference on the
-						 * page instead,
-						 * because
-						 * PageHeadHuge will
-						 * not go away until
-						 * the compound page
-						 * enters the buddy
-						 * allocator.
-						 */
-						VM_BUG_ON(PageSlab(page_head));
-						__put_compound_page(page_head);
-					}
-					return;
-				} else
-					/*
-					 * __split_huge_page_refcount
-					 * run before us, "page" was a
-					 * THP tail. The split
-					 * page_head has been freed
-					 * and reallocated as slab or
-					 * hugetlbfs page of smaller
-					 * order (only possible if
-					 * reallocated as slab on
-					 * x86).
-					 */
-					goto skip_lock;
-			}
-			/*
-			 * page_head wasn't a dangling pointer but it
-			 * may not be a head page anymore by the time
-			 * we obtain the lock. That is ok as long as it
-			 * can't be freed from under us.
-			 */
-			flags = compound_lock_irqsave(page_head);
-			if (unlikely(!PageTail(page))) {
-				/* __split_huge_page_refcount run before us */
-				compound_unlock_irqrestore(page_head, flags);
-skip_lock:
+	struct page *page_head;
+
+	if (likely(PageTail(page))) {
+		if (put_page_testzero(page)) {
+			if (PageHead(page))
+				__put_compound_page(page);
+			else
+				__put_single_page(page);
+		}
+		return;
+	}
+
+	/* __split_huge_page_refcount can run under us */
+	page_head = compound_trans_head(page);
+
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		unsigned long flags;
+
+		/*
+		 * THP can not break up slab pages or hugetlbfs pages so avoid
+		 * taking compound_lock() and skip the tail page refcounting
+		 * (in _mapcount) too. Slab performs non-atomic bit ops on
+		 * page->flags for better performance. In particular
+		 * slab_unlock() in slub used to be a hot path. It is still hot
+		 * on arches that do not support this_cpu_cmpxchg_double().
+		 */
+		if (PageSlab(page_head) || PageHeadHuge(page_head)) {
+			if (likely(PageTail(page))) {
+				/*
+				 * __split_huge_page_refcount cannot race here.
+				 */
+				VM_BUG_ON(!PageHead(page_head));
+				VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
+				if (put_page_testzero(page_head))
+					VM_BUG_ON(1);
 				if (put_page_testzero(page_head)) {
 					/*
-					 * The head page may have been
-					 * freed and reallocated as a
-					 * compound page of smaller
-					 * order and then freed again.
-					 * All we know is that it
-					 * cannot have become: a THP
-					 * page, a compound page of
-					 * higher order, a tail page.
-					 * That is because we still
-					 * hold the refcount of the
-					 * split THP tail and
-					 * page_head was the THP head
-					 * before the split.
+					 * If this is the tail of a slab
+					 * compound page, the tail pin must not
+					 * be the last reference held on the
+					 * page, because the PG_slab cannot be
+					 * cleared before all tail pins (which
+					 * skips the _mapcount tail refcounting)
+					 * have been released. For hugetlbfs the
+					 * tail pin may be the last reference on
+					 * the page instead, because
+					 * PageHeadHuge will not go away until
+					 * the compound page enters the buddy
+					 * allocator.
 					 */
-					if (PageHead(page_head))
-						__put_compound_page(page_head);
-					else
-						__put_single_page(page_head);
+					VM_BUG_ON(PageSlab(page_head));
+					__put_compound_page(page_head);
 				}
-out_put_single:
-				if (put_page_testzero(page))
-					__put_single_page(page);
 				return;
-			}
-			VM_BUG_ON(page_head != page->first_page);
-			/*
-			 * We can release the refcount taken by
-			 * get_page_unless_zero() now that
-			 * __split_huge_page_refcount() is blocked on
-			 * the compound_lock.
-			 */
-			if (put_page_testzero(page_head))
-				VM_BUG_ON(1);
-			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(page_mapcount(page) <= 0);
-			atomic_dec(&page->_mapcount);
-			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
-			VM_BUG_ON(atomic_read(&page->_count) != 0);
+			} else
+				/*
+				 * __split_huge_page_refcount run before us,
+				 * "page" was a THP tail. The split page_head
+				 * has been freed and reallocated as slab or
+				 * hugetlbfs page of smaller order (only
+				 * possible if reallocated as slab on x86).
+				 */
+				goto skip_lock;
+		}
+		/*
+		 * page_head wasn't a dangling pointer but it may not be a head
+		 * page anymore by the time we obtain the lock. That is ok as
+		 * long as it can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		if (unlikely(!PageTail(page))) {
+			/* __split_huge_page_refcount run before us */
 			compound_unlock_irqrestore(page_head, flags);
-
+skip_lock:
 			if (put_page_testzero(page_head)) {
+				/*
+				 * The head page may have been freed and
+				 * reallocated as a compound page of smaller
+				 * order and then freed again.  All we know is
+				 * that it cannot have become: a THP page, a
+				 * compound page of higher order, a tail page.
+				 * That is because we still hold the refcount of
+				 * the split THP tail and page_head was the THP
+				 * head before the split.
+				 */
 				if (PageHead(page_head))
 					__put_compound_page(page_head);
 				else
 					__put_single_page(page_head);
 			}
-		} else {
-			/* page_head is a dangling pointer */
-			VM_BUG_ON(PageTail(page));
-			goto out_put_single;
+out_put_single:
+			if (put_page_testzero(page))
+				__put_single_page(page);
+			return;
+		}
+		VM_BUG_ON(page_head != page->first_page);
+		/*
+		 * We can release the refcount taken by get_page_unless_zero()
+		 * now that __split_huge_page_refcount() is blocked on the
+		 * compound_lock.
+		 */
+		if (put_page_testzero(page_head))
+			VM_BUG_ON(1);
+		/* __split_huge_page_refcount will wait now */
+		VM_BUG_ON(page_mapcount(page) <= 0);
+		atomic_dec(&page->_mapcount);
+		VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+		VM_BUG_ON(atomic_read(&page->_count) != 0);
+		compound_unlock_irqrestore(page_head, flags);
+
+		if (put_page_testzero(page_head)) {
+			if (PageHead(page_head))
+				__put_compound_page(page_head);
+			else
+				__put_single_page(page_head);
 		}
-	} else if (put_page_testzero(page)) {
-		if (PageHead(page))
-			__put_compound_page(page);
-		else
-			__put_single_page(page);
+	} else {
+		/* page_head is a dangling pointer */
+		VM_BUG_ON(PageTail(page));
+		goto out_put_single;
 	}
 }
 
_



* Re: [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs
  2013-11-19 23:14   ` Andrew Morton
@ 2013-11-20  0:20     ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-20  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

Hi Andrew!

On Tue, Nov 19, 2013 at 03:14:16PM -0800, Andrew Morton wrote:
> This looks like it was attacked by Lindent.  How's this look?

Indeed, I didn't try to optimize it, as 3/3 was mostly a test patch so
far.

> +	struct page *page_head;
> +
> +	if (likely(PageTail(page))) {

!PageTail here.

The current version would look like the below, which is less obviously
horrible than before :). We may also consider keeping your indent
cleanup incremental, as it's much easier to review the actual change
in code logic that way.

From 8a7bd3b25f7ee69b53a1a9e681790333df9504a6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 14 Nov 2013 20:17:38 +0100
Subject: [PATCH] mm: tail page refcounting optimization for slab and hugetlbfs

This skips the _mapcount mangling for slab and hugetlbfs pages.

The main trouble in doing this is to guarantee that PageSlab and
PageHeadHuge remain constant for all get_page/put_page calls run on
the tail of slab or hugetlbfs compound pages. Otherwise, if they're
set during get_page but not during put_page, the _mapcount of the
tail page would underflow.

PageHeadHuge will remain true until the compound page is released and
enters the buddy allocator, so there's no risk of it changing even if
the tail pin is the last reference left on the page.

PG_slab instead is cleared before the slab frees the head page with
put_page, so if the tail pin were released after the slab freed the
page, we would have a problem. But in the slab case the tail pin
cannot be the last reference left on the page. This is because the
slab code is free to reuse the compound page after a
kfree/kmem_cache_free without having to check if there's any tail pin
left. In turn, all tail pins must always be released while the head is
still pinned by the slab code, and so we know PG_slab will still be
set too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/hugetlb.h |  6 ------
 include/linux/mm.h      | 32 +++++++++++++++++++++++++++++++-
 mm/internal.h           |  3 ++-
 mm/swap.c               | 33 +++++++++++++++++++++++++++------
 4 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d4f3dbf..acd2010 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -31,7 +31,6 @@ struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 int PageHuge(struct page *page);
-int PageHeadHuge(struct page *page_head);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
@@ -105,11 +104,6 @@ static inline int PageHuge(struct page *page)
 	return 0;
 }
 
-static inline int PageHeadHuge(struct page *page_head)
-{
-	return 0;
-}
-
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0548eb2..6b20b34 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -414,15 +414,45 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+extern int PageHeadHuge(struct page *page_head);
+#else /* CONFIG_HUGETLB_PAGE */
+static inline int PageHeadHuge(struct page *page_head)
+{
+	return 0;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
+static inline bool __compound_tail_refcounted(struct page *page)
+{
+	return !PageSlab(page) && !PageHeadHuge(page);
+}
+
+/*
+ * This takes a head page as parameter and tells if the
+ * tail page reference counting can be skipped.
+ *
+ * For this to be safe, PageSlab and PageHeadHuge must remain true on
+ * any given page where they return true here, until all tail pins
+ * have been released.
+ */
+static inline bool compound_tail_refcounted(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	return __compound_tail_refcounted(page);
+}
+
 static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
+	 * In turn no need of compound_trans_head here.
 	 */
 	VM_BUG_ON(page_mapcount(page) < 0);
 	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
+	if (compound_tail_refcounted(compound_head(page)))
+		atomic_inc(&page->_mapcount);
 }
 
 extern bool __get_page_tail(struct page *page);
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa..a85a3ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -51,7 +51,8 @@ static inline void __get_page_tail_foll(struct page *page,
 	VM_BUG_ON(page_mapcount(page) < 0);
 	if (get_page_head)
 		atomic_inc(&page->first_page->_count);
-	atomic_inc(&page->_mapcount);
+	if (compound_tail_refcounted(page->first_page))
+		atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
index dbf5427..b4c49bf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -88,8 +88,9 @@ static void put_compound_page(struct page *page)
 
 		/*
 		 * THP can not break up slab pages so avoid taking
-		 * compound_lock(). Slab performs non-atomic bit ops
-		 * on page->flags for better performance. In
+		 * compound_lock() and skip the tail page refcounting
+		 * (in _mapcount) too. Slab performs non-atomic bit
+		 * ops on page->flags for better performance. In
 		 * particular slab_unlock() in slub used to be a hot
 		 * path. It is still hot on arches that do not support
 		 * this_cpu_cmpxchg_double().
@@ -102,7 +103,7 @@ static void put_compound_page(struct page *page)
 		 * PageTail clear after smp_rmb() and we'll treat it
 		 * as a single page.
 		 */
-		if (PageSlab(page_head) || PageHeadHuge(page_head)) {
+		if (!__compound_tail_refcounted(page_head)) {
 			/*
 			 * If "page" is a THP tail, we must read the tail page
 			 * flags after the head page flags. The
@@ -117,10 +118,30 @@ static void put_compound_page(struct page *page)
 				 * cannot race here.
 				 */
 				VM_BUG_ON(!PageHead(page_head));
-				VM_BUG_ON(page_mapcount(page) <= 0);
-				atomic_dec(&page->_mapcount);
-				if (put_page_testzero(page_head))
+				VM_BUG_ON(page_mapcount(page) != 0);
+				if (put_page_testzero(page_head)) {
+					/*
+					 * If this is the tail of a
+					 * slab compound page, the
+					 * tail pin must not be the
+					 * last reference held on the
+					 * page, because the PG_slab
+					 * cannot be cleared before
+					 * all tail pins (which skips
+					 * the _mapcount tail
+					 * refcounting) have been
+					 * released. For hugetlbfs the
+					 * tail pin may be the last
+					 * reference on the page
+					 * instead, because
+					 * PageHeadHuge will not go
+					 * away until the compound
+					 * page enters the buddy
+					 * allocator.
+					 */
+					VM_BUG_ON(PageSlab(page_head));
 					__put_compound_page(page_head);
+				}
 				return;
 			} else
 				/*


* Re: [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-19 23:11   ` Andrew Morton
@ 2013-11-20  0:26     ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2013-11-20  0:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Khalid Aziz, Pravin Shelar,
	Greg Kroah-Hartman, Ben Hutchings, Christoph Lameter,
	Johannes Weiner, Mel Gorman, Rik van Riel, Andi Kleen,
	Minchan Kim, Linus Torvalds

On Tue, Nov 19, 2013 at 03:11:46PM -0800, Andrew Morton wrote:
> This is all rather verbose.  How about we do this?
> 
> --- a/mm/hugetlb.c~mm-hugetlbc-simplify-pageheadhuge-and-pagehuge
> +++ a/mm/hugetlb.c
> @@ -690,15 +690,11 @@ static void prep_compound_gigantic_page(
>   */
>  int PageHuge(struct page *page)
>  {
> -	compound_page_dtor *dtor;
> -
>  	if (!PageCompound(page))
>  		return 0;
>  
>  	page = compound_head(page);
> -	dtor = get_compound_page_dtor(page);
> -
> -	return dtor == free_huge_page;
> +	return get_compound_page_dtor(page) == free_huge_page;
>  }
>  EXPORT_SYMBOL_GPL(PageHuge);
>  
> @@ -708,14 +704,10 @@ EXPORT_SYMBOL_GPL(PageHuge);
>   */
>  int PageHeadHuge(struct page *page_head)
>  {
> -	compound_page_dtor *dtor;
> -
>  	if (!PageHead(page_head))
>  		return 0;
>  
> -	dtor = get_compound_page_dtor(page_head);
> -
> -	return dtor == free_huge_page;
> +	return get_compound_page_dtor(page_head) == free_huge_page;
>  }
>  EXPORT_SYMBOL_GPL(PageHeadHuge);

Sure good idea!

> > @@ -82,19 +82,6 @@ static void __put_compound_page(struct page *page)
> >  
> >  static void put_compound_page(struct page *page)
> 
> This function has become quite crazy.  I sat down to refamiliarize but
> immediately failed.
> 
> : static void put_compound_page(struct page *page)
> : {
> : 	if (unlikely(PageTail(page))) {
> :	...
> : 	} else if (put_page_testzero(page)) {
> : 		if (PageHead(page))
> 
> How can a page be both PageTail() and PageHead()?

We execute the PageHead you quoted only if it's !PageTail. So then
PageHead tells us whether it's a compound head or not compound by the
time all reference counts have been released (by the time the last
reference is released it can't be split anymore).
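
Schematically (a condensed sketch of the function, not a verbatim
quote; put_compound_tail() is a made-up name standing in for the
whole tail-handling branch quoted earlier):

	static void put_compound_page(struct page *page)
	{
		if (unlikely(PageTail(page))) {
			/* the tail path, which never falls through */
			put_compound_tail(page);
			return;
		}
		/*
		 * Only !PageTail pages get here, and once the last
		 * reference is gone the page can't be split anymore,
		 * so PageHead() reliably distinguishes a compound
		 * head from a single page.
		 */
		if (put_page_testzero(page)) {
			if (PageHead(page))
				__put_compound_page(page);
			else
				__put_single_page(page);
		}
	}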

> 
> : 			__put_compound_page(page);
> : 		else
> : 			__put_single_page(page);
> : 	}
> : }
> : 
> : 
> 


Thread overview: 13+ messages
2013-11-15 17:47 [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Andrea Arcangeli
2013-11-15 17:47 ` [PATCH 1/3] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
2013-11-19 23:11   ` Andrew Morton
2013-11-20  0:26     ` Andrea Arcangeli
2013-11-15 17:47 ` [PATCH 2/3] mm: hugetlb: use get_page_foll in follow_hugetlb_page Andrea Arcangeli
2013-11-19 21:27   ` Khalid Aziz
2013-11-15 17:47 ` [PATCH 3/3] mm: tail page refcounting optimization for slab and hugetlbfs Andrea Arcangeli
2013-11-19 21:27   ` Khalid Aziz
2013-11-19 23:14   ` Andrew Morton
2013-11-20  0:20     ` Andrea Arcangeli
2013-11-18 18:04 ` [PATCH 0/3] mm: hugetlbfs: fix hugetlbfs optimization v2 Khalid Aziz
2013-11-19 20:27   ` Khalid Aziz
2013-11-19 22:52     ` Andrea Arcangeli
