* [PATCH 00 of 30] Transparent Hugepage support #3
@ 2010-01-21  6:20 Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 01 of 30] define MADV_HUGEPAGE Andrea Arcangeli
                   ` (31 more replies)
  0 siblings, 32 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

Hello,

this is the latest version of my patchset: it has all the cleanups requested in
the previous review on linux-mm, plus more fixes, and it ships the first working
khugepaged daemon.

This seems feature complete as far as KVM is concerned: "madvise" mode for
both /sys/kernel/mm/transparent_hugepage/enabled and
/sys/kernel/mm/transparent_hugepage/khugepaged/enabled is enough for
hypervisor use. The patchset defaults to "always" for both, to be sure the new
code gets exercised by all apps that could benefit from it (yes, khugepaged is
then transparently enabled on all mappings, with what I think is a
negligible/unmeasurable overhead, and it may become beneficial for long-lived
vmas with intensive computations on them).
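
To make the "madvise" mode concrete, here is a minimal userspace sketch (an
illustration only, not part of the series) of how a hypervisor-like app could
opt its guest ram into transparent hugepages; it assumes the sysfs path above
and the MADV_HUGEPAGE value from patch 01:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* from patch 01 of this series */
#endif

int main(void)
{
	size_t len = 1UL << 30;	/* 1G of "guest RAM" */
	void *ram;
	int fd;

	/* restrict hugepages to vmas that explicitly ask for them
	   (needs root; value per the sysfs interface described above) */
	fd = open("/sys/kernel/mm/transparent_hugepage/enabled", O_WRONLY);
	if (fd >= 0) {
		write(fd, "madvise", strlen("madvise"));
		close(fd);
	}

	ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* hint that this vma is worth backing with hugepages */
	if (madvise(ram, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(ram, 0, len);	/* page faults can now be satisfied with 2M pages */
	return 0;
}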

TODO (first things that come to mind):

- at least smaps should stop calling split_huge_page

- find a way to fix the lru statistics so they show the right amount of ram
  (the statistics code these days seems almost more complex than the really
  useful code; the isolated-lru counter in particular seems very dubious in its
  usefulness and is a further pain to deal with all over the VM). Fixing these
  counters is after all low priority, because I know of no app that is aware of
  the VM internals and depends on the exact size of the
  inactive/active/anon/file/unevictable lru lists. Fixing smaps not to split
  hugepages is much higher priority. The stats don't overflow or underflow,
  they're just not right.

- maybe add some other stat in addition to AnonHugePages in /proc/meminfo. You
  can monitor the effect of khugepaged, or of an mprotect calling
  split_huge_page, trivially with "grep Anon /proc/meminfo" (a minimal
  monitoring sketch follows this list)

- I need to integrate Mel's memory compaction code so it can be used by
  khugepaged and by the page faults when the "defrag" sysfs setting requires
  it. His results (especially with the bug fixes that decreased reclaim a lot)
  look promising.

- likely we'll need a slab front allocator too, allocating in 2M chunks, but
  this should be re-evaluated after merging Mel's work; maybe he already did
  that.

- khugepaged isn't yet capable of merging readonly shared anon pages; that
  isn't needed by KVM (KVM uses MADV_DONTFORK) but it might want to learn it
  for other apps

- khugepaged should also learn to skip the copy and collapse the hugepage
  in-place, if possible (to undo the effect of spurious split_huge_page)
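
The monitoring sketch referenced in the /proc/meminfo item above (again just an
illustration, not part of the series): it polls the Anon* lines so you can
watch khugepaged collapsing pages or split_huge_page undoing them.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "Anon", 4))
				fputs(line, stdout);	/* AnonPages / AnonHugePages */
		fclose(f);
		putchar('\n');
		sleep(10);	/* matches the default 10 second khugepaged sleep */
	}
}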

I'm leaving this under a continuous stress with scan_sleep_millisecs and
defrag_sleep_millisecs set to 0 and a 5G swap storm + ~4G in ram. The swap storm,
before settling in pause(), will call madvise to split all hugepages in ram and
then run a further memset to swap everything in a second time. Eventually it
will settle, and khugepaged will remerge as many hugepages as are fully mapped
in userland (mapped as swapcache is ok, but khugepaged will not trigger swapin
I/O or swapcache minor faults), provided enough non-fragmented hugepages are
available. A rough reconstruction of the stress loop is sketched below.
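
A rough sketch of the kind of stress loop meant above (not the exact test
program; it assumes a protection change on a 4k sub-range of each 2M region is
what forces split_huge_page):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)

int main(void)
{
	size_t len = 5UL << 30;		/* ~5G > ram, forces a swap storm */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);		/* first pass: everything swaps */

	/* a protection change on a 4k slice of each 2M region forces
	   split_huge_page on that hugepage; restore write access after */
	for (off = 0; off < len; off += HPAGE_SIZE) {
		mprotect(p + off, 4096, PROT_READ);
		mprotect(p + off, 4096, PROT_READ | PROT_WRITE);
	}

	memset(p, 2, len);		/* second pass: swap it all back in */
	pause();			/* let khugepaged remerge hugepages */
	return 0;
}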

This is shortly after start.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0  5 3938052  67688    208   3684 1219 2779  1239  2779  396 1061  1  5 75 20
2  5 3937092  61612    208   3712 25120 24112 25120 24112 7420 5396  0  8 44 48
0  5 3932116  55536    208   3780 26444 21468 26444 21468 7532 5399  0  8 52 40
0  5 3927264  46724    208   3296 28208 22528 28328 22528 7871 5722  0  7 52 41
AnonPages:       1751352 kB
AnonHugePages:   2021376 kB
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0  5 3935604  58092    208   3864 1233 2787  1253  2787  400 1061  1  5 74 20
0  5 3933924  54248    208   3508 23748 23548 23904 23548 7112 4829  0  6 49 45
1  4 3937708  60696    208   3704 24680 28680 24760 28680 7034 5112  0  8 50 42
1  4 3934508  59084    208   3304 24096 21020 24156 21020 6832 5015  0  7 48 46
AnonPages:       1746296 kB
AnonHugePages:   2023424 kB

This is after it settled and it's waiting in pause(). With defrag_sleep and
scan_sleep both set to 0, khugepaged, when it's not copying, just triggers a
super-overschedule (see the cs column), but as you can see it's still extremely
low overhead, only taking 8% of 4 cores, i.e. 32% of 1 core. Likely most of
that cpu is taken by schedule() itself. So you can imagine how low the overhead
is when the sleeps are set to a "production" level and not a stress-test level:
the default sleep is 10 seconds, not 2 usec...

1  0 5680228 106028    396   5060    0    0     0     0  534 341005  0  8 92  0
1  0 5680228 106028    396   5060    0    0     0     0  517 349159  0  9 91  0
1  0 5680228 106028    396   5060    0    0     0     0  518 346356  0  6 94  0
0  0 5680228 106028    396   5060    0    0     0     0  511 348478  0  8 92  0
AnonPages:        392396 kB
AnonHugePages:   3371008 kB

So it looks good so far.

I think it's probably time to port the patchset to mmotm.
Further review welcome!

Andrea


* [PATCH 01 of 30] define MADV_HUGEPAGE
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 02 of 30] compound_lock Andrea Arcangeli
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Define MADV_HUGEPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,6 +45,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 


* [PATCH 02 of 30] compound_lock
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 01 of 30] define MADV_HUGEPAGE Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 03 of 30] alter compound get_page/put_page Andrea Arcangeli
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/bit_spinlock.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -294,6 +295,16 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+	bit_spin_lock(PG_compound_lock, &page->flags);
+}
+
+static inline void compound_unlock(struct page *page)
+{
+	bit_spin_unlock(PG_compound_lock, &page->flags);
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,7 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+	PG_compound_lock,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */


* [PATCH 03 of 30] alter compound get_page/put_page
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 01 of 30] define MADV_HUGEPAGE Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 02 of 30] compound_lock Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21 17:35   ` Dave Hansen
  2010-01-21  6:20 ` [PATCH 04 of 30] clear compound mapping Andrea Arcangeli
                   ` (28 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Alter compound get_page/put_page to keep references on subpages too, in order
to allow __split_huge_page_refcount to split a hugepage even while subpages
have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -43,6 +43,14 @@ static noinline int gup_pte_range(pmd_t 
 		page = pte_page(pte);
 		if (!page_cache_get_speculative(page))
 			return 0;
+		if (PageTail(page)) {
+			/*
+			 * __split_huge_page_refcount() cannot run
+			 * from under us.
+			 */
+			VM_BUG_ON(atomic_read(&page->_count) < 0);
+			atomic_inc(&page->_count);
+		}
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(page);
 			return 0;
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -128,6 +128,14 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page)) {
+			/*
+			 * __split_huge_page_refcount() cannot run
+			 * from under us.
+			 */
+			VM_BUG_ON(atomic_read(&page->_count) < 0);
+			atomic_inc(&page->_count);
+		}
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -319,9 +319,14 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+		/* __split_huge_page_refcount can't run under get_page */
+		VM_BUG_ON(!PageTail(page));
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -409,7 +409,8 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 1 << PG_compound_lock)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,17 +55,80 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_page(page);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock(page_head);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock(page_head);
+			out_put_head:
+				put_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+			else
+				compound_unlock(page_head);
+			return;
+		} else
+			/* page_head is a dangling pointer */
+			goto out_put_single;
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -74,7 +137,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 


* [PATCH 04 of 30] clear compound mapping
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 03 of 30] alter compound get_page/put_page Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21 17:43   ` Dave Hansen
  2010-01-21  6:20 ` [PATCH 05 of 30] add native_set_pmd_at Andrea Arcangeli
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Clear the compound mapping for anonymous compound pages, as already happens for
regular anonymous pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -584,6 +584,8 @@ static void __free_pages_ok(struct page 
 
 	kmemcheck_free_shadow(page, order);
 
+	if (PageAnon(page))
+		page->mapping = NULL;
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
 	if (bad)


* [PATCH 05 of 30] add native_set_pmd_at
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 04 of 30] clear compound mapping Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 06 of 30] add pmd paravirt ops Andrea Arcangeli
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Used by both the paravirt and non-paravirt set_pmd_at.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -528,6 +528,12 @@ static inline void native_set_pte_at(str
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp , pmd_t pmd)
+{
+	native_set_pmd(pmdp, pmd);
+}
+
 #ifndef CONFIG_PARAVIRT
 /*
  * Rules for using pte_update - it must be called after any PTE update which


* [PATCH 06 of 30] add pmd paravirt ops
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 05 of 30] add native_set_pmd_at Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 07 of 30] no paravirt version of pmd ops Andrea Arcangeli
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add the pmd_update/pmd_update_defer/set_pmd_at paravirt ops. Not all may be
necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs
pmd_update_defer), but this keeps full symmetry with the pte paravirt ops,
which looks cleaner and simpler from a common code POV.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -449,6 +449,11 @@ static inline void pte_update(struct mm_
 {
 	PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
 }
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
 
 static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
@@ -456,6 +461,12 @@ static inline void pte_update_defer(stru
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
 static inline pte_t __pte(pteval_t val)
 {
 	pteval_t ret;
@@ -557,6 +568,16 @@ static inline void set_pte_at(struct mm_
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp, pmd_t pmd)
+{
+	if (sizeof(pmdval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+	else
+		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -266,10 +266,16 @@ struct pv_mmu_ops {
 	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+	void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+	void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp);
+	void (*pmd_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pmd_t *pmdp);
 
 	pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
+	.set_pmd_at = native_set_pmd_at,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+	.pmd_update = paravirt_nop,
+	.pmd_update_defer = paravirt_nop,
 
 	.ptep_modify_prot_start = __ptep_modify_prot_start,
 	.ptep_modify_prot_commit = __ptep_modify_prot_commit,


* [PATCH 07 of 30] no paravirt version of pmd ops
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 06 of 30] add pmd paravirt ops Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 08 of 30] export maybe_mkwrite Andrea Arcangeli
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -33,6 +33,7 @@ extern struct list_head pgd_list;
 #else  /* !CONFIG_PARAVIRT */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 #define set_pte_at(mm, addr, ptep, pte)	native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd)	native_set_pmd_at(mm, addr, pmdp, pmd)
 
 #define set_pte_atomic(ptep, pte)					\
 	native_set_pte_atomic(ptep, pte)
@@ -57,6 +58,8 @@ extern struct list_head pgd_list;
 
 #define pte_update(mm, addr, ptep)              do { } while (0)
 #define pte_update_defer(mm, addr, ptep)        do { } while (0)
+#define pmd_update(mm, addr, ptep)              do { } while (0)
+#define pmd_update_defer(mm, addr, ptep)        do { } while (0)
 
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)


* [PATCH 08 of 30] export maybe_mkwrite
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 07 of 30] no paravirt version of pmd ops Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 09 of 30] comment reminder in destroy_compound_page Andrea Arcangeli
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, when it falls back to copying hugepages into
regular fragmented pages if hugepage allocation fails during COW.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -380,6 +380,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1943,19 +1943,6 @@ static inline int pte_unmap_same(struct 
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*


* [PATCH 09 of 30] comment reminder in destroy_compound_page
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 08 of 30] export maybe_mkwrite Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 10 of 30] config_transparent_hugepage Andrea Arcangeli
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent
on its internal behavior.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -311,6 +311,7 @@ void prep_compound_page(struct page *pag
 	}
 }
 
+/* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;


* [PATCH 10 of 30] config_transparent_hugepage
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 09 of 30] comment reminder in destroy_compound_page Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 11 of 30] add pmd mangling functions to x86 Andrea Arcangeli
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add config option.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -283,3 +283,17 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config TRANSPARENT_HUGEPAGE
+	bool "Transparent Hugepage support" if EMBEDDED
+	depends on X86_64
+	default y
+	help
+	  Transparent Hugepages allows the kernel to use huge pages and
+	  huge tlb entries transparently to the applications whenever possible.
+	  This feature can improve computing performance for certain
+	  applications by speeding up page faults during memory
+	  allocation, by reducing the number of tlb misses and by speeding
+	  up the pagetable walking.
+
+	  If memory is constrained on an embedded system, you may want to say N.


* [PATCH 11 of 30] add pmd mangling functions to x86
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 10 of 30] config_transparent_hugepage Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21 17:47   ` Dave Hansen
  2010-01-21  6:20 ` [PATCH 12 of 30] add pmd mangling generic functions Andrea Arcangeli
                   ` (20 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add the needed pmd mangling functions, symmetric with their pte counterparts.
pmdp_splitting_flush is the only exception present only on the pmd side: it's
needed to serialize the VM against split_huge_page, and it atomically sets the
splitting bit in much the same way pmdp_clear_flush_young atomically clears the
accessed bit (both need to flush the tlb for the change to take effect, and for
pmdp_splitting_flush the flush must happen synchronously).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -95,11 +95,21 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
 static inline int pte_file(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_FILE;
@@ -150,6 +160,13 @@ static inline pte_t pte_set_flags(pte_t 
 	return native_make_pte(v | set);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
 static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 {
 	pteval_t v = native_pte_val(pte);
@@ -157,6 +174,13 @@ static inline pte_t pte_clear_flags(pte_
 	return native_make_pte(v & ~clear);
 }
 
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -167,11 +191,21 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkexec(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_NX);
@@ -182,16 +216,36 @@ static inline pte_t pte_mkdirty(pte_t pt
 	return pte_set_flags(pte, _PAGE_DIRTY);
 }
 
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
 static inline pte_t pte_mkyoung(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
 static inline pte_t pte_mkwrite(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_PSE);
@@ -320,6 +374,11 @@ static inline int pte_same(pte_t a, pte_
 	return a.pte == b.pte;
 }
 
+static inline int pmd_same(pmd_t a, pmd_t b)
+{
+	return a.pmd == b.pmd;
+}
+
 static inline int pte_present(pte_t a)
 {
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
@@ -351,7 +410,7 @@ static inline unsigned long pmd_page_vad
  * Currently stuck as a macro due to indirect forward reference to
  * linux/mmzone.h's __section_mem_map_addr() definition:
  */
-#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
 
 /*
  * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
@@ -372,6 +431,7 @@ static inline unsigned long pmd_index(un
  * to linux/mm.h:page_to_nid())
  */
 #define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
 /*
  * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE]
@@ -568,14 +628,21 @@ struct vm_area_struct;
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 				     unsigned long addr, pte_t *ptep);
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
@@ -586,6 +653,14 @@ static inline pte_t ptep_get_and_clear(s
 	return pte;
 }
 
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep,
@@ -612,6 +687,16 @@ static inline void ptep_set_wrprotect(st
 	pte_update(mm, addr, ptep);
 }
 
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+	pmd_update(mm, addr, pmdp);
+}
+
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -71,6 +71,18 @@ static inline pte_t native_ptep_get_and_
 	return ret;
 #endif
 }
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+#ifdef CONFIG_SMP
+	return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+	/* native_local_pmdp_get_and_clear,
+	   but duplicated because of cyclic dependency */
+	pmd_t ret = *xp;
+	native_pmd_clear(NULL, 0, xp);
+	return ret;
+#endif
+}
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -288,6 +288,23 @@ int ptep_set_access_flags(struct vm_area
 	return changed;
 }
 
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+	int changed = !pmd_same(*pmdp, entry);
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	if (changed && dirty) {
+		*pmdp = entry;
+		pmd_update_defer(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+
+	return changed;
+}
+
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *ptep)
 {
@@ -303,6 +320,21 @@ int ptep_test_and_clear_young(struct vm_
 	return ret;
 }
 
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp)
+{
+	int ret = 0;
+
+	if (pmd_young(*pmdp))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					 (unsigned long *) &pmdp->pmd);
+
+	if (ret)
+		pmd_update(vma->vm_mm, addr, pmdp);
+
+	return ret;
+}
+
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
@@ -315,6 +347,34 @@ int ptep_clear_flush_young(struct vm_are
 	return young;
 }
 
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+
+	return young;
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp)
+{
+	int set;
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
+				(unsigned long *)&pmdp->pmd);
+	if (set) {
+		pmd_update(vma->vm_mm, address, pmdp);
+		/* need tlb flush only to serialize against gup-fast */
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+}
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve


* [PATCH 12 of 30] add pmd mangling generic functions
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 11 of 30] add pmd mangling functions to x86 Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 13 of 30] special pmd_trans_* functions Andrea Arcangeli
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Some are needed to build but not actually used on archs not supporting
transparent hugepages. Others like pmdp_clear_flush are used by x86 too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -23,6 +23,19 @@
 	}								  \
 	__changed;							  \
 })
+
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({								\
+		int __changed = !pmd_same(*(__pmdp), __entry);		\
+		VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);		\
+		if (__changed) {					\
+			set_pmd_at((__vma)->vm_mm, __address, __pmdp,	\
+				   __entry);				\
+			flush_tlb_range(__vma, __address,		\
+					(__address) + HPAGE_PMD_SIZE);	\
+		}							\
+		__changed;						\
+	})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
@@ -37,6 +50,17 @@
 			   (__ptep), pte_mkold(__pte));			\
 	r;								\
 })
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	int r = 1;							\
+	if (!pmd_young(__pmd))						\
+		r = 0;							\
+	else								\
+		set_pmd_at((__vma)->vm_mm, (__address),			\
+			   (__pmdp), pmd_mkold(__pmd));			\
+	r;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -48,6 +72,16 @@
 		flush_tlb_page(__vma, __address);			\
 	__young;							\
 })
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__young = pmdp_test_and_clear_young(__vma, __address, __pmdp);	\
+	if (__young)							\
+		flush_tlb_range(__vma, __address,			\
+				(__address) + HPAGE_PMD_SIZE);		\
+	__young;							\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -57,6 +91,13 @@
 	pte_clear((__mm), (__address), (__ptep));			\
 	__pte;								\
 })
+
+#define pmdp_get_and_clear(__mm, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	pmd_clear((__mm), (__address), (__pmdp));			\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
@@ -88,6 +129,15 @@ do {									\
 	flush_tlb_page(__vma, __address);				\
 	__pte;								\
 })
+
+#define pmdp_clear_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp);	\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -97,10 +147,26 @@ static inline void ptep_set_wrprotect(st
 	pte_t old_pte = *ptep;
 	set_pte_at(mm, address, ptep, pte_wrprotect(old_pte));
 }
+
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+
+#define pmdp_splitting_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = pmd_mksplitting(*(__pmdp));			\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd);		\
+	/* tlb flush only to serialize against gup-fast */		\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)	(pte_val(A) == pte_val(B))
+#define pmd_same(A,B)	(pmd_val(A) == pmd_val(B))
 #endif
 
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY


* [PATCH 13 of 30] special pmd_trans_* functions
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 12 of 30] add pmd mangling generic functions Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 14 of 30] bail out gup_fast on splitting pmd Andrea Arcangeli
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, to allow
gcc to eliminate the transparent hugepage function calls at compile time
without additional #ifdefs (only the prototypes of those functions have to be
visible to gcc, but they won't be required at link time and huge_memory.o need
not be built at all). A small userspace illustration of the trick follows the
patch.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -394,6 +394,24 @@ static inline int pmd_present(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_PRESENT;
 }
 
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+#else
+	return 0;
+#endif
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_val(pmd) & _PAGE_PSE;
+#else
+	return 0;
+#endif
+}
+
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK

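A self-contained userspace analog of this trick (the names below are made up
for the illustration, they are not the kernel's): built with gcc -O2 and
without -DCONFIG_TRANSPARENT_HUGEPAGE, the branch and the call fold away and
the program links even though handle_huge_fault() is never defined; with the
define it needs a real implementation, just like the kernel needs
huge_memory.o. The elimination relies on optimization being enabled, as it is
for kernel builds.

#include <stdio.h>

struct pmd { unsigned long val; };

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define PMD_PSE (1UL << 7)
static inline int pmd_trans_huge(struct pmd pmd)
{
	return pmd.val & PMD_PSE;
}
#else
static inline int pmd_trans_huge(struct pmd pmd)
{
	return 0;	/* constant: callers' huge branches become dead code */
}
#endif

/* only declared: the object providing it is "not built at all" here */
int handle_huge_fault(struct pmd pmd);

static int handle_normal_fault(struct pmd pmd)
{
	return printf("normal fault path, pmd=%lx\n", pmd.val);
}

int main(void)
{
	struct pmd pmd = { 0 };

	if (pmd_trans_huge(pmd))
		/* eliminated by gcc when the "config option" is off, so no
		   undefined reference to handle_huge_fault at link time */
		return handle_huge_fault(pmd);
	return handle_normal_fault(pmd);
}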

* [PATCH 14 of 30] bail out gup_fast on splitting pmd
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 13 of 30] special pmd_trans_* functions Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 15 of 30] pte alloc trans splitting Andrea Arcangeli
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Force gup_fast to take the slow path and block if the pmd is splitting, not
only if it's none.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -156,7 +156,18 @@ static int gup_pmd_range(pud_t pud, unsi
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		/*
+		 * The pmd_trans_splitting() check below explains why
+		 * pmdp_splitting_flush has to flush the tlb, to stop
+		 * this gup-fast code from running while we set the
+		 * splitting bit in the pmd. Returning zero will take
+		 * the slow path that will call wait_split_huge_page()
+		 * if the pmd is still in splitting state. gup-fast
+		 * can't because it has irq disabled and
+		 * wait_split_huge_page() would never return as the
+		 * tlb flush IPI wouldn't run.
+		 */
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))


* [PATCH 15 of 30] pte alloc trans splitting
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 14 of 30] bail out gup_fast on splitting pmd Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 16 of 30] add pmd mmu_notifier helpers Andrea Arcangeli
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

pte alloc routines must wait for split_huge_page if the pmd is not
present and not null (i.e. pmd_trans_splitting). The additional
branches are optimized away at compile time by pmd_trans_splitting if
the config option is off. However we must pass the vma down in order
to know the anon_vma lock to wait for.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -948,7 +948,8 @@ static inline int __pmd_alloc(struct mm_
 int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address);
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 /*
@@ -1017,12 +1018,14 @@ static inline void pgtable_page_dtor(str
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc_map(mm, pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
-		NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address)				\
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma,	\
+							pmd, address))?	\
+	 NULL: pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL,	\
+							pmd, address))?	\
 		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -324,9 +324,11 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
+	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -346,14 +348,18 @@ int __pte_alloc(struct mm_struct *mm, pm
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	wait_split_huge_page = 0;
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	}
+	} else if (unlikely(pmd_trans_splitting(*pmd)))
+		wait_split_huge_page = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
+	if (wait_split_huge_page)
+		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -366,10 +372,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	}
+	} else
+		VM_BUG_ON(pmd_trans_splitting(*pmd));
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3020,7 +3027,7 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
 
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru
 	return pmd;
 }
 
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	if (!pmd)
 		return NULL;
 
-	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+	if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
 		return NULL;
 
 	return pmd;
@@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 16 of 30] add pmd mmu_notifier helpers
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 15 of 30] pte alloc trans splitting Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 17 of 30] clear page compound Andrea Arcangeli
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add mmu notifier helpers to handle pmd huge operations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
 	__pte;								\
 })
 
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+	__pmd;								\
+})
+
+#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+})
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
 	__young;							\
 })
 
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
 #define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
 ({									\
 	struct mm_struct *___mm = __mm;					\
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
 }
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 17 of 30] clear page compound
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 16 of 30] add pmd mmu_notifier helpers Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 18 of 30] add pmd_huge_pte to mm_struct Andrea Arcangeli
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page must transform a compound page into regular pages and
therefore needs ClearPageCompound.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -347,7 +347,7 @@ static inline void set_page_writeback(st
  * tests can be used in performance sensitive paths. PageCompound is
  * generally not used in hot code paths.
  */
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
 
 static inline int PageCompound(struct page *page)
@@ -355,6 +355,13 @@ static inline int PageCompound(struct pa
 	return page->flags & ((1L << PG_head) | (1L << PG_tail));
 
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
+}
+#endif
 #else
 /*
  * Reduce page flag use as much as possible by overlapping
@@ -392,6 +399,14 @@ static inline void __ClearPageTail(struc
 	page->flags &= ~PG_head_tail_mask;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON((page->flags & PG_head_tail_mask) != (1L << PG_compound));
+	clear_bit(PG_compound, &page->flags);
+}
+#endif
+
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_MMU


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 18 of 30] add pmd_huge_pte to mm_struct
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 17 of 30] clear page compound Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 19 of 30] ensure mapcount is taken on head pages Andrea Arcangeli
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

This increases the size of the mm struct a bit, but it is needed to
preallocate one pte page table for each hugepage so that split_huge_page
will not require a failure path. Guaranteed success is a fundamental
property of split_huge_page: it avoids decreasing swapping reliability
and it avoids adding -ENOMEM failure paths that would otherwise force the
hugepage-unaware VM code to learn to roll back in the middle of its pte
mangling operations (if anything, we want that code to learn to handle
pmd_trans_huge natively rather than to become capable of rollback). When
split_huge_page runs, a preallocated pte page table is needed for the
split to succeed, to map the newly split regular pages with regular ptes.
This way all existing VM code remains backwards compatible by just adding
a split_huge_page* one-liner. The memory waste of those preallocated pte
page tables is negligible and so it is worth it.
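
To illustrate the idea, here is a minimal userspace sketch of the
bookkeeping only (the names are illustrative and this is not the kernel
code: the real implementation is prepare_pmd_huge_pte()/get_pmd_huge_pte()
in the "transparent hugepage core" patch, where the stashed objects are
pgtable_t pages linked through page->lru and protected by
mm->page_table_lock):

============
#include <assert.h>
#include <stdlib.h>

/* stand-ins for pgtable_t and for mm_struct's pmd_huge_pte field */
struct pgtable { struct pgtable *next; };
struct mm { struct pgtable *pmd_huge_pte; };

/* deposit when the huge pmd is installed: -ENOMEM is still possible here */
static int deposit_pgtable(struct mm *mm)
{
	struct pgtable *new = malloc(sizeof(*new));
	if (!new)
		return -1;
	new->next = mm->pmd_huge_pte;
	mm->pmd_huge_pte = new;
	return 0;
}

/* withdraw at split_huge_page time: never allocates, so it cannot fail */
static struct pgtable *withdraw_pgtable(struct mm *mm)
{
	struct pgtable *pgtable = mm->pmd_huge_pte;
	assert(pgtable);	/* guaranteed by the deposit at fault time */
	mm->pmd_huge_pte = pgtable->next;
	return pgtable;
}

int main(void)
{
	struct mm mm = { NULL };
	if (deposit_pgtable(&mm))	/* huge page fault: may fail with OOM */
		return 1;
	free(withdraw_pgtable(&mm));	/* split_huge_page: always succeeds */
	return 0;
}
============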

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -291,6 +291,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -498,6 +498,9 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON(mm->pmd_huge_pte);
+#endif
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -638,6 +641,10 @@ struct mm_struct *dup_mm(struct task_str
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	mm->pmd_huge_pte = NULL;
+#endif
+
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 19 of 30] ensure mapcount is taken on head pages
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 18 of 30] add pmd_huge_pte to mm_struct Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 20 of 30] split_huge_page_mm/vma Andrea Arcangeli
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Unlike the page count, the page mapcount cannot be taken on PageTail compound
pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -105,6 +105,7 @@ void page_remove_rmap(struct page *);
 
 static inline void page_dup_rmap(struct page *page)
 {
+	VM_BUG_ON(PageTail(page));
 	atomic_inc(&page->_mapcount);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -674,6 +674,7 @@ void page_add_anon_rmap(struct page *pag
 	struct vm_area_struct *vma, unsigned long address)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
+	VM_BUG_ON(PageTail(page));
 	if (first)
 		__inc_zone_page_state(page, NR_ANON_PAGES);
 	if (unlikely(PageKsm(page)))
@@ -701,6 +702,7 @@ void page_add_new_anon_rmap(struct page 
 	struct vm_area_struct *vma, unsigned long address)
 {
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	VM_BUG_ON(PageTail(page));
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	__inc_zone_page_state(page, NR_ANON_PAGES);
@@ -733,6 +735,7 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
+	VM_BUG_ON(PageTail(page));
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
@@ -1281,6 +1284,7 @@ static int rmap_walk_file(struct page *p
 int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
 {
+	VM_BUG_ON(PageTail(page));
 	VM_BUG_ON(!PageLocked(page));
 
 	if (unlikely(PageKsm(page)))


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 20 of 30] split_huge_page_mm/vma
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 19 of 30] ensure mapcount is taken on head pages Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 21 of 30] split_huge_page paging Andrea Arcangeli
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page_mm/vma compat code. Each of these callers would need to
be expanded into hundreds of lines of complex code without a fully
reliable split_huge_page_mm/vma functionality.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
+	split_huge_page_mm(mm, 0xA0000, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,6 +446,7 @@ static inline int check_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_vma(vma, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -132,6 +132,7 @@ static long do_mincore(unsigned long add
 	if (pud_none_or_clear_bad(pud))
 		goto none_mapped;
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_vma(vma, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto none_mapped;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -89,6 +89,7 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,6 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_mm(mm, addr, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(walk->mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 21 of 30] split_huge_page paging
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 20 of 30] split_huge_page_mm/vma Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 22 of 30] pmd_trans_huge migrate bugcheck Andrea Arcangeli
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Paging logic that splits the page before it is unmapped and added to swap,
to ensure backwards compatibility with the legacy swap code. Eventually
swap should natively page out hugepages, to increase performance and to
decrease seeking and fragmentation of swap space. swapoff can just skip
over huge pmds as they cannot be part of swap yet. In add_to_swap, be
careful to split the page only after a valid swap entry has been obtained,
so that hugepages are not split when swap is already full.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -378,6 +378,8 @@ static void collect_procs_anon(struct pa
 	struct task_struct *tsk;
 	struct anon_vma *av;
 
+	if (unlikely(split_huge_page(page)))
+		return;
 	read_lock(&tasklist_lock);
 	av = page_lock_anon_vma(page);
 	if (av == NULL)	/* Not actually mapped anymore */
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1178,6 +1178,10 @@ int try_to_unmap(struct page *page, enum
 
 	BUG_ON(!PageLocked(page));
 
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			return SWAP_AGAIN;
+
 	if (unlikely(PageKsm(page)))
 		ret = try_to_unmap_ksm(page, flags);
 	else if (PageAnon(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -156,6 +156,12 @@ int add_to_swap(struct page *page)
 	if (!entry.val)
 		return 0;
 
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page))) {
+			swapcache_free(entry, NULL);
+			return 0;
+		}
+
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -905,6 +905,8 @@ static inline int unuse_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (unlikely(pmd_trans_huge(*pmd)))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 22 of 30] pmd_trans_huge migrate bugcheck
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 21 of 30] split_huge_page paging Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21 20:40   ` Christoph Lameter
  2010-01-21  6:20 ` [PATCH 23 of 30] clear_copy_huge_page Andrea Arcangeli
                   ` (9 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in areas covered by migration
ptes, because try_to_unmap splits the hugepage before migration ptes are
instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -99,6 +99,7 @@ static int remove_migration_pte(struct p
 		goto out;
 
 	pmd = pmd_offset(pud, addr);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (!pmd_present(*pmd))
 		goto out;
 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 23 of 30] clear_copy_huge_page
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 22 of 30] pmd_trans_huge migrate bugcheck Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 24 of 30] kvm mmu transparent hugepage support Andrea Arcangeli
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1379,5 +1379,14 @@ extern void shake_page(struct page *p, i
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_huge_page(struct page *page,
+			    unsigned long addr,
+			    unsigned int pages_per_huge_page);
+extern void copy_huge_page(struct page *dst, struct page *src,
+			   unsigned long addr, struct vm_area_struct *vma,
+			   unsigned int pages_per_huge_page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,70 +385,6 @@ static int vma_has_reserves(struct vm_ar
 	return 0;
 }
 
-static void clear_gigantic_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-	struct page *p = page;
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) {
-		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
-	}
-}
-static void clear_huge_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-
-	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, sz);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++) {
-		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
-	}
-}
-
-static void copy_gigantic_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); ) {
-		cond_resched();
-		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-static void copy_huge_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-
-	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		cond_resched();
-		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
-	}
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2334,7 +2270,8 @@ retry_avoidcopy:
 		return -PTR_ERR(new_page);
 	}
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_huge_page(new_page, old_page, address, vma,
+		       pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3396,3 +3396,73 @@ void might_fault(void)
 }
 EXPORT_SYMBOL(might_fault);
 #endif
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+static void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *p = page;
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page;
+	     i++, p = mem_map_next(p, page, i)) {
+		cond_resched();
+		clear_user_highpage(p, addr + i * PAGE_SIZE);
+	}
+}
+void clear_huge_page(struct page *page,
+		     unsigned long addr, unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		clear_gigantic_page(page, addr, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+	}
+}
+
+static void copy_gigantic_page(struct page *dst, struct page *src,
+			       unsigned long addr,
+			       struct vm_area_struct *vma,
+			       unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; ) {
+		cond_resched();
+		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+void copy_huge_page(struct page *dst, struct page *src,
+		    unsigned long addr, struct vm_area_struct *vma,
+		    unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src, addr, vma, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE,
+				   vma);
+	}
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 24 of 30] kvm mmu transparent hugepage support
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 23 of 30] clear_copy_huge_page Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 25 of 30] transparent hugepage core Andrea Arcangeli
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Marcelo Tosatti <mtosatti@redhat.com>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -489,6 +489,15 @@ static int host_mapping_level(struct kvm
 out:
 	up_read(&current->mm->mmap_sem);
 
+	/* check for transparent hugepages */
+	if (page_size == PAGE_SIZE) {
+		struct page *page = gfn_to_page(kvm, gfn);
+
+		if (!is_error_page(page) && PageHead(page))
+			page_size = KVM_HPAGE_SIZE(2);
+		kvm_release_page_clean(page);
+	}
+
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
 		if (page_size >= KVM_HPAGE_SIZE(i))


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 25 of 30] transparent hugepage core
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 24 of 30] kvm mmu transparent hugepage support Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 26 of 30] madvise(MADV_HUGEPAGE) Andrea Arcangeli
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated onto hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the
   kernel daemon when the order=HPAGE_PMD_SHIFT-PAGE_SHIFT free list
   becomes non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be OK
   for 1 machine with 1 database with 1 database cache with 1 database
   cache size known at boot time. It's definitely not feasible with a
   virtualization hypervisor usage like RHEV-H that runs an unknown number
   of virtual machines with an unknown size of each virtual machine with an
   unknown amount of pagecache that could be potentially useful in the host
   for guests not using O_DIRECT (aka cache=off).

Hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and from 19
to 15 in case both the linux hypervisor and the linux guest use this patch
(though the guest will limit the additional speedup to anonymous regions
only for now...). Even more important is that the tlb miss handler is much
slower on an NPT/EPT guest than in a regular shadow paging or
no-virtualization scenario. So maximizing the amount of virtual memory
cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive way possible. We want hugepages and hugetlb to be used in
a way that lets all applications benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...

The second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page could return -ENOMEM (instead of the current void)
we'd need to roll back the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separate
and incremental, and it'll just be a "harmless" addition later if this
initial part is agreed upon. It should also be noted that locking-wise,
replacing regular pages with hugepages is going to be very easy compared
to what I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) and we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the slightest sign of trouble
and we can try again later. collapse_huge_page will be similar to how KSM
works and madvise(MADV_HUGEPAGE) will work similarly to
madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault
time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled.
The control knob can be set to three values, "always", "madvise" and
"never", which mean respectively that hugepages are always used, used only
inside madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls whether the
hugepage allocation should defrag memory aggressively "always", only
inside "madvise" regions, or "never".
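
As a usage sketch from the application side (a hypothetical example:
MADV_HUGEPAGE is defined elsewhere in this series and the region size
below is arbitrary), a program that wants hugepages for one region while
the global policy is "madvise" could do:

============
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#error "MADV_HUGEPAGE is provided by the headers of this patchset"
#endif

#define REGION (64UL*1024*1024)	/* illustrative; any anon region works */

int main(void)
{
	char *p = mmap(NULL, REGION, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/*
	 * Mark the vma so hugepages are used here even in "madvise" mode.
	 * Only the HPAGE_PMD_SIZE aligned ranges fully inside the vma can
	 * be mapped by huge pmds.
	 */
	if (madvise(p, REGION, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(p, 0, REGION);	/* faulting it in allocates hugepages */
	return 0;
}
============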

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use the mmu notifier,
like O_DIRECT) that can run against __split_huge_page_refcount instead
was a pain to serialize in a way that would always result in a coherent
page count for both tail and head. I think my locking solution, with a
compound_lock taken only after the head page is valid and is still a
PageHead, should be safe, but it surely needs review from an SMP race
point of view. In short there is no existing way to serialize the
O_DIRECT final put_page against __split_huge_page_refcount, so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by the
time gup_fast returns, so...). And I didn't want to impact all
gup/gup_fast users for now; maybe if we change the gup interface
substantially we can avoid this locking, but I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) that KVM (and any other mmu
notifier user) would call without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task's
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage, so we can't avoid it,
and the race of put_page against __split_huge_page_refcount has to be
solved to achieve a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;). MMU
notifiers are handled transparently too, with the exception of the young
bit on the pmd, which doesn't have a range check, but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also use a
huge pmd when they notice gup returns pages with PageCompound set, so they
won't care about a range and there's just the pmd young bit to check in
that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down
and waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch copy_huge_page (and clear_huge_page
too) to non-temporal stores. I also extensively researched ways to avoid
this cache thrashing with a full prefault logic that would cow in
8k/16k/32k/64k up to 1M (I can send those patches that fully implemented
prefault) but I concluded they're not worth it: they add huge additional
complexity and they remove all tlb benefits until the full hugepage has
been faulted in, to save a little bit of memory and some cache during app
startup, and they still don't substantially improve the cache thrashing
during startup if the prefault happens in >4k chunks. One reason is that
those copied 4k pte entries are still mapped on a perfectly cache-colored
hugepage, so the thrashing is the worst one can generate in those copies
(COWs of 4k pages aren't so well colored, so they thrash less, but again
this results in software running faster after the page fault). Those
prefault patches allowed things like a pte where post-cow pages were local
4k regular anon pages and the not-yet-cowed pte entries were pointing into
the middle of some hugepage mapped read-only. If it doesn't pay off
substantially with today's hardware it will pay off even less in the
future with larger L2 caches, and the prefault logic would bloat the VM a
lot. On embedded systems transparent_hugepage can be disabled at runtime
with sysfs or at boot with the kernel command line parameter
transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages to
madvise regions), which will ensure not a single hugepage is allocated at
boot time. It is simple enough to just disable transparent hugepages
globally and let transparent hugepages be allocated selectively by
applications in MADV_HUGEPAGE regions (both at page fault time, and, if
enabled, with collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped by a pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevent mixing different page sizes in
the same regions, so they will not fit in this framework, which requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be
expected to be found unfragmented after a certain system uptime and that
would be very expensive to defragment with relocation, so they require
reservation. hugetlbfs is the "reservation way", the point of transparent
hugepages is not to have any reservation at all and to maximize the use of
cache and hugepages at all times automatically.

Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,124 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      unsigned long address, pmd_t *pmd,
+				      unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+
+#define transparent_hugepage_enabled(__vma)				\
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
+	 (transparent_hugepage_flags &				\
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&		\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma)			       \
+	(transparent_hugepage_flags &				       \
+	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) ||		       \
+	 (transparent_hugepage_flags &				       \
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&	       \
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow()				\
+	(transparent_hugepage_flags &					\
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+				 pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_mm(__mm, __addr, __pmd);	\
+	}  while (0)
+#define split_huge_page_vma(__vma, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_vma(__vma, __pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		smp_mb();						\
+		spin_unlock_wait(&(__anon_vma)->lock);			\
+		smp_mb();						\
+		VM_BUG_ON(pmd_trans_splitting(*(__pmd)) ||		\
+			  pmd_trans_huge(*(__pmd)));			\
+	} while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER >= MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd)	\
+	do { }  while (0)
+#define split_huge_page_vma(__vma, __pmd)	\
+	do { }  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
+#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -234,6 +237,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
 }
 
 static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+		       struct list_head *head)
+{
+	list_add(&page->lru, head);
+	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->lru[l].list);
-	__inc_zone_state(zone, NR_LRU_BASE + l);
-	mem_cgroup_add_lru_list(page, l);
+	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
 }
 
 static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,847 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag enabled,
+				enum transparent_hugepage_flag req_madv)
+{
+	if (test_bit(enabled, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(req_madv, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag enabled,
+				 enum transparent_hugepage_flag req_madv)
+{
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		set_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+/*
+ * Currently uses __GFP_REPEAT during allocation. Should be
+ * implemented using page migration and real defrag algorithms in
+ * future VM.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+			   struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+	__ATTR(defrag, 0644, defrag_show, defrag_store);
+
+static ssize_t single_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag flag)
+{
+	if (test_bit(flag, &transparent_hugepage_flags))
+		return sprintf(buf, "[yes] no\n");
+	else
+		return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag flag)
+{
+	if (!memcmp("yes", buf,
+		    min(sizeof("yes")-1, count))) {
+		set_bit(flag, &transparent_hugepage_flags);
+	} else if (!memcmp("no", buf,
+			   min(sizeof("no")-1, count))) {
+		clear_bit(flag, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t debug_cow_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+	&enabled_attr.attr,
+	&defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&debug_cow_attr.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+	.attrs = hugepage_attr,
+	.name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init hugepage_init(void)
+{
+#ifdef CONFIG_SYSFS
+	int err;
+
+	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	if (err)
+		printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+	return 0;
+}
+module_init(hugepage_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+	if (!str)
+		return 0;
+	transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmd,
+					struct page *page,
+					unsigned long haddr)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, address);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	__SetPageUptodate(page);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to avoid the clear_huge_page writes to
+	 * become visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+	return alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+			   (defrag ? __GFP_REPEAT : 0)|__GFP_NOWARN,
+			   HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
+						    page, haddr);
+	}
+out:
+	pte = pte_alloc_map(mm, vma, pmd, address);
+	if (!pte)
+		return VM_FAULT_OOM;
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd)))
+		goto out_unlock;
+	if (unlikely(pmd_trans_splitting(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_page(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, anon_rss, HPAGE_PMD_NR);
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL;
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address,
+					pmd_t *pmd, pmd_t orig_pmd,
+					struct page *page,
+					unsigned long haddr)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int ret = 0, i;
+	struct page **pages;
+
+	pages = kzalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+			GFP_KERNEL);
+	if (unlikely(!pages)) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+					  vma, address);
+		if (unlikely(!pages[i])) {
+			while (--i >= 0)
+				put_page(pages[i]);
+			kfree(pages);
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	else
+		get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		copy_user_highpage(pages[i], page + i,
+				   haddr + PAGE_SIZE*i, vma);
+		__SetPageUptodate(pages[i]);
+		cond_resched();
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	else
+		put_page(page);
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = mk_pte(pages[i], vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_add_new_anon_rmap(pages[i], vma, haddr);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	kfree(pages);
+
+	mm->nr_ptes++;
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	spin_unlock(&mm->page_table_lock);
+
+	ret |= VM_FAULT_WRITE;
+	page_remove_rmap(page);
+	put_page(page);
+
+out:
+	return ret;
+
+out_free_pages:
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0;
+	struct page *page, *new_page;
+	unsigned long haddr;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_page(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_PMD_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+	if (unlikely(transparent_hugepage_debug_cow()) && new_page) {
+		put_page(new_page);
+		new_page = NULL;
+	}
+	if (unlikely(!new_page))
+		return do_huge_pmd_wp_page_fallback(mm, vma, address,
+						    pmd, orig_pmd, page, haddr);
+
+	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+	__SetPageUptodate(new_page);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to prevent the copy_huge_page writes from
+	 * becoming visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * And if the dirty bit ever becomes meaningful and we
+		 * only set it with FOLL_WRITE, an atomic
+		 * set_bit will be required on the pmd to set the
+		 * young bit, instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pmd_page(*pmd);
+			VM_BUG_ON(!PageCompound(page));
+			pmd_clear(pmd);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			spin_unlock(&tlb->mm->page_table_lock);
+			add_mm_counter(tlb->mm, anon_rss, -HPAGE_PMD_NR);
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+			      struct mm_struct *mm,
+			      unsigned long address,
+			      enum page_check_address_pmd_flag flag)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+		  pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+			  !pmd_trans_splitting(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+				       struct vm_area_struct *vma,
+				       unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to split it, the pmd must remain marked huge at all
+		 * times or the VM won't take the pmd_trans_huge paths
+		 * and it won't wait on the anon_vma->lock to
+		 * serialize against split_huge_page*.
+		 */
+		pmdp_splitting_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+	struct zone *zone = page_zone(page);
+
+	/* prevent PageLRU from going away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	compound_lock(page);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		/*
+		 * __split_huge_page_splitting() already set the
+		 * splitting bit in all pmds that could map this
+		 * hugepage, which ensures no CPU can alter the
+		 * mapcount on the head page. The mapcount is only
+		 * accounted in the head page and it has to be
+		 * transferred to all tail pages in the below code. So
+		 * for this code to be safe, the mapcount can't change
+		 * during the split. But that doesn't mean userland
+		 * can't keep changing and reading the page contents
+		 * while we transfer the mapcount, so the pmd
+		 * splitting status is achieved by setting a reserved
+		 * bit in the pmd, not by clearing the present bit.
+		 */
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+
+		page_tail->index = ++head_index;
+
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		lru_add_page_tail(zone, page, page_tail);
+
+		put_page(page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+	spin_unlock_irq(&zone->lru_lock);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pgtable);
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+/* must be called with the anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct vm_area_struct *vma;
+
+	BUG_ON(!PageHead(page));
+
+	mapcount = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		int splitted;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		splitted = __split_huge_page_splitting(page, vma, addr);
+		VM_BUG_ON(splitted && addr & ~HPAGE_PMD_MASK);
+		mapcount += splitted;
+	}
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		int splitted;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		splitted = __split_huge_page_map(page, vma, addr);
+		VM_BUG_ON(splitted && addr & ~HPAGE_PMD_MASK);
+		mapcount2 += splitted;
+	}
+	BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct page *page;
+	struct anon_vma *anon_vma;
+	struct mm_struct *mm;
+
+	BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+	mm = vma->vm_mm;
+	BUG_ON(down_write_trylock(&mm->mmap_sem));
+
+	anon_vma = vma->anon_vma;
+
+	spin_lock(&anon_vma->lock);
+	BUG_ON(pmd_trans_splitting(*pmd));
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&anon_vma->lock);
+		return;
+	}
+	page = pmd_page(*pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	__split_huge_page(page, anon_vma);
+
+	spin_unlock(&anon_vma->lock);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_mm(struct mm_struct *mm,
+			  unsigned long address,
+			  pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address + HPAGE_PMD_SIZE - 1);
+	BUG_ON(vma->vm_start > address);
+	BUG_ON(vma->vm_mm != mm);
+
+	__split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -647,9 +647,9 @@ out_set_pte:
 	return 0;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			if (next-addr != HPAGE_PMD_SIZE)
+				split_huge_page_vma(vma, pmd);
+			else if (zap_huge_pmd(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct 
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
@@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
+	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		if (!vma->vm_ops)
+			return do_huge_pmd_anonymous_page(mm, vma, address,
+							  pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_splitting(orig_pmd))
+				return do_huge_pmd_wp_page(mm, vma, address,
+							   pmd, orig_pmd);
+			return 0;
+		}
+	}
 	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
@@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -380,9 +363,39 @@ int page_referenced_one(struct page *pag
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageTransHuge(page))) {
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+		if (pmd && !pmd_trans_splitting(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -459,6 +459,43 @@ void __pagevec_release(struct pagevec *p
 
 EXPORT_SYMBOL(__pagevec_release);
 
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone *zone,
+		       struct page *page, struct page *page_tail)
+{
+	int active;
+	enum lru_list lru;
+	const int file = 0;
+	struct list_head *head;
+
+	VM_BUG_ON(!PageHead(page));
+	VM_BUG_ON(PageCompound(page_tail));
+	VM_BUG_ON(PageLRU(page_tail));
+	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+	SetPageLRU(page_tail);
+
+	if (page_evictable(page_tail, NULL)) {
+		if (PageActive(page)) {
+			SetPageActive(page_tail);
+			active = 1;
+			lru = LRU_ACTIVE_ANON;
+		} else {
+			active = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
+		update_page_reclaim_stat(zone, page_tail, file, active);
+		if (likely(PageLRU(page)))
+			head = page->lru.prev;
+		else
+			head = &zone->lru[lru].list;
+		__add_page_to_lru_list(zone, page_tail, lru, head);
+	} else {
+		SetPageUnevictable(page_tail);
+		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+	}
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.


* [PATCH 26 of 30] madvise(MADV_HUGEPAGE)
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 25 of 30] transparent hugepage core Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 27 of 30] memcg compound Andrea Arcangeli
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add madvise(MADV_HUGEPAGE) to mark regions that are important to be backed by
hugepages. Return -EINVAL if the vma is not anonymous, or if the feature isn't
built into the kernel. Never silently return success.
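
For reference, a minimal userland sketch of the intended usage (MADV_HUGEPAGE
itself is defined by patch 01; the fallback value below is only an assumption
for illustration):

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value, must match patch 01 */
#endif

#define LEN (16UL * 1024 * 1024)

int main(void)
{
	/* anonymous private mapping, large enough to span several pmds */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* hint that this range is important to be hugepage backed */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)"); /* -EINVAL if THP isn't built in */
	return 0;
}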

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,6 +101,7 @@ extern int split_huge_page(struct page *
 #endif
 
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+extern int hugepage_madvise(unsigned long *vm_flags);
 static inline int PageTransHuge(struct page *page)
 {
 	VM_BUG_ON(PageTail(page));
@@ -119,6 +120,11 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
+static inline int hugepage_madvise(unsigned long *vm_flags)
+{
+	BUG();
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -463,9 +463,11 @@ int do_huge_pmd_wp_page(struct mm_struct
 		put_page(new_page);
 		new_page = NULL;
 	}
-	if (unlikely(!new_page))
-		return do_huge_pmd_wp_page_fallback(mm, vma, address,
-						    pmd, orig_pmd, page, haddr);
+	if (unlikely(!new_page)) {
+		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+						   pmd, orig_pmd, page, haddr);
+		goto out;
+	}
 
 	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
@@ -495,6 +497,7 @@ int do_huge_pmd_wp_page(struct mm_struct
 	}
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+out:
 	return ret;
 }
 
@@ -845,3 +848,19 @@ out_unlock:
 out:
 	return ret;
 }
+
+int hugepage_madvise(unsigned long *vm_flags)
+{
+	/*
+	 * Be somewhat over-protective like KSM for now!
+	 */
+	if (*vm_flags & (VM_HUGEPAGE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP   | VM_IO      | VM_DONTEXPAND |
+			 VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
+			 VM_MIXEDMAP | VM_SAO))
+		return -EINVAL;
+
+	*vm_flags |= VM_HUGEPAGE;
+
+	return 0;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
 		if (error)
 			goto out;
 		break;
+	case MADV_HUGEPAGE:
+		error = hugepage_madvise(&new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+#endif
 		return 1;
 
 	default:


* [PATCH 27 of 30] memcg compound
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 26 of 30] madvise(MADV_HUGEPAGE) Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  7:07   ` KAMEZAWA Hiroyuki
  2010-01-21  6:20 ` [PATCH 28 of 30] memcg huge memory Andrea Arcangeli
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Teach memcg to charge/uncharge compound pages.
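
For reference, with this change the charge size for a THP head page becomes
PAGE_SIZE << compound_order(page); assuming x86-64 with HPAGE_PMD_ORDER == 9
that works out to 4096 << 9 == 2097152 bytes (2M), so the whole hugepage is
accounted with a single res_counter charge/uncharge instead of being skipped
as compound pages were before.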

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1288,15 +1288,20 @@ static atomic_t memcg_drain_count;
  * cgroup which is not current target, returns false. This stock will be
  * refilled.
  */
-static bool consume_stock(struct mem_cgroup *mem)
+static bool consume_stock(struct mem_cgroup *mem, int *page_size)
 {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (mem == stock->cached && stock->charge)
-		stock->charge -= PAGE_SIZE;
-	else /* need to call res_counter_charge */
+	if (mem == stock->cached && stock->charge) {
+		if (*page_size > stock->charge) {
+			*page_size -= stock->charge;
+			stock->charge = 0;
+			ret = false;
+		} else
+			stock->charge -= *page_size;
+	} else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
 	return ret;
@@ -1401,13 +1406,13 @@ static int __cpuinit memcg_stock_cpu_cal
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg,
-			bool oom, struct page *page)
+				   gfp_t gfp_mask, struct mem_cgroup **memcg,
+				   bool oom, struct page *page, int page_size)
 {
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct res_counter *fail_res;
-	int csize = CHARGE_SIZE;
+	int csize = max(page_size, (int) CHARGE_SIZE);
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -1439,7 +1444,7 @@ static int __mem_cgroup_try_charge(struc
 		int ret = 0;
 		unsigned long flags = 0;
 
-		if (consume_stock(mem))
+		if (consume_stock(mem, &page_size))
 			goto charged;
 
 		ret = res_counter_charge(&mem->res, csize, &fail_res);
@@ -1460,8 +1465,8 @@ static int __mem_cgroup_try_charge(struc
 									res);
 
 		/* reduce request size and retry */
-		if (csize > PAGE_SIZE) {
-			csize = PAGE_SIZE;
+		if (csize > page_size) {
+			csize = page_size;
 			continue;
 		}
 		if (!(gfp_mask & __GFP_WAIT))
@@ -1491,8 +1496,8 @@ static int __mem_cgroup_try_charge(struc
 			goto nomem;
 		}
 	}
-	if (csize > PAGE_SIZE)
-		refill_stock(mem, csize - PAGE_SIZE);
+	if (csize > page_size)
+		refill_stock(mem, csize - page_size);
 charged:
 	/*
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1512,12 +1517,12 @@ nomem:
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
  */
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
 {
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, page_size);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, page_size);
 	}
 	css_put(&mem->css);
 }
@@ -1575,8 +1580,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  */
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				     struct page_cgroup *pc,
-				     enum charge_type ctype)
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
@@ -1585,7 +1591,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem);
+		mem_cgroup_cancel_charge(mem, page_size);
 		return;
 	}
 
@@ -1722,7 +1728,8 @@ static int mem_cgroup_move_parent(struct
 		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
+		PAGE_SIZE);
 	if (ret || !parent)
 		goto put_back;
 
@@ -1730,7 +1737,7 @@ static int mem_cgroup_move_parent(struct
 	if (!ret)
 		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
 	else
-		mem_cgroup_cancel_charge(parent);	/* does css_put */
+		mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
 put_back:
 	putback_lru_page(page);
 put:
@@ -1752,6 +1759,11 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	int ret;
+	int page_size = PAGE_SIZE;
+
+	VM_BUG_ON(PageTail(page));
+	if (PageHead(page))
+		page_size <<= compound_order(page);
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -1760,11 +1772,12 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
+				      page_size);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
 	return 0;
 }
 
@@ -1773,8 +1786,6 @@ int mem_cgroup_newpage_charge(struct pag
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -1787,7 +1798,7 @@ int mem_cgroup_newpage_charge(struct pag
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 static void
@@ -1880,14 +1891,14 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
 	/* drop extra refcnt from tryget */
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
 }
 
 static void
@@ -1903,7 +1914,7 @@ __mem_cgroup_commit_charge_swapin(struct
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, ctype);
+	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -1952,11 +1963,12 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem);
+	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
 }
 
 static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+	      int page_size)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
@@ -1989,14 +2001,14 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += PAGE_SIZE;
+	batch->bytes += page_size;
 	if (uncharge_memsw)
-		batch->memsw_bytes += PAGE_SIZE;
+		batch->memsw_bytes += page_size;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, page_size);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, page_size);
 	return;
 }
 
@@ -2009,6 +2021,11 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	int page_size = PAGE_SIZE;
+
+	VM_BUG_ON(PageTail(page));
+	if (PageHead(page))
+		page_size <<= compound_order(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2016,6 +2033,8 @@ __mem_cgroup_uncharge_common(struct page
 	if (PageSwapCache(page))
 		return NULL;
 
+	VM_BUG_ON(PageTail(page));
+
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2048,7 +2067,7 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype);
+		__do_uncharge(mem, ctype, page_size);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
 	mem_cgroup_charge_statistics(mem, pc, false);
@@ -2217,7 +2236,7 @@ int mem_cgroup_prepare_migration(struct 
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
-						page);
+					      page, PAGE_SIZE);
 		css_put(&mem->css);
 	}
 	*ptr = mem;
@@ -2260,7 +2279,7 @@ void mem_cgroup_end_migration(struct mem
 	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
 	 * So, double-counting is effectively avoided.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
 
 	/*
 	 * Both of oldpage and newpage are still under lock_page().


* [PATCH 28 of 30] memcg huge memory
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 27 of 30] memcg compound Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  7:16   ` KAMEZAWA Hiroyuki
  2010-01-21  6:20 ` [PATCH 29 of 30] transparent hugepage vmstat Andrea Arcangeli
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add memcg charge/uncharge to hugepage faults in huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -212,6 +212,7 @@ static int __do_huge_pmd_anonymous_page(
 	VM_BUG_ON(!PageCompound(page));
 	pgtable = pte_alloc_one(mm, address);
 	if (unlikely(!pgtable)) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		return VM_FAULT_OOM;
 	}
@@ -228,6 +229,7 @@ static int __do_huge_pmd_anonymous_page(
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -265,6 +267,10 @@ int do_huge_pmd_anonymous_page(struct mm
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
+		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+			put_page(page);
+			goto out;
+		}
 
 		return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
 						    page, haddr);
@@ -365,9 +371,15 @@ static int do_huge_pmd_wp_page_fallback(
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 					  vma, address);
-		if (unlikely(!pages[i])) {
-			while (--i >= 0)
+		if (unlikely(!pages[i] ||
+			     mem_cgroup_newpage_charge(pages[i], mm,
+						       GFP_KERNEL))) {
+			if (pages[i])
 				put_page(pages[i]);
+			while (--i >= 0) {
+				mem_cgroup_uncharge_page(pages[i]);
+				put_page(pages[i]);
+			}
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
@@ -426,8 +438,10 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mem_cgroup_uncharge_page(pages[i]);
 		put_page(pages[i]);
+	}
 	kfree(pages);
 	goto out;
 }
@@ -469,6 +483,11 @@ int do_huge_pmd_wp_page(struct mm_struct
 		goto out;
 	}
 
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		put_page(new_page);
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
 	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
@@ -480,9 +499,10 @@ int do_huge_pmd_wp_page(struct mm_struct
 	smp_wmb();
 
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-	else {
+	} else {
 		pmd_t entry;
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);


* [PATCH 29 of 30] transparent hugepage vmstat
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 28 of 30] memcg huge memory Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-21  6:20 ` [PATCH 30 of 30] khugepaged Andrea Arcangeli
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add hugepage stat information to /proc/vmstat and /proc/meminfo.
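
A quick way to watch the new counter from userland (just a sketch, it only
relies on the AnonHugePages line added below):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	/* print only the AnonHugePages line added by this patch */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "AnonHugePages:", 14))
			fputs(line, stdout);
	fclose(f);
	return 0;
}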

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		"HardwareCorrupted: %5lu kB\n"
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		"AnonHugePages:  %8lu kB\n"
+#endif
 		,
 		K(i.totalram),
 		K(i.freeram),
@@ -151,6 +154,10 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,7 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -715,6 +715,9 @@ static void __split_huge_page_refcount(s
 		put_page(page_tail);
 	}
 
+	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+
 	ClearPageCompound(page);
 	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -688,8 +688,13 @@ void page_add_anon_rmap(struct page *pag
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	VM_BUG_ON(PageTail(page));
-	if (first)
-		__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (first) {
+		if (!PageTransHuge(page))
+			__inc_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__inc_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 	if (unlikely(PageKsm(page)))
 		return;
 
@@ -718,7 +723,10 @@ void page_add_new_anon_rmap(struct page 
 	VM_BUG_ON(PageTail(page));
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (!PageTransHuge(page))
+		__inc_zone_page_state(page, NR_ANON_PAGES);
+	else
+		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__page_set_anon_rmap(page, vma, address);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -766,7 +774,11 @@ void page_remove_rmap(struct page *page)
 	}
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
-		__dec_zone_page_state(page, NR_ANON_PAGES);
+		if (!PageTransHuge(page))
+			__dec_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__dec_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -655,6 +655,9 @@ static const char * const vmstat_text[] 
 	"numa_local",
 	"numa_other",
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"nr_anon_transparent_hugepages",
+#endif
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	"pgpgin",


* [PATCH 30 of 30] khugepaged
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 29 of 30] transparent hugepage vmstat Andrea Arcangeli
@ 2010-01-21  6:20 ` Andrea Arcangeli
  2010-01-22 14:46 ` [PATCH 00 of 30] Transparent Hugepage support #3 Christoph Lameter
  2010-01-26 11:24 ` Mel Gorman
  31 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21  6:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add khugepaged to relocate fragmented pages into hugepages if new hugepages
become available. (This is independent of the defrag logic that will have to
make new hugepages available.)
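
For completeness, a minimal sketch of driving the khugepaged tunables
introduced below from userland (the paths simply follow the sysfs layout of
this patch, /sys/kernel/mm/transparent_hugepage/khugepaged/):

#include <stdio.h>

/* write one value into a khugepaged sysfs tunable */
static int write_tunable(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/transparent_hugepage/khugepaged/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* scan continuously and poll the allocator without sleeping */
	write_tunable("scan_sleep_millisecs", "0");
	write_tunable("defrag_sleep_millisecs", "0");
	return 0;
}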

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -25,6 +25,8 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
@@ -45,12 +47,14 @@ extern pmd_t *page_check_address_pmd(str
 	 (transparent_hugepage_flags &				\
 	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&		\
 	  (__vma)->vm_flags & VM_HUGEPAGE))
-#define transparent_hugepage_defrag(__vma)			       \
+#define __transparent_hugepage_defrag(__in_madv)		       \
 	(transparent_hugepage_flags &				       \
 	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) ||		       \
 	 (transparent_hugepage_flags &				       \
 	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&	       \
-	  (__vma)->vm_flags & VM_HUGEPAGE))
+	  (__in_madv)))
+#define transparent_hugepage_defrag(__vma)				\
+	__transparent_hugepage_defrag((__vma)->vm_flags & VM_HUGEPAGE)
 #ifdef CONFIG_DEBUG_VM
 #define transparent_hugepage_debug_cow()				\
 	(transparent_hugepage_flags &					\
@@ -101,7 +105,7 @@ extern int split_huge_page(struct page *
 #endif
 
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
-extern int hugepage_madvise(unsigned long *vm_flags);
+extern int hugepage_madvise(struct mm_struct *mm, unsigned long *vm_flags);
 static inline int PageTransHuge(struct page *page)
 {
 	VM_BUG_ON(PageTail(page));
@@ -120,7 +124,8 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
-static inline int hugepage_madvise(unsigned long *vm_flags)
+static inline int hugepage_madvise(struct mm_struct *mm,
+				   unsigned long *vm_flags)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
new file mode 100644
--- /dev/null
+++ b/include/linux/khugepaged.h
@@ -0,0 +1,32 @@
+#ifndef _LINUX_KHUGEPAGED_H
+#define _LINUX_KHUGEPAGED_H
+
+#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_exit(struct mm_struct *mm);
+
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+		return __khugepaged_enter(mm);
+	return 0;
+}
+
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+		__khugepaged_exit(mm);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	return 0;
+}
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -482,6 +482,7 @@ extern int get_dumpable(struct mm_struct
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_VM_HUGEPAGE		17	/* set when VM_HUGEPAGE is set on vma */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,6 +65,7 @@
 #include <linux/perf_event.h>
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
+#include <linux/khugepaged.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -306,6 +307,9 @@ static int dup_mmap(struct mm_struct *mm
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
+	retval = khugepaged_fork(mm, oldmm);
+	if (retval)
+		goto out;
 
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
@@ -515,6 +519,7 @@ void mmput(struct mm_struct *mm)
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		ksm_exit(mm);
+		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,14 +12,119 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
 
 unsigned long transparent_hugepage_flags __read_mostly =
-	(1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG);
+
+/* by default scan 16 pmds every 10 seconds */
+static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*16;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_defrag_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+
+static int khugepaged(void *none);
+static int mm_slots_hash_init(void);
+static int ksm_slab_init(void);
+static void ksm_slab_free(void);
+
+#define MM_SLOTS_HASH_HEADS 1024
+static struct hlist_head *mm_slots_hash __read_mostly;
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that mm to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+} khugepaged_scan = {
+	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+#define khugepaged_enabled()					       \
+	(transparent_hugepage_flags &				       \
+	 ((1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG) |		       \
+	  (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG)))
+#define khugepaged_always()				\
+	(transparent_hugepage_flags &			\
+	 (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG))
+#define khugepaged_req_madv()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG))
+
+static int start_khugepaged(void)
+{
+	int err = 0;
+	if (khugepaged_enabled()) {
+		int wakeup;
+		if (unlikely(!mm_slot_cache || !mm_slots_hash)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_thread)
+			khugepaged_thread = kthread_run(khugepaged, NULL,
+							"khugepaged");
+		if (unlikely(IS_ERR(khugepaged_thread))) {
+			clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				  &transparent_hugepage_flags);
+			clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
+				  &transparent_hugepage_flags);
+			printk(KERN_ERR
+			       "khugepaged: kthread_run(khugepaged) failed\n");
+			err = PTR_ERR(khugepaged_thread);
+			khugepaged_thread = NULL;
+		}
+		wakeup = !list_empty(&khugepaged_scan.mm_head);
+		mutex_unlock(&khugepaged_mutex);
+		if (wakeup)
+			wake_up_interruptible(&khugepaged_wait);
+	} else
+		/* wakeup to exit */
+		wake_up_interruptible(&khugepaged_wait);
+out:
+	return err;
+}
 
 #ifdef CONFIG_SYSFS
+
+static void wakeup_khugepaged(void)
+{
+	mutex_lock(&khugepaged_mutex);
+	if (khugepaged_thread)
+		wake_up_process(khugepaged_thread);
+	mutex_unlock(&khugepaged_mutex);
+}
+
 static ssize_t double_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag enabled,
@@ -153,20 +258,168 @@ static struct attribute *hugepage_attr[]
 
 static struct attribute_group hugepage_attr_group = {
 	.attrs = hugepage_attr,
-	.name = "transparent_hugepage",
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_scan_sleep_millisecs = msecs;
+	wakeup_khugepaged();
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+	       scan_sleep_millisecs_store);
+
+static ssize_t defrag_sleep_millisecs_show(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_defrag_sleep_millisecs);
+}
+
+static ssize_t defrag_sleep_millisecs_store(struct kobject *kobj,
+					    struct kobj_attribute *attr,
+					    const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_defrag_sleep_millisecs = msecs;
+	wakeup_khugepaged();
+
+	return count;
+}
+static struct kobj_attribute defrag_sleep_millisecs_attr =
+	__ATTR(defrag_sleep_millisecs, 0644, defrag_sleep_millisecs_show,
+	       defrag_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long pages;
+
+	err = strict_strtoul(buf, 10, &pages);
+	if (err || !pages || pages > UINT_MAX || pages & ~HPAGE_PMD_MASK)
+		return -EINVAL;
+
+	khugepaged_pages_to_scan = pages;
+
+	return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
+	       pages_to_scan_store);
+
+static ssize_t khugepaged_enabled_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+}
+static ssize_t khugepaged_enabled_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = double_flag_store(kobj, attr, buf, count,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+	if (ret > 0) {
+		int err = start_khugepaged();
+		if (err)
+			ret = err;
+	}
+	return ret;
+}
+static struct kobj_attribute khugepaged_enabled_attr =
+	__ATTR(enabled, 0644, khugepaged_enabled_show,
+	       khugepaged_enabled_store);
+
+static struct attribute *khugepaged_attr[] = {
+	&khugepaged_enabled_attr.attr,
+	&pages_to_scan_attr.attr,
+	&scan_sleep_millisecs_attr.attr,
+	&defrag_sleep_millisecs_attr.attr,
+	NULL,
+};
+
+static struct attribute_group khugepaged_attr_group = {
+	.attrs = khugepaged_attr,
+	.name = "khugepaged",
 };
 #endif /* CONFIG_SYSFS */
 
 static int __init ksm_init(void)
 {
+	int err;
 #ifdef CONFIG_SYSFS
-	int err;
+	static struct kobject *hugepage_kobj;
 
-	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	err = -ENOMEM;
+	hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
+	if (unlikely(!hugepage_kobj)) {
+		printk(KERN_ERR "hugepage: failed to create kobject\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed to register hugepage group\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed to register khugepaged group\n");
+		goto out;
+	}
+#endif
+
+	err = ksm_slab_init();
 	if (err)
-		printk(KERN_ERR "hugepage: register sysfs failed\n");
-#endif
-	return 0;
+		goto out;
+
+	err = mm_slots_hash_init();
+	if (err) {
+		ksm_slab_free();
+		goto out;
+	}
+
+	start_khugepaged();
+
+out:
+	return err;
 }
 module_init(ksm_init)
 
@@ -253,6 +506,11 @@ static inline struct page *alloc_hugepag
 			   HPAGE_PMD_ORDER);
 }
 
+static inline struct page *alloc_hugepage_defrag(void)
+{
+	return alloc_hugepage(1);
+}
+
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
@@ -264,6 +522,12 @@ int do_huge_pmd_anonymous_page(struct mm
 	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
+		if (unlikely(!test_bit(MMF_VM_HUGEPAGE, &mm->flags)))
+			if (khugepaged_always() ||
+			    (khugepaged_req_madv() &&
+			     vma->vm_flags & VM_HUGEPAGE))
+				if (__khugepaged_enter(mm))
+					return VM_FAULT_OOM;
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
@@ -872,7 +1136,7 @@ out:
 	return ret;
 }
 
-int hugepage_madvise(unsigned long *vm_flags)
+int hugepage_madvise(struct mm_struct *mm, unsigned long *vm_flags)
 {
 	/*
 	 * Be somewhat over-protective like KSM for now!
@@ -887,3 +1151,630 @@ int hugepage_madvise(unsigned long *vm_f
 
 	return 0;
 }
+
+static int __init ksm_slab_init(void)
+{
+	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __init ksm_slab_free(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+	mm_slot_cache = NULL;
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)	/* initialization failed */
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static int __init mm_slots_hash_init(void)
+{
+	mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head),
+				GFP_KERNEL);
+	if (!mm_slots_hash)
+		return -ENOMEM;
+	return 0;
+}
+
+#if 0
+static void __init mm_slots_hash_free(void)
+{
+	kfree(mm_slots_hash);
+	mm_slots_hash = NULL;
+}
+#endif
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	hlist_for_each_entry(mm_slot, node, bucket, hash) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	mm_slot->mm = mm;
+	hlist_add_head(&mm_slot->hash, bucket);
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int wakeup;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	spin_lock(&khugepaged_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	/*
+	 * Insert just behind the scanning cursor, to let the area settle
+	 * down a little.
+	 */
+	wakeup = list_empty(&khugepaged_scan.mm_head);
+	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+	set_bit(MMF_VM_HUGEPAGE, &mm->flags);
+	spin_unlock(&khugepaged_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+	if (wakeup)
+		wake_up_interruptible(&khugepaged_wait);
+
+	return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int free = 0;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = 1;
+	}
+
+	if (free) {
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		spin_unlock(&khugepaged_mm_lock);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		spin_unlock(&khugepaged_mm_lock);
+		/*
+		 * This is required to serialize against
+		 * khugepaged_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we
+		 * return all pagetables will be destroyed) until
+		 * khugepaged has finished working on the pagetables
+		 * under the mmap_sem.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static void release_pte_page(struct page *page)
+{
+	/* 0 stands for page_is_file_cache(page) == false */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	unlock_page(page);
+	putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+	while (--_pte >= pte)
+		release_pte_page(pte_page(*_pte));
+}
+
+static void release_all_pte_pages(pte_t *pte)
+{
+	release_pte_pages(pte, pte + HPAGE_PMD_NR);
+}
+
+static int __collapse_huge_page_isolate(pte_t *pte)
+{
+	struct page *page;
+	pte_t *_pte;
+	int referenced = 0, isolated = 0;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval) || !pte_write(pteval)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/* If no mapped pte is young, don't collapse the page */
+		if (pte_young(pteval))
+			referenced = 1;
+		page = pte_page(pteval);
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		VM_BUG_ON(!PageSwapBacked(page));
+
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: taking
+		 * PG_lock looks entirely unnecessary, but it is safer
+		 * when in doubt. If proven unnecessary it can be removed.
+		 */
+		if (!trylock_page(page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * Isolate the page to avoid collapsing a hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/* 0 stands for page_is_file_cache(page) == false */
+		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		VM_BUG_ON(!PageLocked(page));
+		VM_BUG_ON(PageLRU(page));
+	}
+	if (unlikely(!referenced))
+		release_all_pte_pages(pte);
+	else
+		isolated = 1;
+out:
+	return isolated;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+				      struct vm_area_struct *vma,
+				      unsigned long address)
+{
+	pte_t *_pte;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		struct page *src_page = pte_page(*_pte);
+		/* paravirt calls inside pte_clear here are superfluous */
+		pte_clear(vma->vm_mm, address, _pte);
+		copy_user_highpage(page, src_page, address, vma);
+		VM_BUG_ON(page_mapcount(src_page) != 1);
+		VM_BUG_ON(page_count(src_page) != 2);
+		release_pte_page(src_page);
+		page_remove_rmap(src_page);
+		free_page_and_swap_cache(src_page);
+
+		address += PAGE_SIZE;
+		page++;
+	}
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	struct vm_area_struct *vma;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, _pmd;
+	pte_t *pte;
+	pgtable_t pgtable;
+	struct page *new_page;
+	spinlock_t *ptl;
+	int isolated;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(!*hpage);
+
+	/*
+	 * Prevent all access to the pagetables, with the exception of
+	 * gup_fast, which is later handled by the pmdp_clear_flush, and
+	 * of the VM rmap walks, which are handled by the anon_vma lock +
+	 * PG_lock.
+	 */
+	down_write(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	vma = find_vma(mm, address + HPAGE_PMD_SIZE - 1);
+	if (!vma || vma->vm_start > address)
+		goto out;
+
+	if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+		goto out;
+
+	if (!vma->anon_vma || vma->vm_ops)
+		goto out;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	/* pmd can't go away or become huge under us */
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	/* stop anon_vma rmap pagetable access */
+	spin_lock(&vma->anon_vma->lock);
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+
+	spin_lock(&mm->page_table_lock); /* probably unnecessary */
+	/* after this gup_fast can't run anymore */
+	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	spin_lock(ptl);
+	isolated = __collapse_huge_page_isolate(pte);
+	spin_unlock(ptl);
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+	spin_unlock(&vma->anon_vma->lock);
+	pte_unmap(pte);
+
+	if (unlikely(!isolated)) {
+		spin_lock(&mm->page_table_lock);
+		BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, address, pmd, _pmd);
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	new_page = *hpage;
+	__collapse_huge_page_copy(pte, new_page, vma, address);
+	__SetPageUptodate(new_page);
+	pgtable = pmd_pgtable(_pmd);
+	VM_BUG_ON(page_count(pgtable) != 1);
+	VM_BUG_ON(page_mapcount(pgtable) != 0);
+
+	_pmd = mk_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+	_pmd = pmd_mkhuge(_pmd);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to prevent the __collapse_huge_page_copy writes
+	 * from becoming visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	BUG_ON(!pmd_none(*pmd));
+	page_add_new_anon_rmap(new_page, vma, address);
+	set_pmd_at(mm, address, pmd, _pmd);
+	update_mmu_cache(vma, address, _pmd);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes--;
+	spin_unlock(&mm->page_table_lock);
+
+	*hpage = NULL;
+out:
+	up_write(&mm->mmap_sem);
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	int ret = 0, referenced = 0;
+	struct page *page;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	pte = pte_offset_map(pmd, address);
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		pte_t pteval = *_pte;
+		barrier(); /* read from memory */
+		if (!pte_present(pteval) || !pte_write(pteval))
+			goto out_unmap;
+		if (pte_young(pteval))
+			referenced = 1;
+		page = pte_page(pteval);
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		if (!PageLRU(page) || PageLocked(page))
+			goto out_unmap;
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1)
+			goto out_unmap;
+	}
+	if (referenced)
+		ret = 1;
+out_unmap:
+	pte_unmap(pte);
+	if (ret) {
+		up_read(&mm->mmap_sem);
+		collapse_huge_page(mm, address, hpage);
+	}
+out:
+	return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+					    struct page **hpage)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	VM_BUG_ON(!pages);
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_scan.mm_slot)
+		mm_slot = khugepaged_scan.mm_slot;
+	else {
+		mm_slot = list_entry(khugepaged_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		khugepaged_scan.address = 0;
+		khugepaged_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&khugepaged_mm_lock);
+
+	mm = mm_slot->mm;
+	down_read(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, khugepaged_scan.address);
+
+	progress++;
+	for (; vma; vma = vma->vm_next) {
+		unsigned long hstart, hend;
+
+		cond_resched();
+		if (unlikely(khugepaged_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!(vma->vm_flags & VM_HUGEPAGE) &&
+		    !khugepaged_always()) {
+			progress++;
+			continue;
+		}
+		if (!vma->anon_vma || vma->vm_ops) {
+			khugepaged_scan.address = vma->vm_end;
+			progress++;
+			continue;
+		}
+		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+		hend = vma->vm_end & HPAGE_PMD_MASK;
+		if (hstart >= hend) {
+			progress++;
+			continue;
+		}
+		if (khugepaged_scan.address < hstart)
+			khugepaged_scan.address = hstart;
+		BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+		while (khugepaged_scan.address < vma->vm_end) {
+			int ret;
+			cond_resched();
+			if (unlikely(khugepaged_test_exit(mm)))
+				goto breakouterloop;
+
+			ret = khugepaged_scan_pmd(mm, vma,
+						  khugepaged_scan.address,
+						  hpage);
+			/* move to next address */
+			khugepaged_scan.address += HPAGE_PMD_SIZE;
+			progress += HPAGE_PMD_NR;
+			if (ret)
+				/* we released mmap_sem so break loop */
+				goto breakouterloop_mmap_sem;
+			if (progress >= pages)
+				goto breakouterloop;
+		}
+	}
+breakouterloop:
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+	spin_lock(&khugepaged_mm_lock);
+	BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (khugepaged_test_exit(mm) || !vma) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * khugepaged runs here, khugepaged_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+			khugepaged_scan.mm_slot = list_entry(
+				mm_slot->mm_node.next,
+				struct mm_slot, mm_node);
+			khugepaged_scan.address = 0;
+		} else
+			khugepaged_scan.mm_slot = NULL;
+
+		collect_mm_slot(mm_slot);
+	}
+
+	return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) &&
+		khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) ||
+		!khugepaged_enabled();
+}
+
+static void khugepaged_do_scan(struct page **hpage)
+{
+	unsigned int progress = 0, pass_through_head = 0;
+	unsigned int pages = khugepaged_pages_to_scan;
+
+	barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+	while (progress < pages) {
+		cond_resched();
+
+		if (!*hpage) {
+			*hpage = alloc_hugepage_defrag();
+			if (unlikely(!*hpage))
+				break;
+		}
+
+		spin_lock(&khugepaged_mm_lock);
+		if (!khugepaged_scan.mm_slot)
+			pass_through_head++;
+		if (khugepaged_has_work() &&
+		    pass_through_head < 2)
+			progress += khugepaged_scan_mm_slot(pages - progress,
+							    hpage);
+		else
+			progress = pages;
+		spin_unlock(&khugepaged_mm_lock);
+	}
+}
+
+static struct page *khugepaged_alloc_hugepage(void)
+{
+	struct page *hpage;
+
+	do {
+		hpage = alloc_hugepage_defrag();
+		if (!hpage)
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_defrag_sleep_millisecs));
+	} while (unlikely(!hpage) &&
+		 likely(khugepaged_enabled()));
+	return hpage;
+}
+
+static void khugepaged_loop(void)
+{
+	struct page *hpage;
+
+	while (likely(khugepaged_enabled())) {
+		hpage = khugepaged_alloc_hugepage();
+		if (unlikely(!hpage))
+			break;
+
+		khugepaged_do_scan(&hpage);
+		if (hpage)
+			put_page(hpage);
+		if (khugepaged_has_work())
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_scan_sleep_millisecs));
+		else if (khugepaged_enabled())
+			wait_event_interruptible(khugepaged_wait,
+						 khugepaged_wait_event());
+	}
+}
+
+static int khugepaged(void *none)
+{
+	struct mm_slot *mm_slot;
+
+	for (;;) {
+		BUG_ON(khugepaged_thread != current);
+		khugepaged_loop();
+		BUG_ON(khugepaged_thread != current);
+
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_enabled())
+			break;
+		mutex_unlock(&khugepaged_mutex);
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = khugepaged_scan.mm_slot;
+	khugepaged_scan.mm_slot = NULL;
+	if (mm_slot)
+		collect_mm_slot(mm_slot);
+	spin_unlock(&khugepaged_mm_lock);
+
+	khugepaged_thread = NULL;
+	mutex_unlock(&khugepaged_mutex);
+
+	return 0;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -72,7 +72,7 @@ static long madvise_behavior(struct vm_a
 			goto out;
 		break;
 	case MADV_HUGEPAGE:
-		error = hugepage_madvise(&new_flags);
+		error = hugepage_madvise(vma->vm_mm, &new_flags);
 		if (error)
 			goto out;
 		break;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 27 of 30] memcg compound
  2010-01-21  6:20 ` [PATCH 27 of 30] memcg compound Andrea Arcangeli
@ 2010-01-21  7:07   ` KAMEZAWA Hiroyuki
  2010-01-21 15:44     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-21  7:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 21 Jan 2010 07:20:51 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Teach memcg to charge/uncharge compound pages.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I'm sorry but I'm glad if you don't touch fast path.

if (likely(page_size == PAGE_SIZE))
	if (consume_stock(mem))
		goto charged;

is my recommendation.

Bye.
-Kame


> ---
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1288,15 +1288,20 @@ static atomic_t memcg_drain_count;
>   * cgroup which is not current target, returns false. This stock will be
>   * refilled.
>   */
> -static bool consume_stock(struct mem_cgroup *mem)
> +static bool consume_stock(struct mem_cgroup *mem, int *page_size)
>  {
>  	struct memcg_stock_pcp *stock;
>  	bool ret = true;
>  
>  	stock = &get_cpu_var(memcg_stock);
> -	if (mem == stock->cached && stock->charge)
> -		stock->charge -= PAGE_SIZE;
> -	else /* need to call res_counter_charge */
> +	if (mem == stock->cached && stock->charge) {
> +		if (*page_size > stock->charge) {
> +			*page_size -= stock->charge;
> +			stock->charge = 0;
> +			ret = false;
> +		} else
> +			stock->charge -= *page_size;
> +	} else /* need to call res_counter_charge */
>  		ret = false;
>  	put_cpu_var(memcg_stock);
>  	return ret;
> @@ -1401,13 +1406,13 @@ static int __cpuinit memcg_stock_cpu_cal
>   * oom-killer can be invoked.
>   */
>  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> -			gfp_t gfp_mask, struct mem_cgroup **memcg,
> -			bool oom, struct page *page)
> +				   gfp_t gfp_mask, struct mem_cgroup **memcg,
> +				   bool oom, struct page *page, int page_size)
>  {
>  	struct mem_cgroup *mem, *mem_over_limit;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct res_counter *fail_res;
> -	int csize = CHARGE_SIZE;
> +	int csize = max(page_size, (int) CHARGE_SIZE);
>  
>  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
>  		/* Don't account this! */
> @@ -1439,7 +1444,7 @@ static int __mem_cgroup_try_charge(struc
>  		int ret = 0;
>  		unsigned long flags = 0;
>  
> -		if (consume_stock(mem))
> +		if (consume_stock(mem, &page_size))
>  			goto charged;
>  
>  		ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1460,8 +1465,8 @@ static int __mem_cgroup_try_charge(struc
>  									res);
>  
>  		/* reduce request size and retry */
> -		if (csize > PAGE_SIZE) {
> -			csize = PAGE_SIZE;
> +		if (csize > page_size) {
> +			csize = page_size;
>  			continue;
>  		}
>  		if (!(gfp_mask & __GFP_WAIT))
> @@ -1491,8 +1496,8 @@ static int __mem_cgroup_try_charge(struc
>  			goto nomem;
>  		}
>  	}
> -	if (csize > PAGE_SIZE)
> -		refill_stock(mem, csize - PAGE_SIZE);
> +	if (csize > page_size)
> +		refill_stock(mem, csize - page_size);
>  charged:
>  	/*
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> @@ -1512,12 +1517,12 @@ nomem:
>   * This function is for that and do uncharge, put css's refcnt.
>   * gotten by try_charge().
>   */
> -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
>  {
>  	if (!mem_cgroup_is_root(mem)) {
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		res_counter_uncharge(&mem->res, page_size);
>  		if (do_swap_account)
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +			res_counter_uncharge(&mem->memsw, page_size);
>  	}
>  	css_put(&mem->css);
>  }
> @@ -1575,8 +1580,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>   */
>  
>  static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> -				     struct page_cgroup *pc,
> -				     enum charge_type ctype)
> +				       struct page_cgroup *pc,
> +				       enum charge_type ctype,
> +				       int page_size)
>  {
>  	/* try_charge() can return NULL to *memcg, taking care of it. */
>  	if (!mem)
> @@ -1585,7 +1591,7 @@ static void __mem_cgroup_commit_charge(s
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
> -		mem_cgroup_cancel_charge(mem);
> +		mem_cgroup_cancel_charge(mem, page_size);
>  		return;
>  	}
>  
> @@ -1722,7 +1728,8 @@ static int mem_cgroup_move_parent(struct
>  		goto put;
>  
>  	parent = mem_cgroup_from_cont(pcg);
> -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
> +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
> +		PAGE_SIZE);
>  	if (ret || !parent)
>  		goto put_back;
>  
> @@ -1730,7 +1737,7 @@ static int mem_cgroup_move_parent(struct
>  	if (!ret)
>  		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
>  	else
> -		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +		mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
>  put_back:
>  	putback_lru_page(page);
>  put:
> @@ -1752,6 +1759,11 @@ static int mem_cgroup_charge_common(stru
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  	int ret;
> +	int page_size = PAGE_SIZE;
> +
> +	VM_BUG_ON(PageTail(page));
> +	if (PageHead(page))
> +		page_size <<= compound_order(page);
>  
>  	pc = lookup_page_cgroup(page);
>  	/* can happen at boot */
> @@ -1760,11 +1772,12 @@ static int mem_cgroup_charge_common(stru
>  	prefetchw(pc);
>  
>  	mem = memcg;
> -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
> +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
> +				      page_size);
>  	if (ret || !mem)
>  		return ret;
>  
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
>  	return 0;
>  }
>  
> @@ -1773,8 +1786,6 @@ int mem_cgroup_newpage_charge(struct pag
>  {
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	if (PageCompound(page))
> -		return 0;
>  	/*
>  	 * If already mapped, we don't have to account.
>  	 * If page cache, page->mapping has address_space.
> @@ -1787,7 +1798,7 @@ int mem_cgroup_newpage_charge(struct pag
>  	if (unlikely(!mm))
>  		mm = &init_mm;
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
>  static void
> @@ -1880,14 +1891,14 @@ int mem_cgroup_try_charge_swapin(struct 
>  	if (!mem)
>  		goto charge_cur_mm;
>  	*ptr = mem;
> -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
> +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
>  	/* drop extra refcnt from tryget */
>  	css_put(&mem->css);
>  	return ret;
>  charge_cur_mm:
>  	if (unlikely(!mm))
>  		mm = &init_mm;
> -	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
> +	return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
>  }
>  
>  static void
> @@ -1903,7 +1914,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  	cgroup_exclude_rmdir(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
> -	__mem_cgroup_commit_charge(ptr, pc, ctype);
> +	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
>  	mem_cgroup_lru_add_after_commit_swapcache(page);
>  	/*
>  	 * Now swap is on-memory. This means this page may be
> @@ -1952,11 +1963,12 @@ void mem_cgroup_cancel_charge_swapin(str
>  		return;
>  	if (!mem)
>  		return;
> -	mem_cgroup_cancel_charge(mem);
> +	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
>  }
>  
>  static void
> -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
> +	      int page_size)
>  {
>  	struct memcg_batch_info *batch = NULL;
>  	bool uncharge_memsw = true;
> @@ -1989,14 +2001,14 @@ __do_uncharge(struct mem_cgroup *mem, co
>  	if (batch->memcg != mem)
>  		goto direct_uncharge;
>  	/* remember freed charge and uncharge it later */
> -	batch->bytes += PAGE_SIZE;
> +	batch->bytes += page_size;
>  	if (uncharge_memsw)
> -		batch->memsw_bytes += PAGE_SIZE;
> +		batch->memsw_bytes += page_size;
>  	return;
>  direct_uncharge:
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->res, page_size);
>  	if (uncharge_memsw)
> -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&mem->memsw, page_size);
>  	return;
>  }
>  
> @@ -2009,6 +2021,11 @@ __mem_cgroup_uncharge_common(struct page
>  	struct page_cgroup *pc;
>  	struct mem_cgroup *mem = NULL;
>  	struct mem_cgroup_per_zone *mz;
> +	int page_size = PAGE_SIZE;
> +
> +	VM_BUG_ON(PageTail(page));
> +	if (PageHead(page))
> +		page_size <<= compound_order(page);
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -2016,6 +2033,8 @@ __mem_cgroup_uncharge_common(struct page
>  	if (PageSwapCache(page))
>  		return NULL;
>  
> +	VM_BUG_ON(PageTail(page));
> +
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> @@ -2048,7 +2067,7 @@ __mem_cgroup_uncharge_common(struct page
>  	}
>  
>  	if (!mem_cgroup_is_root(mem))
> -		__do_uncharge(mem, ctype);
> +		__do_uncharge(mem, ctype, page_size);
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		mem_cgroup_swap_statistics(mem, true);
>  	mem_cgroup_charge_statistics(mem, pc, false);
> @@ -2217,7 +2236,7 @@ int mem_cgroup_prepare_migration(struct 
>  
>  	if (mem) {
>  		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> -						page);
> +					      page, PAGE_SIZE);
>  		css_put(&mem->css);
>  	}
>  	*ptr = mem;
> @@ -2260,7 +2279,7 @@ void mem_cgroup_end_migration(struct mem
>  	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
>  	 * So, double-counting is effectively avoided.
>  	 */
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
>  
>  	/*
>  	 * Both of oldpage and newpage are still under lock_page().
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-21  6:20 ` [PATCH 28 of 30] memcg huge memory Andrea Arcangeli
@ 2010-01-21  7:16   ` KAMEZAWA Hiroyuki
  2010-01-21 16:08     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-21  7:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 21 Jan 2010 07:20:52 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add memcg charge/uncharge to hugepage faults in huge_memory.c.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -212,6 +212,7 @@ static int __do_huge_pmd_anonymous_page(
>  	VM_BUG_ON(!PageCompound(page));
>  	pgtable = pte_alloc_one(mm, address);
>  	if (unlikely(!pgtable)) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		return VM_FAULT_OOM;
>  	}
> @@ -228,6 +229,7 @@ static int __do_huge_pmd_anonymous_page(
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		pte_free(mm, pgtable);

Can't we do this put_page() and uncharge() outside of page table lock ?

Thanks,
-Kame

>  	} else {
> @@ -265,6 +267,10 @@ int do_huge_pmd_anonymous_page(struct mm
>  		page = alloc_hugepage(transparent_hugepage_defrag(vma));
>  		if (unlikely(!page))
>  			goto out;
> +		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
> +			put_page(page);
> +			goto out;
> +		}
>  
>  		return __do_huge_pmd_anonymous_page(mm, vma, address, pmd,
>  						    page, haddr);
> @@ -365,9 +371,15 @@ static int do_huge_pmd_wp_page_fallback(
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
>  					  vma, address);
> -		if (unlikely(!pages[i])) {
> -			while (--i >= 0)
> +		if (unlikely(!pages[i] ||
> +			     mem_cgroup_newpage_charge(pages[i], mm,
> +						       GFP_KERNEL))) {
> +			if (pages[i])
>  				put_page(pages[i]);
> +			while (--i >= 0) {
> +				mem_cgroup_uncharge_page(pages[i]);
> +				put_page(pages[i]);
> +			}

Maybe we can use the batched uncharge here, as in:

	mem_cgroup_uncharge_start();
	while (--i) {
		mem_cgroup_uncharge_page(page[i]);
		put_page(pages[i]);
	}
	mem_cgroup_uncharge_end();

Hmm...but this requires some modification to memcontrol.c. Okay, please
leave this as my homework.

 
>  			kfree(pages);
>  			ret |= VM_FAULT_OOM;
>  			goto out;
> @@ -426,8 +438,10 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(&mm->page_table_lock);
> -	for (i = 0; i < HPAGE_PMD_NR; i++)
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		mem_cgroup_uncharge_page(pages[i]);
>  		put_page(pages[i]);
> +	}
here too.

Bye.
-Kame
>  	kfree(pages);
>  	goto out;
>  }
> @@ -469,6 +483,11 @@ int do_huge_pmd_wp_page(struct mm_struct
>  		goto out;
>  	}
>  
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		put_page(new_page);
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
>  	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>  
> @@ -480,9 +499,10 @@ int do_huge_pmd_wp_page(struct mm_struct
>  	smp_wmb();
>  
>  	spin_lock(&mm->page_table_lock);
> -	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
> +		mem_cgroup_uncharge_page(new_page);
>  		put_page(new_page);
> -	else {
> +	} else {
>  		pmd_t entry;
>  		entry = mk_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 27 of 30] memcg compound
  2010-01-21  7:07   ` KAMEZAWA Hiroyuki
@ 2010-01-21 15:44     ` Andrea Arcangeli
  2010-01-21 23:55       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21 15:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, Jan 21, 2010 at 04:07:59PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 21 Jan 2010 07:20:51 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Teach memcg to charge/uncharge compound pages.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> I'm sorry but I'm glad if you don't touch fast path.
> 
> if (likely(page_size == PAGE_SIZE))
> 	if (consume_stock(mem))
> 		goto charged;
> 
> is my recommendation.

Ok, updated. But I didn't touch this code since the last submit, because I
didn't merge the other patch (not yet in mainline) that you said would
complicate things. So I assume most of it will need to be rewritten. I
also thought you wanted to remove the hugepage size from the batch logic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-21  7:16   ` KAMEZAWA Hiroyuki
@ 2010-01-21 16:08     ` Andrea Arcangeli
  2010-01-22  0:13       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21 16:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

> > @@ -228,6 +229,7 @@ static int __do_huge_pmd_anonymous_page(
> >  
> >  	spin_lock(&mm->page_table_lock);
> >  	if (unlikely(!pmd_none(*pmd))) {
> > +		mem_cgroup_uncharge_page(page);
> >  		put_page(page);
> >  		pte_free(mm, pgtable);
> 
On Thu, Jan 21, 2010 at 04:16:01PM +0900, KAMEZAWA Hiroyuki wrote:
> Can't we do this put_page() and uncharge() outside of page table lock ?

Yes we can, but it's only a micro-optimization because this path only
triggers during a controlled race condition across different
threads. No problem optimizing it, though...

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -228,6 +228,7 @@ static int __do_huge_pmd_anonymous_page(
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -238,8 +239,8 @@ static int __do_huge_pmd_anonymous_page(
 		page_add_new_anon_rmap(page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		prepare_pmd_huge_pte(pgtable, mm);
+		spin_unlock(&mm->page_table_lock);
 	}
-	spin_unlock(&mm->page_table_lock);
 
 	return ret;
 }

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -230,6 +230,7 @@ static int __do_huge_pmd_anonymous_page(
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(&mm->page_table_lock);
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {


Also, below I appended the updated memcg_compound patch that stops using
the batch system. Note that I removed your "likely" because for KVM most
of the time it will be transparent hugepages being charged. I prefer
likely/unlikely when it's always a slow path no matter what the workload
(assuming useful/optimized workloads only ;). As said in an earlier
email, I guess the below may be wasted time because of the rework coming
on this file. Also note the PageTransHuge check here is used instead of
page_size == PAGE_SIZE to eliminate that additional branch at compile
time if TRANSPARENT_HUGEPAGE=n (see the sketch below).
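
To make the compile-time point concrete, the check boils down to
something like the below (just a sketch from memory of what the series
defines, not a quote of the actual page-flags patch):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* a transparent hugepage is simply a compound head page */
static inline int PageTransHuge(struct page *page)
{
	VM_BUG_ON(PageTail(page));
	return PageHead(page);
}
#else
/*
 * Constant 0 with TRANSPARENT_HUGEPAGE=n, so the compiler can drop the
 * whole "csize = page_size" branch and the fast path stays as before.
 */
#define PageTransHuge(page)	0
#endif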

Now the only real pain that remains is the LRU list accounting. I tried
to solve it but found no clean way that didn't require messing all over
vmscan.c. So for now hugepages on the LRU are accounted as 4k pages
;). Nothing breaks, the stats just won't be as useful to the admin...

Subject: memcg compound
From: Andrea Arcangeli <aarcange@redhat.com>

Teach memcg to charge/uncharge compound pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1401,8 +1401,8 @@ static int __cpuinit memcg_stock_cpu_cal
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg,
-			bool oom, struct page *page)
+				   gfp_t gfp_mask, struct mem_cgroup **memcg,
+				   bool oom, struct page *page, int page_size)
 {
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -1415,6 +1415,9 @@ static int __mem_cgroup_try_charge(struc
 		return 0;
 	}
 
+	if (PageTransHuge(page))
+		csize = page_size;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -1439,8 +1442,9 @@ static int __mem_cgroup_try_charge(struc
 		int ret = 0;
 		unsigned long flags = 0;
 
-		if (consume_stock(mem))
-			goto charged;
+		if (!PageTransHuge(page))
+			if (consume_stock(mem))
+				goto charged;
 
 		ret = res_counter_charge(&mem->res, csize, &fail_res);
 		if (likely(!ret)) {
@@ -1460,7 +1464,7 @@ static int __mem_cgroup_try_charge(struc
 									res);
 
 		/* reduce request size and retry */
-		if (csize > PAGE_SIZE) {
+		if (csize > page_size) {
 			csize = PAGE_SIZE;
 			continue;
 		}
@@ -1491,7 +1495,7 @@ static int __mem_cgroup_try_charge(struc
 			goto nomem;
 		}
 	}
-	if (csize > PAGE_SIZE)
+	if (csize > page_size)
 		refill_stock(mem, csize - PAGE_SIZE);
 charged:
 	/*
@@ -1512,12 +1516,12 @@ nomem:
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
  */
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
 {
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, page_size);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, page_size);
 	}
 	css_put(&mem->css);
 }
@@ -1575,8 +1579,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  */
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				     struct page_cgroup *pc,
-				     enum charge_type ctype)
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
@@ -1585,7 +1590,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem);
+		mem_cgroup_cancel_charge(mem, page_size);
 		return;
 	}
 
@@ -1722,7 +1727,8 @@ static int mem_cgroup_move_parent(struct
 		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
+		PAGE_SIZE);
 	if (ret || !parent)
 		goto put_back;
 
@@ -1730,7 +1736,7 @@ static int mem_cgroup_move_parent(struct
 	if (!ret)
 		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
 	else
-		mem_cgroup_cancel_charge(parent);	/* does css_put */
+		mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
 put_back:
 	putback_lru_page(page);
 put:
@@ -1752,6 +1758,10 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	int ret;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -1760,11 +1770,12 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
+				      page_size);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
 	return 0;
 }
 
@@ -1773,8 +1784,6 @@ int mem_cgroup_newpage_charge(struct pag
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -1787,7 +1796,7 @@ int mem_cgroup_newpage_charge(struct pag
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 static void
@@ -1880,14 +1889,14 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
 	/* drop extra refcnt from tryget */
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
 }
 
 static void
@@ -1903,7 +1912,7 @@ __mem_cgroup_commit_charge_swapin(struct
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, ctype);
+	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -1952,11 +1961,12 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem);
+	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
 }
 
 static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+	      int page_size)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
@@ -1989,14 +1999,14 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += PAGE_SIZE;
+	batch->bytes += page_size;
 	if (uncharge_memsw)
-		batch->memsw_bytes += PAGE_SIZE;
+		batch->memsw_bytes += page_size;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, page_size);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, page_size);
 	return;
 }
 
@@ -2009,6 +2019,10 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2048,7 +2062,7 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype);
+		__do_uncharge(mem, ctype, page_size);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
 	mem_cgroup_charge_statistics(mem, pc, false);
@@ -2217,7 +2231,7 @@ int mem_cgroup_prepare_migration(struct 
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
-						page);
+					      page, PAGE_SIZE);
 		css_put(&mem->css);
 	}
 	*ptr = mem;
@@ -2260,7 +2274,7 @@ void mem_cgroup_end_migration(struct mem
 	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
 	 * So, double-counting is effectively avoided.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
 
 	/*
 	 * Both of oldpage and newpage are still under lock_page().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03 of 30] alter compound get_page/put_page
  2010-01-21  6:20 ` [PATCH 03 of 30] alter compound get_page/put_page Andrea Arcangeli
@ 2010-01-21 17:35   ` Dave Hansen
  2010-01-23 17:39     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Hansen @ 2010-01-21 17:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, 2010-01-21 at 07:20 +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Alter compound get_page/put_page to keep references on subpages too, in order
> to allow __split_huge_page_refcount to split an hugepage even while subpages
> have been pinned by one of the get_user_pages() variants.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -43,6 +43,14 @@ static noinline int gup_pte_range(pmd_t 
>  		page = pte_page(pte);
>  		if (!page_cache_get_speculative(page))
>  			return 0;
> +		if (PageTail(page)) {
> +			/*
> +			 * __split_huge_page_refcount() cannot run
> +			 * from under us.
> +			 */
> +			VM_BUG_ON(atomic_read(&page->_count) < 0);
> +			atomic_inc(&page->_count);
> +		}
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			put_page(page);
>  			return 0;
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -128,6 +128,14 @@ static noinline int gup_huge_pmd(pmd_t p
>  	do {
>  		VM_BUG_ON(compound_head(page) != head);
>  		pages[*nr] = page;
> +		if (PageTail(page)) {
> +			/*
> +			 * __split_huge_page_refcount() cannot run
> +			 * from under us.
> +			 */
> +			VM_BUG_ON(atomic_read(&page->_count) < 0);
> +			atomic_inc(&page->_count);
> +		}

Christoph kinda has a point here.  The gup code is going to be a pretty
hot path for some people, and this does add a bunch of atomics that some
people will have no need for.

It's also a decent place to put a helper function anyway.

void pin_huge_page_tail(struct page *page)
{
	/*
	 * This ensures that a __split_huge_page_refcount()
	 * running underneath us cannot 
	 */
	VM_BUG_ON(atomic_read(&page->_count) < 0);
	atomic_inc(&page->_count);
}

It'll keep us from putting the same comment in too many arches, I guess.
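
For example the x86 gup_huge_pmd() hunk above would then read something
like this (rough sketch using the helper name from above, not tested):

	do {
		VM_BUG_ON(compound_head(page) != head);
		pages[*nr] = page;
		if (PageTail(page))
			pin_huge_page_tail(page);
		(*nr)++;
		page++;
		refs++;
	} while (addr += PAGE_SIZE, addr != end);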

>  static inline void get_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));

Hmm.

	if 

>  	atomic_inc(&page->_count);
> +	if (unlikely(PageTail(page))) {
> +		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> +		atomic_inc(&page->first_page->_count);
> +		/* __split_huge_page_refcount can't run under get_page */
> +		VM_BUG_ON(!PageTail(page));
> +	}
>  }

Are you hoping to catch a race in progress with the second VM_BUG_ON()
here?  Maybe the comment should say, "detect race with
__split_huge_page_refcount".
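
I.e. keep the logic as is and just make the second check
self-documenting, something like (same code as your hunk, only the
comment reworded):

static inline void get_page(struct page *page)
{
	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
	atomic_inc(&page->_count);
	if (unlikely(PageTail(page))) {
		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
		atomic_inc(&page->first_page->_count);
		/* detect race with __split_huge_page_refcount */
		VM_BUG_ON(!PageTail(page));
	}
}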

>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -409,7 +409,8 @@ static inline void __ClearPageTail(struc
>  	 1 << PG_private | 1 << PG_private_2 | \
>  	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
>  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
> +	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> +	 1 << PG_compound_lock)

Nit: should probably go in the last patch.

>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -55,17 +55,80 @@ static void __page_cache_release(struct 
>  		del_page_from_lru(zone, page);
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> +}
> +
> +static void __put_single_page(struct page *page)
> +{
> +	__page_cache_release(page);
>  	free_hot_page(page);
>  }
> 
> +static void __put_compound_page(struct page *page)
> +{
> +	compound_page_dtor *dtor;
> +
> +	__page_cache_release(page);
> +	dtor = get_compound_page_dtor(page);
> +	(*dtor)(page);
> +}
> +
>  static void put_compound_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	if (put_page_testzero(page)) {
> -		compound_page_dtor *dtor;
> -
> -		dtor = get_compound_page_dtor(page);
> -		(*dtor)(page);
> +	if (unlikely(PageTail(page))) {
> +		/* __split_huge_page_refcount can run under us */
> +		struct page *page_head = page->first_page;
> +		smp_rmb();
> +		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
> +			if (unlikely(!PageHead(page_head))) {
> +				/* PageHead is cleared after PageTail */
> +				smp_rmb();
> +				VM_BUG_ON(PageTail(page));
> +				goto out_put_head;
> +			}
> +			/*
> +			 * Only run compound_lock on a valid PageHead,
> +			 * after having it pinned with
> +			 * get_page_unless_zero() above.
> +			 */
> +			smp_mb();
> +			/* page_head wasn't a dangling pointer */
> +			compound_lock(page_head);
> +			if (unlikely(!PageTail(page))) {
> +				/* __split_huge_page_refcount run before us */
> +				compound_unlock(page_head);
> +			out_put_head:
> +				put_page(page_head);
> +			out_put_single:
> +				if (put_page_testzero(page))
> +					__put_single_page(page);
> +				return;
> +			}
> +			VM_BUG_ON(page_head != page->first_page);
> +			/*
> +			 * We can release the refcount taken by
> +			 * get_page_unless_zero now that
> +			 * split_huge_page_refcount is blocked on the
> +			 * compound_lock.
> +			 */
> +			if (put_page_testzero(page_head))
> +				VM_BUG_ON(1);
> +			/* __split_huge_page_refcount will wait now */
> +			VM_BUG_ON(atomic_read(&page->_count) <= 0);
> +			atomic_dec(&page->_count);
> +			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> +			if (put_page_testzero(page_head))
> +				__put_compound_page(page_head);
> +			else
> +				compound_unlock(page_head);
> +			return;
> +		} else
> +			/* page_head is a dangling pointer */
> +			goto out_put_single;
> +	} else if (put_page_testzero(page)) {
> +		if (PageHead(page))
> +			__put_compound_page(page);
> +		else
> +			__put_single_page(page);
>  	}
>  }

That looks functional to me, although the code is pretty darn dense. :)
But, I'm not sure there's a better way to do it.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 04 of 30] clear compound mapping
  2010-01-21  6:20 ` [PATCH 04 of 30] clear compound mapping Andrea Arcangeli
@ 2010-01-21 17:43   ` Dave Hansen
  2010-01-23 17:55     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Hansen @ 2010-01-21 17:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, 2010-01-21 at 07:20 +0100, Andrea Arcangeli wrote:
> Clear compound mapping for anonymous compound pages like it already happens for
> regular anonymous pages.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -584,6 +584,8 @@ static void __free_pages_ok(struct page 
> 
>  	kmemcheck_free_shadow(page, order);
> 
> +	if (PageAnon(page))
> +		page->mapping = NULL;
>  	for (i = 0 ; i < (1 << order) ; ++i)
>  		bad += free_pages_check(page + i);
>  	if (bad)

This one may at least need a bit of an enhanced patch description.  I
didn't immediately remember that __free_pages_ok() is only actually
called for compound pages.

Would it make more sense to pull the page->mapping=NULL out of
free_hot_cold_page(), and just put a single one in __free_pages()?

I guess we'd also need one in free_compound_page() since it calls
__free_pages_ok() directly.  But, if this patch were modifying
free_compound_page() it would at least be super obvious what was going
on.
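
Something like this is what I have in mind for the free_compound_page()
half of it (untested sketch, just to illustrate; the __free_pages() side
would get the same two lines):

static void free_compound_page(struct page *page)
{
	if (PageAnon(page))
		page->mapping = NULL;
	__free_pages_ok(page, compound_order(page));
}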

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 11 of 30] add pmd mangling functions to x86
  2010-01-21  6:20 ` [PATCH 11 of 30] add pmd mangling functions to x86 Andrea Arcangeli
@ 2010-01-21 17:47   ` Dave Hansen
  2010-01-21 19:14     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Hansen @ 2010-01-21 17:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, 2010-01-21 at 07:20 +0100, Andrea Arcangeli wrote:
> @@ -351,7 +410,7 @@ static inline unsigned long pmd_page_vad
>   * Currently stuck as a macro due to indirect forward reference to
>   * linux/mmzone.h's __section_mem_map_addr() definition:
>   */
> -#define pmd_page(pmd)  pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> +#define pmd_page(pmd)  pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)

Is there some new use of the high pmd bits or something?  I'm a bit
confused why this is getting modified.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 11 of 30] add pmd mangling functions to x86
  2010-01-21 17:47   ` Dave Hansen
@ 2010-01-21 19:14     ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21 19:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Jan 21, 2010 at 09:47:56AM -0800, Dave Hansen wrote:
> On Thu, 2010-01-21 at 07:20 +0100, Andrea Arcangeli wrote:
> > @@ -351,7 +410,7 @@ static inline unsigned long pmd_page_vad
> >   * Currently stuck as a macro due to indirect forward reference to
> >   * linux/mmzone.h's __section_mem_map_addr() definition:
> >   */
> > -#define pmd_page(pmd)  pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> > +#define pmd_page(pmd)  pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
> 
> Is there some new use of the high pmd bits or something?  I'm a bit
> confused why this is getting modified.

The NX bit is properly enabled on the huge pmd too.
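
In other words, the huge pmd now carries flag bits (NX is bit 63 on
x86-64) that have to be masked off before computing the pfn.
Schematically, with a made-up pfn:

	pmdval_t val = ((pmdval_t)0x12345 << PAGE_SHIFT) | _PAGE_PSE | _PAGE_NX;

	/* NX (bit 63) leaks into the "pfn": */
	unsigned long bad_pfn = val >> PAGE_SHIFT;
	/* flag bits masked off first, pfn comes out as 0x12345: */
	unsigned long good_pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;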

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 22 of 30] pmd_trans_huge migrate bugcheck
  2010-01-21  6:20 ` [PATCH 22 of 30] pmd_trans_huge migrate bugcheck Andrea Arcangeli
@ 2010-01-21 20:40   ` Christoph Lameter
  2010-01-21 23:01     ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2010-01-21 20:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright

On Thu, 21 Jan 2010, Andrea Arcangeli wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
>
> No pmd_trans_huge should ever materialize in migration ptes areas, because
> try_to_unmap will split the hugepage before migration ptes are instantiated.

try_to_unmap? How do you isolate the hugepages from the LRU? If you
isolate the huge pages via the LRU and get a 2M page, then the
migration logic has to be modified to be aware that huge pages may
split during try_to_unmap.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 22 of 30] pmd_trans_huge migrate bugcheck
  2010-01-21 20:40   ` Christoph Lameter
@ 2010-01-21 23:01     ` Andrea Arcangeli
  2010-01-21 23:17       ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21 23:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, Jan 21, 2010 at 02:40:41PM -0600, Christoph Lameter wrote:
> On Thu, 21 Jan 2010, Andrea Arcangeli wrote:
> 
> > From: Andrea Arcangeli <aarcange@redhat.com>
> >
> > No pmd_trans_huge should ever materialize in migration ptes areas, because
> > try_to_unmap will split the hugepage before migration ptes are instantiated.
> 
> try_to_unmap? How do you isolate the hugepages from the LRU? If you do
> isolate the huge pages via a LRU and get a 2M page then the migration
> logic has to be modified to be aware that huge pages may split during try_to_unmap.

Good point. All we need to do is add one split_huge_page before
isolate_lru_page; the one in try_to_unmap isn't enough. Effectively I
guess I can then remove the one in try_to_unmap and replace it with
BUG_ON(PageTransHuge(page)).

Subject: pmd_trans_huge migrate

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -99,6 +99,7 @@ static int remove_migration_pte(struct p
 		goto out;
 
 	pmd = pmd_offset(pud, addr);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (!pmd_present(*pmd))
 		goto out;
 
@@ -833,6 +834,9 @@ static int do_move_page_to_node_array(st
 				!migrate_all)
 			goto put_and_set;
 
+		if (unlikely(PageTransHuge(page)))
+			if (unlikely(split_huge_page(page)))
+				goto put_and_set;
 		err = isolate_lru_page(page);
 		if (!err) {
 			list_add_tail(&page->lru, &pagelist);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 22 of 30] pmd_trans_huge migrate bugcheck
  2010-01-21 23:01     ` Andrea Arcangeli
@ 2010-01-21 23:17       ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-21 23:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, Jan 22, 2010 at 12:01:27AM +0100, Andrea Arcangeli wrote:
> @@ -833,6 +834,9 @@ static int do_move_page_to_node_array(st
>  				!migrate_all)
>  			goto put_and_set;
>  
> +		if (unlikely(PageTransHuge(page)))
> +			if (unlikely(split_huge_page(page)))
> +				goto put_and_set;
>  		err = isolate_lru_page(page);
>  		if (!err) {
>  			list_add_tail(&page->lru, &pagelist);

That patch was too hasty: I have to move this a few lines up so the
mapcount check works too (also note that PageTransHuge bugs on tail
pages, and I'd like to keep it that way to be stricter on the other
users, so it should be replaced by PageCompound in addition to being
moved up). Refcounting will adjust automatically and atomically during
the split; the mapcount will simply be >0 on the tail page after the
split, and tail_page->_count will be boosted by the mapcount too.
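
I.e. something along these lines (just a sketch; the exact placement
relative to the existing checks in do_move_page_to_node_array() is
what still needs fixing up):

	/*
	 * Split before the page_mapcount() check; PageCompound rather
	 * than PageTransHuge, because the latter deliberately bugs on
	 * tail pages.
	 */
	if (unlikely(PageCompound(page)))
		if (unlikely(split_huge_page(page)))
			goto put_and_set;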

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 27 of 30] memcg compound
  2010-01-21 15:44     ` Andrea Arcangeli
@ 2010-01-21 23:55       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-21 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 21 Jan 2010 16:44:08 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Thu, Jan 21, 2010 at 04:07:59PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 21 Jan 2010 07:20:51 +0100
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > Teach memcg to charge/uncharge compound pages.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > I'm sorry but I'm glad if you don't touch fast path.
> > 
> > if (likely(page_size == PAGE_SIZE))
> > 	if (consume_stock(mem))
> > 		goto charged;
> > 
> > is my recommendation.
> 
> Ok, updated. But I haven't touched this code since the last submit,
> because I didn't merge the other patch (not yet in mainline) that you
> said would complicate things. So I assume most of it will need to be
> rewritten. I also thought you wanted to remove the hpage size from
> the batch logic.
> 
I see. Thank you.

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-21 16:08     ` Andrea Arcangeli
@ 2010-01-22  0:13       ` KAMEZAWA Hiroyuki
  2010-01-27 11:27         ` Balbir Singh
  0 siblings, 1 reply; 79+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-22  0:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 21 Jan 2010 17:08:07 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> > > @@ -228,6 +229,7 @@ static int __do_huge_pmd_anonymous_page(
> > >  
> > >  	spin_lock(&mm->page_table_lock);
> > >  	if (unlikely(!pmd_none(*pmd))) {
> > > +		mem_cgroup_uncharge_page(page);
> > >  		put_page(page);
> > >  		pte_free(mm, pgtable);
> > 
> On Thu, Jan 21, 2010 at 04:16:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > Can't we do this put_page() and uncharge() outside of page table lock ?
> 
> Yes we can, but it's only a micro-optimization because this only
> triggers during a controlled race condition across different
> threads. But it's no problem to optimize it...
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -228,6 +228,7 @@ static int __do_huge_pmd_anonymous_page(
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> @@ -238,8 +239,8 @@ static int __do_huge_pmd_anonymous_page(
>  		page_add_new_anon_rmap(page, vma, haddr);
>  		set_pmd_at(mm, haddr, pmd, entry);
>  		prepare_pmd_huge_pte(pgtable, mm);
> +		spin_unlock(&mm->page_table_lock);
>  	}
> -	spin_unlock(&mm->page_table_lock);
>  
>  	return ret;
>  }
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -230,6 +230,7 @@ static int __do_huge_pmd_anonymous_page(
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
>  		spin_unlock(&mm->page_table_lock);
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> 
> 
> Also, below I appended the updated memcg compound patch that stops
> using the batch system. Note that I removed your "likely" because for
> KVM most of the time it will be transparent hugepages being charged.
> I prefer likely/unlikely only when something is always a slow path no
> matter what the workload (assuming useful/optimized workloads only ;).
Hmm. But I don't believe KVM will be the "likely" case for a few years.

> Like I said in an earlier email, I guess the below may be wasted
> effort because of the rework coming for this file. Also note that the
> PageTransHuge check here is used instead of page_size == PAGE_SIZE to
> eliminate that additional branch at compile time if
> TRANSPARENT_HUGEPAGE=n.
> 
seems nice.

> Now the only real pain that remains is the LRU list accounting. I
> tried to solve it but found no clean way that didn't require making a
> mess all over vmscan.c. So for now hugepages on the lru are accounted
> as 4k pages ;). Nothing breaks, the stats just won't be as useful to
> the admin...
> 
Hmm, interesting/important problem...I keep it in my mind.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (29 preceding siblings ...)
  2010-01-21  6:20 ` [PATCH 30 of 30] khugepaged Andrea Arcangeli
@ 2010-01-22 14:46 ` Christoph Lameter
  2010-01-22 15:19   ` Andrea Arcangeli
  2010-01-26 11:24 ` Mel Gorman
  31 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2010-01-22 14:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright

Just thinking about yesterday's fix to page migration:

This means that huge pages are unstable, right? Kernel code cannot
establish a reference to a 2M/4M page and be sure that the page is not
broken up due to something in the VM that cannot handle huge pages?

We need special locking for this?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-22 14:46 ` [PATCH 00 of 30] Transparent Hugepage support #3 Christoph Lameter
@ 2010-01-22 15:19   ` Andrea Arcangeli
  2010-01-22 16:51     ` Christoph Lameter
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-22 15:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, Jan 22, 2010 at 08:46:50AM -0600, Christoph Lameter wrote:
> Jus thinking about yesterdays fix to page migration:
> 
> This means that huge pages are unstable right? Kernel code cannot
> establish a reference to a 2M/4M page and be sure that the page is not
> broken up due to something in the VM that cannot handle huge pages?

Physically speaking, DMA-wise they cannot be broken up; the only thing
that gets broken up is the pmd, which instead of mapping the page
directly starts to map a pte. Nothing changes on the physical side of
hugepages. khugepaged only collapses pages into hugepages if there are
no references at all (no gup, nothing), so again no issue DMA-wise.

> We need special locking for this?

The only special locking is to take the page_table_lock if
pmd_trans_huge is set. pmd_trans_huge cannot appear from under us
because we either hold the mmap_sem in read mode, or we hold the
PG_lock, or in gup_fast we have irqs disabled so the IPI of
collapse_huge_page will wait. It's all handled transparently by the
patch; you won't notice you're dealing with a hugepage if you're a gup
user (unless you use gup to migrate pages, in which case calling
split_huge_page is enough, as in the patch ;).
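
For a generic pagetable walker the pattern is roughly this (a sketch
only, using the helpers introduced by the patchset):

	pmd = pmd_offset(pud, addr);
	spin_lock(&mm->page_table_lock);
	if (pmd_trans_huge(*pmd)) {
		/* pin the hugepage so it can't go away, then split it */
		struct page *page = pmd_page(*pmd);
		get_page(page);
		spin_unlock(&mm->page_table_lock);
		split_huge_page(page);	/* pmd now points to a pte page */
		put_page(page);
	} else
		spin_unlock(&mm->page_table_lock);
	/* continue with the usual pte walk */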

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-22 15:19   ` Andrea Arcangeli
@ 2010-01-22 16:51     ` Christoph Lameter
  2010-01-23 17:58       ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2010-01-22 16:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, 22 Jan 2010, Andrea Arcangeli wrote:

> On Fri, Jan 22, 2010 at 08:46:50AM -0600, Christoph Lameter wrote:
> > Jus thinking about yesterdays fix to page migration:
> >
> > This means that huge pages are unstable right? Kernel code cannot
> > establish a reference to a 2M/4M page and be sure that the page is not
> > broken up due to something in the VM that cannot handle huge pages?
>
> Physically speaking DMA-wise they cannot be broken up, only thing that
> gets broken up is the pmd that instead of mapping the page directly
> starts to map the pte. Nothing changes on the physical side of
> hugepages. khugepaged only collapse pages into hugepages if there are
> no references at all (no gup no nothing) so again no issue DMA-wise.

Reclaim cannot kick out page size pieces of the huge page?

> have irq disabled so the ipi of collapse_huge_page will wait. It's all
> handled transparently by the patch, you won't notice you're dealing
> with hugepage if you're gup user (unless you use gup to migrate pages
> in which case calling split_huge_page is enough like in patch ;).

What if I want to use hugepages for some purpose and I don't want to
use 512 pointers to keep track of the individual pieces?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03 of 30] alter compound get_page/put_page
  2010-01-21 17:35   ` Dave Hansen
@ 2010-01-23 17:39     ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-23 17:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Jan 21, 2010 at 09:35:46AM -0800, Dave Hansen wrote:
> Christoph kinda has a point here.  The gup code is going to be a pretty
> hot path for some people, and this does add a bunch of atomics that some
> people will have no need for.
> 
> It's also a decent place to put a helper function anyway.
> 
> void pin_huge_page_tail(struct page *page)
> {
> 	/*
> 	 * This ensures that a __split_huge_page_refcount()
> 	 * running underneath us cannot 
> 	 */
> 	VM_BUG_ON(atomic_read(&page->_count) < 0);
> 	atomic_inc(&page->_count);
> }
> 
> It'll keep us from putting the same comment in too many arches, I guess

We can replace the compound_lock with a branch by setting a
PG_trans_huge flag on all compound pages allocated by huge_memory.c;
that would only benefit gup on hugetlbfs (and it would add the cost of
one branch to gup on transparent hugepages, which is why I didn't do
it). But I can add it. Note that the compound_lock works on a cacheline
that is already hot and exclusive read-write in the L1 cache, unlike
the mmap_sem (which gup_fast avoids), but surely an atomic op is still
more costly than just a branch...

> >  static inline void get_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> > +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> 
> Hmm.

This means that if the page is not a tail page, the count must be >= 1
(which is stricter and more correct than the preexisting == 0 check,
which should really be <= 0). If the page is a tail page, the bugcheck
is only for < 0, because tail pages are only pinned by gup, and if
there is no gup going on there is no pin on tail pages either.
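
Spelled out, the new check is equivalent to:

	if (!PageTail(page))
		/* head or regular page: caller must already hold a reference */
		VM_BUG_ON(atomic_read(&page->_count) < 1);
	else
		/* tail page: zero is legal when no gup is in flight */
		VM_BUG_ON(atomic_read(&page->_count) < 0);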

> 
> 	if 
> 
> >  	atomic_inc(&page->_count);
> > +	if (unlikely(PageTail(page))) {
> > +		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> > +		atomic_inc(&page->first_page->_count);
> > +		/* __split_huge_page_refcount can't run under get_page */
> > +		VM_BUG_ON(!PageTail(page));
> > +	}
> >  }
> 
> Are you hoping to catch a race in progress with the second VM_BUG_ON()
> here?  Maybe the comment should say, "detect race with
> __split_huge_page_refcount".

Exactly. I think the current comment was explicit enough. But frankly
this is pure paranoia, and I suspect gcc can eliminate the bugcheck
entirely because atomic_inc doesn't clobber "memory", so I'll remove
the bugcheck instead but keep the current comment.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -322,10 +322,13 @@ static inline void get_page(struct page 
 	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
 	if (unlikely(PageTail(page))) {
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount can't run under
+		 * get_page().
+		 */
 		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
 		atomic_inc(&page->first_page->_count);
-		/* __split_huge_page_refcount can't run under get_page */
-		VM_BUG_ON(!PageTail(page));
 	}
 }
 


> >  static inline struct page *virt_to_head_page(const void *x)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -409,7 +409,8 @@ static inline void __ClearPageTail(struc
> >  	 1 << PG_private | 1 << PG_private_2 | \
> >  	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
> >  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> > -	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
> > +	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> > +	 1 << PG_compound_lock)
> 
> Nit: should probably go in the last patch.

Why? If you apply this single patch we already want to immediately
detect whether somebody is taking the compound_lock but forgetting to
compound_unlock before freeing the page, just like with PG_lock. There
may be other nits in how I split the original monolith without having
to rewrite lots of intermediate code, but this one looks ok to me, or
at least I don't see why it should move elsewhere ;).

> That looks functional to me, although the code is pretty darn dense. :)
> But, I'm not sure there's a better way to do it.

I'm not sure either.

If you or Christoph or anybody else asks me to add a PG_trans_huge
flag, set by huge_memory.c immediately after allocating the hugepage,
and to make the above put_page/get_page tail pinning and compound_lock
entirely conditional on PG_trans_huge being set, I'll do it
immediately. As said, it would replace around 2 atomic ops on each
gup/put_page run on a tail page allocated by hugetlbfs (i.e. not
through the transparent hugepage framework) with a branch, so it would
practically eliminate the overhead caused to O_DIRECT over hugetlbfs.
I'm not doing it unless explicitly asked because:

1) it will make the code even a little more dense

2) it will slightly slow down transparent hugepage gup (which means
O_DIRECT over transparent hugepages and the kvm minor fault will have
to pay one more branch than necessary)

It might be a worthwhile tradeoff, but I'm not a big believer in
hugetlbfs optimizations (unless they're entirely self-contained), which
is why I'm not inclined to do it unless explicitly asked. I think we
should rather think about how to speed up gup on transparent hugepages,
and secondly we should add transparent hugepage support starting with
tmpfs, probably.

As you guessed, I also couldn't think of a more efficient way than
using this compound_lock on tail pages to allow the proper atomic
adjustment of the tail page refcounts in __split_huge_page_refcount.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 04 of 30] clear compound mapping
  2010-01-21 17:43   ` Dave Hansen
@ 2010-01-23 17:55     ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-23 17:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Jan 21, 2010 at 09:43:30AM -0800, Dave Hansen wrote:
> On Thu, 2010-01-21 at 07:20 +0100, Andrea Arcangeli wrote:
> > Clear compound mapping for anonymous compound pages like it already happens for
> > regular anonymous pages.
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -584,6 +584,8 @@ static void __free_pages_ok(struct page 
> > 
> >  	kmemcheck_free_shadow(page, order);
> > 
> > +	if (PageAnon(page))
> > +		page->mapping = NULL;
> >  	for (i = 0 ; i < (1 << order) ; ++i)
> >  		bad += free_pages_check(page + i);
> >  	if (bad)
> 
> This one may at least need a bit of an enhanced patch description.  I
> didn't immediately remember that __free_pages_ok() is only actually
> called for compound pages.

In short, the problem is that the mapping is only cleared when the
page is freed through free_hot_cold_page, so we also have to clear it
when we don't pass through free_hot_cold_page.

> Would it make more sense to pull the page->mapping=NULL out of
> free_hot_cold_page(), and just put a single one in __free_pages()?
>
> I guess we'd also need one in free_compound_page() since it calls
> __free_pages_ok() directly.  But, if this patch were putting modifying
> free_compound_page() it would at least be super obvious what was going
> on.

I could also use set_compound_page_dtor and have my own callback that
calls free_compound_page. Or I could move it to __free_one_page and
remove the one from free_hot_cold_page. What do you prefer?
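
The dtor variant would look roughly like this (a sketch; the callback
name is made up and it assumes free_compound_page() is reachable from
the new code):

static void huge_anon_compound_dtor(struct page *page)
{
	if (PageAnon(page))
		page->mapping = NULL;
	free_compound_page(page);
}

	/* right after allocating the hugepage: */
	set_compound_page_dtor(page, huge_anon_compound_dtor);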

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-22 16:51     ` Christoph Lameter
@ 2010-01-23 17:58       ` Andrea Arcangeli
  2010-01-25 21:50         ` Christoph Lameter
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-23 17:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, Jan 22, 2010 at 10:51:35AM -0600, Christoph Lameter wrote:
> On Fri, 22 Jan 2010, Andrea Arcangeli wrote:
> 
> > On Fri, Jan 22, 2010 at 08:46:50AM -0600, Christoph Lameter wrote:
> > > Jus thinking about yesterdays fix to page migration:
> > >
> > > This means that huge pages are unstable right? Kernel code cannot
> > > establish a reference to a 2M/4M page and be sure that the page is not
> > > broken up due to something in the VM that cannot handle huge pages?
> >
> > Physically speaking DMA-wise they cannot be broken up, only thing that
> > gets broken up is the pmd that instead of mapping the page directly
> > starts to map the pte. Nothing changes on the physical side of
> > hugepages. khugepaged only collapse pages into hugepages if there are
> > no references at all (no gup no nothing) so again no issue DMA-wise.
> 
> Reclaim cannot kick out page size pieces of the huge page?

Before the VM can kick out any hugepage it has to split it; each
page-sized piece is then considered individually, so reclaim only ever
kicks out page-sized pieces of the hugepage.

> > have irq disabled so the ipi of collapse_huge_page will wait. It's all
> > handled transparently by the patch, you won't notice you're dealing
> > with hugepage if you're gup user (unless you use gup to migrate pages
> > in which case calling split_huge_page is enough like in patch ;).
> 
> What if I want to use hugepages for some purpose and I dont want to use
> 512 pointers to keep track of the individual pieces?

If you use hugepages and there's no VM activity (or other activity)
that triggers split_huge_page, there are no 512 pointers, just 1
pointer in the pmd to the hugepage, and no other link at all. There is
also one preallocated, uninitialized, all-zero pte page queued in the
mm in case we have to split the hugepage later, but it has no pointers
to the hugepage at all (it only gets those if the page is split later
for some reason, and then the pmd will point to the preallocated pte
page instead of to the hugepage directly).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-23 17:58       ` Andrea Arcangeli
@ 2010-01-25 21:50         ` Christoph Lameter
  2010-01-25 22:46           ` Andrea Arcangeli
  2010-01-26  0:52           ` Rik van Riel
  0 siblings, 2 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-25 21:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Sat, 23 Jan 2010, Andrea Arcangeli wrote:

> > > hugepages. khugepaged only collapse pages into hugepages if there are
> > > no references at all (no gup no nothing) so again no issue DMA-wise.
> >
> > Reclaim cannot kick out page size pieces of the huge page?
>
> Before the VM can kick out any hugepage it has to split it, then each
> page-sized-piece will be considered individually, so reclaim only
> kicks out page-sized-pieces of the hugepage.

So yes.... Sigh.

> > > have irq disabled so the ipi of collapse_huge_page will wait. It's all
> > > handled transparently by the patch, you won't notice you're dealing
> > > with hugepage if you're gup user (unless you use gup to migrate pages
> > > in which case calling split_huge_page is enough like in patch ;).
> >
> > What if I want to use hugepages for some purpose and I dont want to use
> > 512 pointers to keep track of the individual pieces?
>
> If you use hugepages and there's no VM activity or other activity that
> triggers split_huge_page, there are no 512 pointers, but just 1
> pointer in the pmd to the hugepage, and no other link at all. There is

There is always VM activity, so we need 512 pointers. Sigh.

So it's not possible to use these "huge" pages in a useful way inside
the kernel. They are volatile and temporary.

In short they cannot be treated as 2M entities unless we add some logic to
prevent splitting.

Frankly this seems to be adding splitting that cannot be used if one
really wants to use large pages for something.

I still think we should get transparent huge page support straight up
first, without complicated fallback schemes that make huge pages
difficult to use.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-25 21:50         ` Christoph Lameter
@ 2010-01-25 22:46           ` Andrea Arcangeli
  2010-01-26 15:47             ` Christoph Lameter
  2010-01-26  0:52           ` Rik van Riel
  1 sibling, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-25 22:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Mon, Jan 25, 2010 at 03:50:31PM -0600, Christoph Lameter wrote:
> There is always VM activity, so we need 512 pointers sigh.

Well, you said a few weeks ago that actual systems never swap and that
swap is useless... if they don't swap there will be just 1 pointer in
the pmd. We want mprotect/mremap to learn to use pmd_trans_huge
natively without splitting, but again, this is incremental work.

> So its not possible to use these "huge" pages in a useful way inside of
> the kernel. They are volatile and temporary.

They are so useless that firefox never splits them; this is my laptop,
with khugepaged running, so if there's swapout they will be collapsed
back into hugepages after swapin.

AnonPages:        357148 kB
AnonHugePages:     53248 kB

> In short they cannot be treated as 2M entities unless we add some logic to
> prevent splitting.

They can on the physical side; splitting only involves the virtual
side, which is why O_DIRECT DMA through gup already works on hugepages
without splitting them.

> Frankly this seems to be adding splitting that cannot be used if one
> really wants to use large pages for something.
> 
> I still think we should get transparent huge page support straight up
> first without complicated fallback schemes that makes huge pages difficult
> to use.

Just send me patches to remove all callers of split_huge_page, and
then split_huge_page can go away too. But saying that hugepages aren't
already useful is absurd: kvm with the "madvise" sysfs default already
gets the full benefit; nothing more can be achieved by kvm in
performance and functionality than what my patch already delivers (ok,
swapping will be a little more efficient if done through 2M I/O, but
swap performance isn't that critical). Our objective is to eliminate
the need for split_huge_page over time. khugepaged will remain required
forever, unless the whole of kernel ram becomes relocatable and defrag
becomes not just a heuristic but a guarantee (it is needed after one VM
exits and releases several gigs of hugepages, so that the other VMs get
the speedup).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-25 21:50         ` Christoph Lameter
  2010-01-25 22:46           ` Andrea Arcangeli
@ 2010-01-26  0:52           ` Rik van Riel
  2010-01-26  6:53             ` Gleb Natapov
  2010-01-26 15:54             ` Christoph Lameter
  1 sibling, 2 replies; 79+ messages in thread
From: Rik van Riel @ 2010-01-26  0:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On 01/25/2010 04:50 PM, Christoph Lameter wrote:

> So its not possible to use these "huge" pages in a useful way inside of
> the kernel. They are volatile and temporary.

> In short they cannot be treated as 2M entities unless we add some logic to
> prevent splitting.
>
> Frankly this seems to be adding splitting that cannot be used if one
> really wants to use large pages for something.

What exactly do you need the stable huge pages for?

Do you have anything specific in mind that we should take
into account?

Want to send in an incremental patch that can temporarily block
the pageout code from splitting up a huge page, so your direct
users of huge pages can rely on them sticking around until the
transaction is done?

> I still think we should get transparent huge page support straight up
> first without complicated fallback schemes that makes huge pages difficult
> to use.

Without swapping, they will become difficult to use for system
administrators, at least in the workloads we care about.

I understand that your workloads may be different.

Please tell us what you need, instead of focussing on what you
don't want, and we may be able to keep the code in such a shape
that you can easily add your functionality.

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26  0:52           ` Rik van Riel
@ 2010-01-26  6:53             ` Gleb Natapov
  2010-01-26 12:35               ` Andrea Arcangeli
  2010-01-26 15:54             ` Christoph Lameter
  1 sibling, 1 reply; 79+ messages in thread
From: Gleb Natapov @ 2010-01-26  6:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton

On Mon, Jan 25, 2010 at 07:52:16PM -0500, Rik van Riel wrote:
> On 01/25/2010 04:50 PM, Christoph Lameter wrote:
> 
> >So its not possible to use these "huge" pages in a useful way inside of
> >the kernel. They are volatile and temporary.
> 
> >In short they cannot be treated as 2M entities unless we add some logic to
> >prevent splitting.
> >
> >Frankly this seems to be adding splitting that cannot be used if one
> >really wants to use large pages for something.
> 
> What exactly do you need the stable huge pages for?
> 
> Do you have anything specific in mind that we should take
> into account?
> 
> Want to send in an incremental patch that can temporarily block
> the pageout code from splitting up a huge page, so your direct
> users of huge pages can rely on them sticking around until the
> transaction is done?
> 
Shouldn't mlock() do the trick?

--
			Gleb.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
                   ` (30 preceding siblings ...)
  2010-01-22 14:46 ` [PATCH 00 of 30] Transparent Hugepage support #3 Christoph Lameter
@ 2010-01-26 11:24 ` Mel Gorman
  31 siblings, 0 replies; 79+ messages in thread
From: Mel Gorman @ 2010-01-26 11:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Jan 21, 2010 at 07:20:24AM +0100, Andrea Arcangeli wrote:
> - I need to integrate Mel's memory compation code to be used by khugepaged and
>   by the page faults if "defrag" sysfs file setting requires it. His results
>   (especially with the bug fixes that decreased reclaim a lot) looks promising.
> 

As a heads-up on this: I have a V2 ready for testing but no test
machines available to start any of the tests until the weekend at the
earliest. There are no major changes though, just some cleanup, a sysfs
prototype and the like. The basic mechanics are the same.

> - likely we'll need a slab front allocator too allocating in 2m chunks, but
>   this should be re-evaluated after merging Mel's work, maybe he already did
>   that.
> 

He didn't, but it occurs to me that it could be tested with SLUB by
mucking around with the min_order parameters.
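
For example (untested, and order 9 assumes 4k base pages, i.e. 2M
slabs), booting with

	slub_min_order=9 slub_max_order=9

on the kernel command line should force every SLUB slab to be a 2M
allocation, which would roughly emulate a 2M slab front allocator.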

> - khugepaged isn't yet capable of merging readonly shared anon pages, that
>   isn't needed by KVM (KVM uses MADV_DONTFORK) but it might want to learn it
>   for other apps
> 
> - khugepaged should also learn to skip the copy and collapse the hugepage
>   in-place, if possible (to undo the effect of surpious split_huge_page)
> 
> I'm leaving this under a continous stress with scan_sleep_millisecs and
> defrag_sleep_millisecs set to 0 and a 5G swap storm + ~4G in ram. The swap storm
> before settling in pause() will call madvise to split all hugepages in ram and
> then it will run a further memset again to swapin everything a second time.
> Eventually it will settle and khugepaged will remerge as many hugepages as
> they're fully mapped in userland (mapped as swapcache is ok, but khugepaged
> will not trigger swapin I/O or swapcache minor fault) if there are enough not
> fragmented hugepages available.
> 
> This is shortly after start.
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> 0  5 3938052  67688    208   3684 1219 2779  1239  2779  396 1061  1  5 75 20
> 2  5 3937092  61612    208   3712 25120 24112 25120 24112 7420 5396  0  8 44 48
> 0  5 3932116  55536    208   3780 26444 21468 26444 21468 7532 5399  0  8 52 40
> 0  5 3927264  46724    208   3296 28208 22528 28328 22528 7871 5722  0  7 52 41
> AnonPages:       1751352 kB
> AnonHugePages:   2021376 kB
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> 0  5 3935604  58092    208   3864 1233 2787  1253  2787  400 1061  1  5 74 20
> 0  5 3933924  54248    208   3508 23748 23548 23904 23548 7112 4829  0  6 49 45
> 1  4 3937708  60696    208   3704 24680 28680 24760 28680 7034 5112  0  8 50 42
> 1  4 3934508  59084    208   3304 24096 21020 24156 21020 6832 5015  0  7 48 46
> AnonPages:       1746296 kB
> AnonHugePages:   2023424 kB
> 
> this is after it settled and it's waiting in pause(). khugepaged when it's not
> copying with defrag_sleep/scan_sleep both = 0, just trigers a
> superoverschedule, but as you can see it's extremely low overhead, only taking
> 8% of 4 cores or 32% of 1 core. Likely most of the cpu is taking by schedule().
> So you can imagine how low overhead it is when sleep is set to a "production"
> level and not stress test level. Default sleep is 10seconds and not 2usec...
> 
> 1  0 5680228 106028    396   5060    0    0     0     0  534 341005  0  8 92  0
> 1  0 5680228 106028    396   5060    0    0     0     0  517 349159  0  9 91  0
> 1  0 5680228 106028    396   5060    0    0     0     0  518 346356  0  6 94  0
> 0  0 5680228 106028    396   5060    0    0     0     0  511 348478  0  8 92  0
> AnonPages:        392396 kB
> AnonHugePages:   3371008 kB
> 
> So it looks good so far.
> 
> I think it's probably time to port the patchset to mmotd.
> Further review welcome!
> 
> Andrea
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26  6:53             ` Gleb Natapov
@ 2010-01-26 12:35               ` Andrea Arcangeli
  2010-01-26 15:55                 ` Christoph Lameter
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 12:35 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Rik van Riel, Christoph Lameter, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton

On Tue, Jan 26, 2010 at 08:53:03AM +0200, Gleb Natapov wrote:
> On Mon, Jan 25, 2010 at 07:52:16PM -0500, Rik van Riel wrote:
> > On 01/25/2010 04:50 PM, Christoph Lameter wrote:
> > 
> > >So its not possible to use these "huge" pages in a useful way inside of
> > >the kernel. They are volatile and temporary.
> > 
> > >In short they cannot be treated as 2M entities unless we add some logic to
> > >prevent splitting.
> > >
> > >Frankly this seems to be adding splitting that cannot be used if one
> > >really wants to use large pages for something.
> > 
> > What exactly do you need the stable huge pages for?
> > 
> > Do you have anything specific in mind that we should take
> > into account?
> > 
> > Want to send in an incremental patch that can temporarily block
> > the pageout code from splitting up a huge page, so your direct
> > users of huge pages can rely on them sticking around until the
> > transaction is done?
> > 
> Shouldn't mlock() do the trick?

gup already does the trick of preventing swapping of just the pieces
that are pinned. But that's ok only for temporary direct access like
DMA. Ideally, if access to the page can be stopped synchronously and
the mapping is longstanding (not something DMA can do, so O_DIRECT
can't do it), an mmu notifier should be used, to allow paging of the
page and to tear down the secondary mmu mapping.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-25 22:46           ` Andrea Arcangeli
@ 2010-01-26 15:47             ` Christoph Lameter
  2010-01-26 16:11               ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 15:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Mon, 25 Jan 2010, Andrea Arcangeli wrote:

> On Mon, Jan 25, 2010 at 03:50:31PM -0600, Christoph Lameter wrote:
> > There is always VM activity, so we need 512 pointers sigh.
>
> well you said some week ago that actual systems never swap and swap is
> useless... if they don't swap there will be just 1 pointer in the
> pmd. The mprotect/mremap we want to learn using pmd_trans_huge
> natively without split but again, this is incremental work.

I have to disable swap to be able to make use of these huge pages?

> > So its not possible to use these "huge" pages in a useful way inside of
> > the kernel. They are volatile and temporary.
>
> They are so useless that firefox never splits them, this is my
> laptop. khugepaged running so if there's swapout, after swapin they
> will be collapsed back into hugepages.

Just because your configuration did not split does not mean that there
is a guarantee of them not splitting. You need to guarantee that the VM
does not split them in order to be able to safely refer to them from
code (like I/O paths).

> > In short they cannot be treated as 2M entities unless we add some logic to
> > prevent splitting.
>
> They can on the physical side, splitting only involves the virtual
> side, this is why O_DIRECT DMA through gup already works on hugepages
> without splitting them.

Earlier you stated that reclaim can remove 4k pieces of huge pages after a
split. How does gup keep the huge pages stable while doing I/O? Does gup
submit 512 pointers to 4k chunks or 1 pointer to a 2M chunk?

> Just send me patches to remove all callers of split_huge_page, then
> split_huge_page can go away too. But saying that hugepages aren't
> useful already is absurd, kvm with "madvise" default of sysfs already
> gets the full benefit, nothing more can be achieved by kvm in

This implementation seems to only address the TLB pressure issue
but not the scaling issue that arises because we have to handle data in
4k chunks (512 4k pointers instead of one 2M pointer). Scaling is not
addressed because complex fallback logic sabotages a basic benefit of
huge pages.

> performance and functionality than what my patch delivers already
> (ok swapping will be a little more efficient if done through 2M I/O
> but swap performance isn't so critical). Our objective is to over time
> eliminate the need of split_huge_page. khugepaged will remain required

Ok then establish some way to make these huge pages stable.

> forever, unless the whole kernel ram will become relocatable and
> defrag not just an heuristic but a guarantee (it is needed after one
> VM exits and release several gigs of hugepages, so the other VM get
> the speedup).

That all depends on what you mean by guarantee I guess.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26  0:52           ` Rik van Riel
  2010-01-26  6:53             ` Gleb Natapov
@ 2010-01-26 15:54             ` Christoph Lameter
  2010-01-26 16:16               ` Andrea Arcangeli
  2010-01-26 23:07               ` Rik van Riel
  1 sibling, 2 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 15:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Mon, 25 Jan 2010, Rik van Riel wrote:

> What exactly do you need the stable huge pages for?

Reduce the VM overhead that arises because we have to handle memory in
4k chunks. I.e. we have to submit 512 descriptors for 4k-sized chunks
to do I/O, get_user_pages has to pin 512 pages to get a safe reference,
and reclaim has to scan 4k chunks of memory. As the amount of memory
increases, so does the number of metadata chunks that have to be
handled by the VM and the I/O subsystem.

> Want to send in an incremental patch that can temporarily block
> the pageout code from splitting up a huge page, so your direct
> users of huge pages can rely on them sticking around until the
> transaction is done?

Do we need the splitting? It seems that Andrea's firefox never needs to
split a huge page anyways.... ;-)

> > I still think we should get transparent huge page support straight up
> > first without complicated fallback schemes that makes huge pages difficult
> > to use.
>
> Without swapping, they will become difficult to use for system
> administrators, at least in the workloads we care about.

Huge pages are already in use through hugetlbfs for such workloads.
That works without swap. So why is this suddenly such a must-have
requirement?

Why not swap 2M huge pages as a whole?


> I understand that your workloads may be different.

What in your workload forces hugetlb swap use? Just leaving a certain
percentage of memory for 4k pages addresses the issue right now.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 12:35               ` Andrea Arcangeli
@ 2010-01-26 15:55                 ` Christoph Lameter
  2010-01-26 16:19                   ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 15:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Gleb Natapov, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton

On Tue, 26 Jan 2010, Andrea Arcangeli wrote:

> gup already does the trick of preventing swapping of only the pieces
> that are pinned. But it's ok only for temporary direct access like
> DMA, ideally if the access to the page can be stopped synchronously
> and the mapping is longstanding (not something dma can do, so O_DIRECT
> can't do) mmu notifier should be used to allow paging of the page and
> teardown the secondary mmu mapping.

How does it do that? Take a reference on each of the 512 pieces? Or does
it take one reference?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 15:47             ` Christoph Lameter
@ 2010-01-26 16:11               ` Andrea Arcangeli
  2010-01-26 16:30                 ` Christoph Lameter
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 16:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 09:47:51AM -0600, Christoph Lameter wrote:
> I have to disable swap to be able to make use of these huge pages?

No.

> Just because your configuration did not split does not mean that there
> is a guarantee of them not splitting. You need to guarantee that the VM
> does not split them in order to be able to safely refer to them from
> code (like I/O paths).

No. O_DIRECT already works on those pages without splitting them;
there is no need to split them, just run 512 gups like you would if
those weren't hugepages.

If your I/O can be interrupted then just use an mmu notifier, call
gup_fast, and be notified if anything runs that splits the page.

Splitting the page doesn't mean relocating it; DMA won't be able to
notice. So if you use an mmu notifier, just 1 gup + put_page will be
enough, exactly because with an mmu notifier you won't need refcounting
on tail pages and head pages at all!

If you don't have a longstanding mapping and a way to synchronously
interrupt the visibility of hugepages from your device, then you likely
work with small DMA sizes like storage and networking do, and gup on
each 4k piece will be fine.

> Earlier you stated that reclaim can remove 4k pieces of huge pages after a
> split. How does gup keep the huge pages stable while doing I/O? Does gup
> submit 512 pointers to 4k chunks or 1 pointer to a 2M chunk?

gup works as it does now: you just write code that works today on a
fragmented hugepage, and it'll still work. So you need to run 512
gup_fast calls to be sure all 4k fragments are stable. But if you can
use an mmu notifier, just one gup_fast(&head_page); put_page(head_page)
will be enough once you're registered.

I'm unsure exactly what you need to do that wouldn't be feasible with
an mmu notifier and 1 gup or 512 gups.
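
For the 512-gup case it really is just the usual pinning pattern (a
sketch; addr is assumed to be the start of a 2M-aligned user range):

	struct page *pages[512];	/* 512 * 4k = 2M */
	int i, pinned;

	pinned = get_user_pages_fast(addr, 512, 1, pages);
	/* the pinned 4k pieces stay put for DMA even if the pmd is split */
	/* ... do the I/O ... */
	for (i = 0; i < pinned; i++)
		put_page(pages[i]);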

> This implementation seems to only address the TLB pressure issue
> but not the scaling issue that arises because we have to handle data in
> 4k chunks (512 4k pointers instead of one 2M pointer). Scaling is not
> addressed because complex fallback logic sabotages a basic benefit of
> huge pages.

Scaling is addressed for everything, including collapsing hugepages
back after swapin if they were fragmented because of it. Furthermore
we want to remove split_huge_page from as many paths as possible, but
Rome wasn't built in a day. We need to stabilize and stress this code
now, then get it included, and later extend it to tmpfs and pagecache.

Note that a malloc(3G)+memset(3G) takes >5sec with lockdep without
transparent hugepage, or <2sec after "echo always >enabled". TLB
pressure is irrelevant in that workload, which spends all its time
allocating pages and clearing them through the kernel direct
mapping. Your idea that this is only taking care of TLB pressure is
totally wrong and I posted benchmarks already as proof (the difference
becomes extreme the moment you enable lockdep and all the little locks
become more costly, so avoiding 512 page faults and doing a single call
to alloc_pages(order=9) speeds up the workload by more than 100%).
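
(The benchmark referred to above is essentially the userspace snippet
below -- a sketch reconstructed from the text, with the timing method
as an assumption -- run once with /sys/kernel/mm/transparent_hugepage/enabled
set to "never" and once with "always":)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (3UL * 1024 * 1024 * 1024)    /* the 3G from the example */

int main(void)
{
    struct timespec t0, t1;
    char *p = malloc(SIZE);

    if (!p)
        return 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* touches every page: 512 faults per 2M without THP, 1 fault with it */
    memset(p, 0xff, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("memset(3G): %.2f sec\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    free(p);
    return 0;
}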

> > performance and functionality than what my patch delivers already
> > (ok swapping will be a little more efficient if done through 2M I/O
> > but swap performance isn't so critical). Our objective is to over time
> > eliminate the need of split_huge_page. khugepaged will remain required
> 
> Ok then establish some way to make these huge pages stable.

Again: register into mmu notifier, call gup_fast; put_page, and you're
done. One op, and just 3 cachelines for pgd, pud and pmd to get to the page.

> That all depends on what you mean by guarantee I guess.

mmu notifier is a must if the mapping is longstanding or you'll lock
the ram. It's also a lot more efficient than doing 512 gup_fast which
would achieve the same effect but it's evil against the VM (lock the
user virtual memory in ram) and requires 512 gup instead of just 1.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 15:54             ` Christoph Lameter
@ 2010-01-26 16:16               ` Andrea Arcangeli
  2010-01-26 16:24                 ` Andi Kleen
                                   ` (2 more replies)
  2010-01-26 23:07               ` Rik van Riel
  1 sibling, 3 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 16:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 09:54:59AM -0600, Christoph Lameter wrote:
> Huge pages are already in use through hugetlbs for such workloads. That
> works without swap. So why is this suddenly such a must have requirement?

hugetlbfs is unusable when you're not doing a static alloc for 1 DBMS
in 1 machine with alloc size set in a config file that will then match
grub command line.

> Why not swap 2M huge pages as a whole?

That is nice thing to speedup swap bandwidth and reduce fragmentation,
just I couldn't make so many changes in one go. Later we can make this
change and remove a few split_huge_page from the rmap paths.

> What in your workload forces hugetlb swap use? Just leaving a certain
> percentage of memory for 4k pages addresses the issue right now.

hypervisor must be able to swap, furthermore when a VM exists we want
to be able to use that ram as pagecache (not to remain reserved in
some hugetlbfs). And we must be able to fallback to 4k allocations
always without userland being able to notice when unable to defrag,
all things hugetlbfs can't do. All designs that can't 100% fallback to
4k allocations are useless in my view as far as you want to keep the
word "transparent" in the description of the patch...


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 15:55                 ` Christoph Lameter
@ 2010-01-26 16:19                   ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 16:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gleb Natapov, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton

On Tue, Jan 26, 2010 at 09:55:43AM -0600, Christoph Lameter wrote:
> How does it do that? Take a reference on each of the 512 pieces? Or does
> it take one reference?

_zero_ references! gup_fast is there only to page things in; in fact we
need to add a new type of gup_fast that won't take a reference at all
and only ensures the pmd_trans_huge pmd or the regular pte is mapped
before returning (or pages it in before returning if it wasn't). With
mmu notifier it is always wasteful to take page
pins.

So for now you will run put_page immediately after gup_fast
returns. The whole point of mmu notifier is not to require any
refcount on the pages (or if there are, they are forced to be released
by the mmu notifier methods before they return, otherwise they defeat
the whole purpose of registering into mmu notifier). So the best is
not to take refcounts at all.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:16               ` Andrea Arcangeli
@ 2010-01-26 16:24                 ` Andi Kleen
  2010-01-26 16:37                 ` Christoph Lameter
  2010-01-26 16:42                 ` Mel Gorman
  2 siblings, 0 replies; 79+ messages in thread
From: Andi Kleen @ 2010-01-26 16:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton

On Tue, Jan 26, 2010 at 05:16:25PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 26, 2010 at 09:54:59AM -0600, Christoph Lameter wrote:
> > Huge pages are already in use through hugetlbs for such workloads. That
> > works without swap. So why is this suddenly such a must have requirement?
> 
> hugetlbfs is unusable when you're not doing a static alloc for 1 DBMS
> in 1 machine with alloc size set in a config file that will then match
> grub command line.

AFAIK that's not true for 2MB pages after all the enhancements
Andy/Mel/et al. did to the defragmentation heuristics, assuming
you have enough memory (or define movable zones).

hugetlbfs also does the transparent fallback. It's not pretty,
but it seems to work for a lot of people.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:11               ` Andrea Arcangeli
@ 2010-01-26 16:30                 ` Christoph Lameter
  2010-01-26 16:45                   ` Andrea Arcangeli
  2010-01-26 17:09                   ` Avi Kivity
  0 siblings, 2 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 16:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, 26 Jan 2010, Andrea Arcangeli wrote:
> No. O_DIRECT already works on those pages without splitting them,
> there is no need to split them, just run 512 gups like you would be
> doing if those weren't hugepages.

That shows the scaling issue is not solved.

> If your I/O can be interrupted then just use mmu notifier, call
> gup_fast, and be notified if anything runs that split the page.

"Just use mmu notifier"? Its quite a cost to register/unregister the
memory range. Again unnecessary complexity here.

> Splitting the page doesn't mean relocating it, DMA won't be able to
> notice. So if you use mmu notifier just 1 gup + put_page will be
> enough exactly because with mmu notifier you won't need refcounting on
> tail pages and head pages at all!

Page migration can relocate 4k pieces it seems.

> If you don't have longstanding mapping and a way to synchronously
> interrupt the visibility of hugepages from your device, then likely
> you work with small dma sizes like storage and networking does, and
> gup each 4k will be fine.

"synchronously interrupt the visibility of hugepages"???? What does that
mean?

> > Earlier you stated that reclaim can remove 4k pieces of huge pages after a
> > split. How does gup keep the huge pages stable while doing I/O? Does gup
> > submit 512 pointers to 4k chunks or 1 pointer to a 2M chunk?
>
> gup works like now, you just write code that works today on a
> fragmented hugepage, and it'll still work. So you need to run 512 gup_fast
> to be sure all 4k fragments are stable. But if you can use mmu
> notifier just one gup_fast(&head_page), put_page(head_page) will be
> enough after you're registered.

Just don't want that... Would like to have one 2M page, not 512 4k pages.

> Note that a malloc(3G)+memset(3G) takes >5sec with lockdep without
> transparent hugepage, or <2sec after "echo always >enabled". TLB
> pressure is irrelevant in that workload, which spends all its time
> allocating pages and clearing them through the kernel direct
> mapping. Your idea that this is only taking care of TLB pressure is
> totally wrong and I posted benchmarks already as proof (the difference
> becomes extreme the moment you enable lockdep and all the little locks
> become more costly, so avoiding 512 page faults and doing a single call
> to alloc_pages(order=9) speeds up the workload by more than 100%).

So the allocation works in 2M chunks. Okay that scales at that point but
code cannot rely on these 2M chunks to continue to exist without
ancilliary expensive measures (mmu notifier)

> > That all depends on what you mean by guarantee I guess.
>
> mmu notifier is a must if the mapping is longstanding or you'll lock
> the ram. It's also a lot more efficient than doing 512 gup_fast which
> would achieve the same effect but it's evil against the VM (lock the
> user virtual memory in ram) and requires 512 gup instead of just 1.

mmu notifier is expensive. The earlier implementations were able to get a
stable huge page reference by simply doing a get_page().



* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:16               ` Andrea Arcangeli
  2010-01-26 16:24                 ` Andi Kleen
@ 2010-01-26 16:37                 ` Christoph Lameter
  2010-01-26 16:42                 ` Mel Gorman
  2 siblings, 0 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 16:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, 26 Jan 2010, Andrea Arcangeli wrote:

> On Tue, Jan 26, 2010 at 09:54:59AM -0600, Christoph Lameter wrote:
> > Huge pages are already in use through hugetlbs for such workloads. That
> > works without swap. So why is this suddenly such a must have requirement?
>
> hugetlbfs is unusable when you're not doing a static alloc for 1 DBMS
> in 1 machine with alloc size set in a config file that will then match
> grub command line.

Huge pages can be allocated / freed while the system is running. This has
been true for a long time.

> > Why not swap 2M huge pages as a whole?
>
> That is nice thing to speedup swap bandwidth and reduce fragmentation,
> just I couldn't make so many changes in one go. Later we can make this
> change and remove a few split_huge_page from the rmap paths.

You would have to mmu register these 2M pages in order to swap them to
disk with an operation that writes 2M in one go?

> > What in your workload forces hugetlb swap use? Just leaving a certain
> > percentage of memory for 4k pages addresses the issue right now.
>
> hypervisor must be able to swap, furthermore when a VM exists we want
> to be able to use that ram as pagecache (not to remain reserved in
> some hugetlbfs). And we must be able to fallback to 4k allocations
> always without userland being able to notice when unable to defrag,
> all things hugetlbfs can't do. All designs that can't 100% fallback to
> 4k allocations are useless in my view as far as you want to keep the
> word "transparent" in the description of the patch...

If the page cache can use huge pages then you can use that ram as page
cache.

Transparency is only necessary at the system API layer where user code
interacts with the kernel services. What the kernel internally does can be
different. 100% fallback within the kernel is not needed. 100% OS
interface compatibility is.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:16               ` Andrea Arcangeli
  2010-01-26 16:24                 ` Andi Kleen
  2010-01-26 16:37                 ` Christoph Lameter
@ 2010-01-26 16:42                 ` Mel Gorman
  2010-01-26 16:52                   ` Andrea Arcangeli
  2 siblings, 1 reply; 79+ messages in thread
From: Mel Gorman @ 2010-01-26 16:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 05:16:25PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 26, 2010 at 09:54:59AM -0600, Christoph Lameter wrote:
> > Huge pages are already in use through hugetlbs for such workloads. That
> > works without swap. So why is this suddenly such a must have requirement?
> 
> hugetlbfs is unusable when you're not doing a static alloc for 1 DBMS
> in 1 machine with alloc size set in a config file that will then match
> grub command line.
> 

That may have been the case once upon a time but is far from accurate
now. I routinely run benchmarks against a database using huge pages that
isn't even hugepage-aware without going through insane steps. The huge
pages are often allocated when the system has already been running
several days (and in one case a few weeks) and I didn't have to be
overly specific on how many huge pages I needed either as additional
ones were allocated as required.

hugetlbfs may not be ideal, but it's not quite as catastrophic as
commonly believed either.

> > Why not swap 2M huge pages as a whole?
> 
> That is nice thing to speedup swap bandwidth and reduce fragmentation,
> just I couldn't make so many changes in one go. Later we can make this
> change and remove a few split_huge_page from the rmap paths.
> 
> > What in your workload forces hugetlb swap use? Just leaving a certain
> > percentage of memory for 4k pages addresses the issue right now.
> 
> hypervisor must be able to swap, furthermore when a VM exists we want
> to be able to use that ram as pagecache (not to remain reserved in
> some hugetlbfs). And we must be able to fallback to 4k allocations
> always without userland being able to notice when unable to defrag,
> all things hugetlbfs can't do. All designs that can't 100% fallback to
> 4k allocations are useless in my view as far as you want to keep the
> word "transparent" in the description of the patch...
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:30                 ` Christoph Lameter
@ 2010-01-26 16:45                   ` Andrea Arcangeli
  2010-01-26 18:23                     ` Christoph Lameter
  2010-01-26 17:09                   ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 16:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 10:30:43AM -0600, Christoph Lameter wrote:
> So the allocation works in 2M chunks. Okay that scales at that point but
> code cannot rely on these 2M chunks to continue to exist without
> ancilliary expensive measures (mmu notifier)

mmu notifier "ancilliary expensive measures"? Ask robin with xpmem,
check gru and see how slower kvm runs thanks to mmu notifier..

All you're asking is in the future to also add a 2M-wide pin in gup,
that is not what the current API provides, and so it requires a new
gup_huge_fast API, not the current one, and it is feasible! Just not
done in this implementation as it'd make things more complex.

Just stop this red herring of yours that a replacement of the
pmd_trans_huge with a pte has anything to do with the physical side of
the hugepage. Splitting the page doesn't alter the physical side at
all, it's a _virtual_ split, and your remaining argument is how to
take a global pin on only the head page and having it distributed to
all tail pages when the virtual split happens. This is utterly
unnecessary overhead to all subsystems using mmu notifier, but it
might speedup O_DIRECT a little bit on hugepages, so it may happen
later.

> mmu notifier is expensive. The earlier implementations were able to get a
> stable huge page reference by simply doing a get_page().

That only works on hugetlbfs and it's not a property of gup_fast. It
breaks if userland maps a different mapping under you or if
libhugetlbfs is unloaded. We can extend the refcounting logic to
achieve the same "1 op pins 2M" feature but first of all we need a new
API gup_huge_fast, current api doesn't allow it. I think it's simply
wise to wait stabilizing this, before adding a new feature but if you
want me to do it now that's ok with me.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:42                 ` Mel Gorman
@ 2010-01-26 16:52                   ` Andrea Arcangeli
  2010-01-26 17:26                     ` Mel Gorman
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 16:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

> hugetlbfs may not be ideal, but it's not quite as catastrophic as
> commonly believed either.

I want 100% of userbase to take advantage of it, hugetlbfs isn't even
mounted by default... and there is no way to use libhugetlbfs by
default.

I think hugetlbfs is fine for a niche of users (for those power users
kernel hackers and huge DBMS it may also be better than transparent
hugepage and they should keep using it!!! thanks to being able to
reserve pages at boot), but for the 99% of userbase it's exactly as
catastrophic as commonly believed. Otherwise I am 100% sure that I
wouldn't be the first one on linux to decrease the tlb misses with 2M
pages while watching videos on youtube (>60M on hugepages will happen
with atom netbook). And that's nothing compared to many other
workloads. Yes not so important for desktop but on server especially
with EPT/NPT it's a must and hugetlbfs is as catastrophic as on
"default desktop" in the virtualization cloud.


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:30                 ` Christoph Lameter
  2010-01-26 16:45                   ` Andrea Arcangeli
@ 2010-01-26 17:09                   ` Avi Kivity
  1 sibling, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2010-01-26 17:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On 01/26/2010 06:30 PM, Christoph Lameter wrote:
> On Tue, 26 Jan 2010, Andrea Arcangeli wrote:
>    
>> No. O_DIRECT already works on those pages without splitting them,
>> there is no need to split them, just run 512 gups like you would be
>> doing if those weren't hugepages.
>>      
> That shows the scaling issue is not solved.
>    

Well, gup works for a range of addresses, so all you need is one call, and
I'm sure it can be optimized to take advantage of transparent huge pages.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:52                   ` Andrea Arcangeli
@ 2010-01-26 17:26                     ` Mel Gorman
  2010-01-26 19:46                       ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Mel Gorman @ 2010-01-26 17:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 05:52:55PM +0100, Andrea Arcangeli wrote:
> > hugetlbfs may not be ideal, but it's not quite as catastrophic as
> > commonly believed either.
> 
> I want 100% of userbase to take advantage of it, hugetlbfs isn't even
> mounted by default... and there is no way to use libhugetlbfs by
> default.
> 
> I think hugetlbfs is fine for a niche of users (for those power users
> kernel hackers and huge DBMS it may also be better than transparent
> hugepage and they should keep using it!!! thanks to being able to
> reserve pages at boot), but for the 99% of userbase it's exactly as
> catastrophic as commonly believed. Otherwise I am 100% sure that I
> wouldn't be the first one on linux to decrease the tlb misses with 2M

You're not, I beat you to it a long time ago. In fact, I just watched a dumb
hit smack into a treadmill (feeling badminded) with the browser using huge
pages in the background just to confirm I wasn't imagining it.  Launched with

hugectl --shm --heap epiphany-browser

HugePages_Total:       5
HugePages_Free:        1
HugePages_Rsvd:        1
HugePages_Surp:        5
Hugepagesize:       4096 kB
(Surp implies the huge pages were allocated on demand, not statically)

17:22:01 up 7 days,  1:05, 24 users,  load average: 0.62, 0.30, 0.13

Yes, this is not transparent and it's unlikely that a normal user would go
to the hassle although conceivably a distro could set a launcher to
automatically try huge pages where available.

I'm just saying that hugetlbfs and the existing utilities are not so bad
as to be slammed. Just because it's possible to do something like this does
not detract from transparent support in any way.

> pages while watching videos on youtube (>60M on hugepages will happen
> with atom netbook). And that's nothing compared to many other
> workloads. Yes not so important for desktop but on server especially
> with EPT/NPT it's a must and hugetlbfs is as catastrophic as on
> "default desktop" in the virtualization cloud.
> 

In virtualisation in particular, the lack of swapping makes hugetlbfs a
no-go in its current form. No doubt about it, and the transparent
support will certainly shine with respect to KVM.

On the flip-side, architecture limitations likely make transparent
support a no-go on IA-64 and very likely PPC64 so it doesn't solve
everything either.

The existing stuff will continue to exist alongside transparent support
because they are ideal in different situations.

FWIW, I'm still reading through the patches and have not spotted anything
new that is problematic but I'm only half-way through. By and large, I'm
pro-the-patches but am somewhat compelled to defend hugetlbfs :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 16:45                   ` Andrea Arcangeli
@ 2010-01-26 18:23                     ` Christoph Lameter
  0 siblings, 0 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-26 18:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, 26 Jan 2010, Andrea Arcangeli wrote:

> > So the allocation works in 2M chunks. Okay that scales at that point but
> > code cannot rely on these 2M chunks to continue to exist without
> > ancilliary expensive measures (mmu notifier)
>
> mmu notifier "ancilliary expensive measures"? Ask robin with xpmem,
> check gru and see how slower kvm runs thanks to mmu notifier..

This is overhead that will be there every time you get a reference on a 2M
page. It's not a one-time thing like with xpmem and kvm.

> All you're asking is in the future to also add a 2M-wide pin in gup,
> that is not what the current API provides, and so it requires a new
> gup_huge_fast API, not the current one, and it is feasible! Just not
> done in this implementation as it'd make things more complex.

I have never asked for anything in gup. gup does not break up a huge page.

> Just stop this red herring of yours that a replacement of the
> pmd_trans_huge with a pte has anything to do with the physical side of
> the hugepage. Splitting the page doesn't alter the physical side at
> all, it's a _virtual_ split, and your remaining argument is how to

Splitting a page allows reclaim / page migration to occur on 4k components
of the huge page. Thus you are not guaranteed the integrity of your 2M
page.

> take a global pin on only the head page and having it distributed to
> all tail pages when the virtual split happens. This is utterly
> unnecessary overhead to all subsystems using mmu notifier, but it
> might speedup O_DIRECT a little bit on hugepages, so it may happen
> later.

Simply taking a refcount on the head page of a compound page should pin it
for good until the refcount is released. These are established conventions
and doing so has minimal overhead.
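
(For reference, the convention being described is roughly the sketch
below, under the assumption of hugetlbfs-style compound pages whose
compound structure cannot change while the reference is held:)

#include <linux/mm.h>

/*
 * Head-page pinning: one reference on the head page keeps the whole 2M
 * compound page in place until the caller drops it.  The point of
 * contention in this thread is that a transparent hugepage can later be
 * split back into 512 independent 4k pages, so this single reference no
 * longer guarantees that the 2M unit stays intact.
 */
static struct page *pin_whole_hugepage(struct page *page)
{
    struct page *head = compound_head(page);

    get_page(head);
    return head;
}

static void unpin_whole_hugepage(struct page *head)
{
    put_page(head);
}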

> > mmu notifier is expensive. The earlier implementations were able to get a
> > stable huge page reference by simply doing a get_page().
>
> That only works on hugetlbfs and it's not a property of gup_fast. It
> breaks if userland maps a different mapping under you or if
> libhugetlbfs is unloaded. We can extend the refcounting logic to
> achieve the same "1 op pins 2M" feature but first of all we need a new
> API gup_huge_fast, current api doesn't allow it. I think it's simply
> wise to wait stabilizing this, before adding a new feature but if you
> want me to do it now that's ok with me.

Maybe you can explain why gup is so important to you? It's one example of
establishing references to pages. Establishing a page reference is a basic
feature of the Linux operating system, and doing so pins that page in
memory until the code is done and releases the refcount.

Why not keep the same semantics for huge pages instead of all the
complicated stuff here? No need for a compound lock etc etc.

If you want to break up a huge page then make sure that all refcounts on
the head page are accounted for and then convert it to 4k chunks.
Basically a form of page migration and it does not require any new
VM semantics or locks.






* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 17:26                     ` Mel Gorman
@ 2010-01-26 19:46                       ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-26 19:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Rik van Riel, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, Jan 26, 2010 at 05:26:13PM +0000, Mel Gorman wrote:
> You're not, I beat you to it a long time ago. In fact, I just watched a dumb
> hit smack into a treadmill (feeling badminded) with the browser using huge
> pages in the background just to confirm I wasn't imagining it.  Launched with
> 
> hugectl --shm --heap epiphany-browser
> 
> HugePages_Total:       5
> HugePages_Free:        1
> HugePages_Rsvd:        1
> HugePages_Surp:        5
> Hugepagesize:       4096 kB
> (Surp implies the huge pages were allocated on demand, not statically)

eheh ;)

> Yes, this is not transparent and it's unlikely that a normal user would go
> to the hassle although conceivably a distro could set a launcher to
> automatically try huge pages where available.

It'll never happen, I think hugetlbfs can't even be mounted by default
on all distros... or it's not writable, otherwise it's an mlock
DoS...

> I'm just saying that hugetlbfs and the existing utilities are not so bad
> as to be slammed. Just because it's possible to do something like this does
> not detract from transparent support in any way.

Agreed, power users can already take advantage of hugepages, and I don't
object to that; the problem is most people can't, and we want to take
advantage of them not just in firefox but wherever possible. Another
app using hugepages is knotify4, for example.

> In virtualisation in particular, the lack of swapping makes hugetlbfs a
> no-go in its current form. No doubt about it, and the transparent
> support will certainly shine with respect to KVM.

Exactly.

> On the flip-side, architecture limitations likely make transparent
> support a no-go on IA-64 and very likely PPC64 so it doesn't solve
> everything either.

Exactly! This is how we discovered that hugetlbfs will stay around
maybe forever, regardless of how transparent hugepage support expands
over the tmpfs/pagecache layer.

> The existing stuff will continue to exist alongside transparent support
> because they are ideal in different situations.

Agreed.

> FWIW, I'm still reading through the patches and have not spotted anything
> new that is problematic but I'm only half-way through. By and large, I'm
> pro-the-patches but am somewhat compelled to defend hugetlbfs :)

NOTE: I very much defend hugetlbfs too! But not for using it with
firefox on the desktop or in the virtualization cloud. For a DBMS,
hugetlbfs may remain a superior solution to transparent hugepage
because of its fine-grained reservation capabilities. We're in full
agreement ;).


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 15:54             ` Christoph Lameter
  2010-01-26 16:16               ` Andrea Arcangeli
@ 2010-01-26 23:07               ` Rik van Riel
  2010-01-27 18:33                 ` Christoph Lameter
  1 sibling, 1 reply; 79+ messages in thread
From: Rik van Riel @ 2010-01-26 23:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On 01/26/2010 10:54 AM, Christoph Lameter wrote:
> On Mon, 25 Jan 2010, Rik van Riel wrote:

>>> I still think we should get transparent huge page support straight up
>>> first without complicated fallback schemes that makes huge pages difficult
>>> to use.
>>
>> Without swapping, they will become difficult to use for system
>> administrators, at least in the workloads we care about.
>
> Huge pages are already in use through hugetlbs for such workloads. That
> works without swap. So why is this suddenly such a must have requirement?
>
> Why not swap 2M huge pages as a whole?

A few reasons:

1) Fragmentation of swap space (or the need for a separate
    swap area for 2MB pages)

2) There is no code to allow us to swap out 2MB pages

3) Internal fragmentation.  While 4kB pages are smaller than
    the objects allocated by many programs, it is likely that
    most 2MB pages contain both frequently used and rarely
    used malloced objects.  Swapping out just the rarely used
    4kB pages from a number of 2MB pages allows us to keep all
    of the frequently used data in memory.

    Swapping out 2MB pages, on the other hand, makes it harder
    to keep the working set in memory. TLB misses are much cheaper
    than major page faults.

-- 
All rights reversed.


* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-22  0:13       ` KAMEZAWA Hiroyuki
@ 2010-01-27 11:27         ` Balbir Singh
  2010-01-28  0:50           ` Daisuke Nishimura
  2010-01-28 11:39           ` Andrea Arcangeli
  0 siblings, 2 replies; 79+ messages in thread
From: Balbir Singh @ 2010-01-27 11:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

On Friday 22 January 2010 05:43 AM, KAMEZAWA Hiroyuki wrote:
> 
>> Now the only real pain remains in the LRU list accounting, I tried to
>> solve it but found no clean way that didn't require mess all over
>> vmscan.c. So for now hugepages in lru are accounted as 4k pages
>> ;). Nothing breaks just stats won't be as useful to the admin...
>>
> Hmm, interesting/important problem...I keep it in my mind.

I hope the memcg accounting is not broken, I see you do the right thing
while charging pages. The patch overall seems alright. Could you please
update the Documentation/cgroups/memory.txt file as well with what these
changes mean and memcg_tests.txt to indicate how to test the changes?

-- 
Three Cheers,
Balbir Singh


* Re: [PATCH 00 of 30] Transparent Hugepage support #3
  2010-01-26 23:07               ` Rik van Riel
@ 2010-01-27 18:33                 ` Christoph Lameter
  0 siblings, 0 replies; 79+ messages in thread
From: Christoph Lameter @ 2010-01-27 18:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Tue, 26 Jan 2010, Rik van Riel wrote:

> > Huge pages are already in use through hugetlbs for such workloads. That
> > works without swap. So why is this suddenly such a must have requirement?
> >
> > Why not swap 2M huge pages as a whole?
>
> A few reasons:
>
> 1) Fragmentation of swap space (or the need for a separate
>    swap area for 2MB pages)

Swap is already statically allocated. Would not be too difficult to add a
2M area.

> 2) There is no code to allow us to swap out 2MB pages

If the page descriptors stay the same for huge pages (one page struct
describes one 2MB page without any of the weird stuff added in this set)
then it's simple to do with minor modifications to the existing code.

> 3) Internal fragmentation.  While 4kB pages are smaller than
>    the objects allocated by many programs, it is likely that
>    most 2MB pages contain both frequently used and rarely
>    used malloced objects.  Swapping out just the rarely used
>    4kB pages from a number of 2MB pages allows us to keep all
>    of the frequently used data in memory.

But that makes the huge page vanish. So no benefit at all from the huge
page logic. Just overhead.

>    Swapping out 2MB pages, on the other hand, makes it harder
>    to keep the working set in memory. TLB misses are much cheaper
>    than major page faults.

True. That's why one should not swap if one wants decent performance.



* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-27 11:27         ` Balbir Singh
@ 2010-01-28  0:50           ` Daisuke Nishimura
  2010-01-28 11:39           ` Andrea Arcangeli
  1 sibling, 0 replies; 79+ messages in thread
From: Daisuke Nishimura @ 2010-01-28  0:50 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Rik van Riel, Mel Gorman, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton,
	Daisuke Nishimura

On Wed, 27 Jan 2010 16:57:00 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Friday 22 January 2010 05:43 AM, KAMEZAWA Hiroyuki wrote:
> > 
> >> Now the only real pain remains in the LRU list accounting, I tried to
> >> solve it but found no clean way that didn't require mess all over
> >> vmscan.c. So for now hugepages in lru are accounted as 4k pages
> >> ;). Nothing breaks just stats won't be as useful to the admin...
> >>
> > Hmm, interesting/important problem...I keep it in my mind.
> 
> I hope the memcg accounting is not broken, I see you do the right thing
> while charging pages. The patch overall seems alright. Could you please
> update the Documentation/cgroups/memory.txt file as well with what these
> changes mean and memcg_tests.txt to indicate how to test the changes?
> 
I think we need to update memcg's stats too. Otherwise the usage_in_bytes in the root
cgroup becomes wrong (of course, those stats are also important for other cgroups).
If a new vm_stat for transparent hugepage is added, it would be better to add it
to memcg too.

Moreover, considering the behavior of split_huge_page, we should update both
css->refcnt and pc->mem_cgroup for all the tail pages. Otherwise, if a transparent
hugepage is split, its tail pages become stale from the viewpoint of memcg,
i.e. those pages are not linked to any memcg's LRU.
Where to update that data is another topic. IMHO, css->refcnt can be updated
in try_charge/uncharge (I think __css_get()/__css_put(), which are now defined in mmotm,
can be used for it w/o adding big overhead). As for pc->mem_cgroup, I think it would
be better to update them by adding a hook in __split_huge_page_map() or similar,
to avoid adding overhead to the hot path (charge/uncharge).


Thanks,
Daisuke Nishimura.


* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-27 11:27         ` Balbir Singh
  2010-01-28  0:50           ` Daisuke Nishimura
@ 2010-01-28 11:39           ` Andrea Arcangeli
  2010-01-28 12:23             ` Balbir Singh
  1 sibling, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 11:39 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

On Wed, Jan 27, 2010 at 04:57:00PM +0530, Balbir Singh wrote:
> On Friday 22 January 2010 05:43 AM, KAMEZAWA Hiroyuki wrote:
> > 
> >> Now the only real pain remains in the LRU list accounting, I tried to
> >> solve it but found no clean way that didn't require mess all over
> >> vmscan.c. So for now hugepages in lru are accounted as 4k pages
> >> ;). Nothing breaks just stats won't be as useful to the admin...
> >>
> > Hmm, interesting/important problem...I keep it in my mind.
> 
> I hope the memcg accounting is not broken, I see you do the right thing
> while charging pages. The patch overall seems alright. Could you please
> update the Documentation/cgroups/memory.txt file as well with what these
> changes mean and memcg_tests.txt to indicate how to test the changes?

Where exactly does that memory.txt go into the implementation details?
Grepping that file for the function names I changed leads to
nothing. It doesn't seem to cover internals at all. The only place in
the other file that shows some function names I could see needing an
update is this:

     At try_charge(), there are no flags to say "this page is
     charged".
     at this point, usage += PAGE_SIZE.

     At commit(), the function checks the page should be charged or
     not
     and set flags or avoid charging.(usage -= PAGE_SIZE)

     At cancel(), simply usage -= PAGE_SIZE.

but it doesn't go into much more detail than this, so all I can
imagine adding is the following, explaining how the real page size is
obtained; going into an explanation of the compound page accounting
would probably bring it to a level of detail that file didn't have in
the first place.

But again I'm very confused on what exactly you expect me to update on
that file, so if below isn't ok best would be that you send me a patch
to integrate with your signoff. That would be the preferred way to me.

Thanks!
Andrea

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -4,6 +4,10 @@ NOTE: The Memory Resource Controller has
 to as the memory controller in this document. Do not confuse memory controller
 used here with the memory controller that is used in hardware.
 
+NOTE: When in this documentation we refer to PAGE_SIZE, we actually
+mean the real page size of the page being accounted which is bigger than
+PAGE_SIZE for compound pages.
+
 Salient features
 
 a. Enable control of Anonymous, Page Cache (mapped and unmapped) and


* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-28 11:39           ` Andrea Arcangeli
@ 2010-01-28 12:23             ` Balbir Singh
  2010-01-28 12:36               ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Balbir Singh @ 2010-01-28 12:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

* Andrea Arcangeli <aarcange@redhat.com> [2010-01-28 12:39:15]:

> On Wed, Jan 27, 2010 at 04:57:00PM +0530, Balbir Singh wrote:
> > On Friday 22 January 2010 05:43 AM, KAMEZAWA Hiroyuki wrote:
> > > 
> > >> Now the only real pain remains in the LRU list accounting, I tried to
> > >> solve it but found no clean way that didn't require mess all over
> > >> vmscan.c. So for now hugepages in lru are accounted as 4k pages
> > >> ;). Nothing breaks just stats won't be as useful to the admin...
> > >>
> > > Hmm, interesting/important problem...I keep it in my mind.
> > 
> > I hope the memcg accounting is not broken, I see you do the right thing
> > while charging pages. The patch overall seems alright. Could you please
> > update the Documentation/cgroups/memory.txt file as well with what these
> > changes mean and memcg_tests.txt to indicate how to test the changes?
> 
> Where exactly does that memory.txt go into the implementation details?
> Grepping that file for the function names I changed leads to
> nothing. It doesn't seem to cover internals at all. The only place in
> the other file that shows some function names I could see needing an
> update is this:
> 
>      At try_charge(), there are no flags to say "this page is
>      charged".
>      at this point, usage += PAGE_SIZE.
> 
>      At commit(), the function checks the page should be charged or
>      not
>      and set flags or avoid charging.(usage -= PAGE_SIZE)
> 
>      At cancel(), simply usage -= PAGE_SIZE.
> 
> but it doesn't go into much more detail than this, so all I can
> imagine adding is the following, explaining how the real page size is
> obtained; going into an explanation of the compound page accounting
> would probably bring it to a level of detail that file didn't have in
> the first place.
> 

I would expect some Documentation stating the following

1. Impact of transparent hugepages on memcg
2. What does this mean to limit_in_bytes and usage_in_bytes and other
features
3. What does this mean for OOM, reclaim, etc, can there be some
side-effects.


> But again I'm very confused on what exactly you expect me to update on
> that file, so if below isn't ok best would be that you send me a patch
> to integrate with your signoff. That would be the preferred way to me.
>

I'll read through your patchset and see if I can come up with a useful
patch. 

-- 
	Balbir


* Re: [PATCH 28 of 30] memcg huge memory
  2010-01-28 12:23             ` Balbir Singh
@ 2010-01-28 12:36               ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2010-01-28 12:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Jan 28, 2010 at 05:53:14PM +0530, Balbir Singh wrote:
> I would expect some Documentation stating the following
> 
> 1. Impact of transparent hugepages on memcg

None expected, except using the right page size for compound pages as
described in my comment so far. If there is an impact visible to the
user then we've got something to fix. The only change in implementation
terms is to use the real page size instead of a fixed PAGE_SIZE.
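
(In code terms this amounts to roughly the following -- a sketch, not
the actual patch hunk, and the helper name is made up:)

#include <linux/mm.h>

/* Bytes to charge/uncharge for one page, honouring compound pages. */
static inline unsigned long memcg_charge_size(struct page *page)
{
    /* PAGE_SIZE (4k) for a regular page, 2M for a transparent hugepage. */
    return PAGE_SIZE << compound_order(page);
}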

> 2. What does this mean to limit_in_bytes and usage_in_bytes and other
> features

Dunno but I would expect no change at all.

> 3. What does this mean for OOM, reclaim, etc, can there be some
> side-effects.

Zero impact, but lru ordering isn't always guaranteed to be _identical_, as
tail pages may have to be added to the lru while the lru head is
isolated and we can't mangle the stack of another cpu that is
accessed locklessly. The same lru ordering is guaranteed however when
split_huge_page runs on a page that has PageLRU set (I add tail pages
to page_head->lru instead of the zone lru head in that case). Besides,
this lru detail may change in a future implementation and it is totally
unrelated to memcg as far as I can tell, so I see no reason to document it
there...

> I'll read through your patchset and see if I can come up with a useful
> patch. 

Ok, thanks!


end of thread

Thread overview: 79+ messages
2010-01-21  6:20 [PATCH 00 of 30] Transparent Hugepage support #3 Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 01 of 30] define MADV_HUGEPAGE Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 02 of 30] compound_lock Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 03 of 30] alter compound get_page/put_page Andrea Arcangeli
2010-01-21 17:35   ` Dave Hansen
2010-01-23 17:39     ` Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 04 of 30] clear compound mapping Andrea Arcangeli
2010-01-21 17:43   ` Dave Hansen
2010-01-23 17:55     ` Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 05 of 30] add native_set_pmd_at Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 06 of 30] add pmd paravirt ops Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 07 of 30] no paravirt version of pmd ops Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 08 of 30] export maybe_mkwrite Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 09 of 30] comment reminder in destroy_compound_page Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 10 of 30] config_transparent_hugepage Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 11 of 30] add pmd mangling functions to x86 Andrea Arcangeli
2010-01-21 17:47   ` Dave Hansen
2010-01-21 19:14     ` Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 12 of 30] add pmd mangling generic functions Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 13 of 30] special pmd_trans_* functions Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 14 of 30] bail out gup_fast on splitting pmd Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 15 of 30] pte alloc trans splitting Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 16 of 30] add pmd mmu_notifier helpers Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 17 of 30] clear page compound Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 18 of 30] add pmd_huge_pte to mm_struct Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 19 of 30] ensure mapcount is taken on head pages Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 20 of 30] split_huge_page_mm/vma Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 21 of 30] split_huge_page paging Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 22 of 30] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2010-01-21 20:40   ` Christoph Lameter
2010-01-21 23:01     ` Andrea Arcangeli
2010-01-21 23:17       ` Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 23 of 30] clear_copy_huge_page Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 24 of 30] kvm mmu transparent hugepage support Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 25 of 30] transparent hugepage core Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 26 of 30] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 27 of 30] memcg compound Andrea Arcangeli
2010-01-21  7:07   ` KAMEZAWA Hiroyuki
2010-01-21 15:44     ` Andrea Arcangeli
2010-01-21 23:55       ` KAMEZAWA Hiroyuki
2010-01-21  6:20 ` [PATCH 28 of 30] memcg huge memory Andrea Arcangeli
2010-01-21  7:16   ` KAMEZAWA Hiroyuki
2010-01-21 16:08     ` Andrea Arcangeli
2010-01-22  0:13       ` KAMEZAWA Hiroyuki
2010-01-27 11:27         ` Balbir Singh
2010-01-28  0:50           ` Daisuke Nishimura
2010-01-28 11:39           ` Andrea Arcangeli
2010-01-28 12:23             ` Balbir Singh
2010-01-28 12:36               ` Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 29 of 30] transparent hugepage vmstat Andrea Arcangeli
2010-01-21  6:20 ` [PATCH 30 of 30] khugepaged Andrea Arcangeli
2010-01-22 14:46 ` [PATCH 00 of 30] Transparent Hugepage support #3 Christoph Lameter
2010-01-22 15:19   ` Andrea Arcangeli
2010-01-22 16:51     ` Christoph Lameter
2010-01-23 17:58       ` Andrea Arcangeli
2010-01-25 21:50         ` Christoph Lameter
2010-01-25 22:46           ` Andrea Arcangeli
2010-01-26 15:47             ` Christoph Lameter
2010-01-26 16:11               ` Andrea Arcangeli
2010-01-26 16:30                 ` Christoph Lameter
2010-01-26 16:45                   ` Andrea Arcangeli
2010-01-26 18:23                     ` Christoph Lameter
2010-01-26 17:09                   ` Avi Kivity
2010-01-26  0:52           ` Rik van Riel
2010-01-26  6:53             ` Gleb Natapov
2010-01-26 12:35               ` Andrea Arcangeli
2010-01-26 15:55                 ` Christoph Lameter
2010-01-26 16:19                   ` Andrea Arcangeli
2010-01-26 15:54             ` Christoph Lameter
2010-01-26 16:16               ` Andrea Arcangeli
2010-01-26 16:24                 ` Andi Kleen
2010-01-26 16:37                 ` Christoph Lameter
2010-01-26 16:42                 ` Mel Gorman
2010-01-26 16:52                   ` Andrea Arcangeli
2010-01-26 17:26                     ` Mel Gorman
2010-01-26 19:46                       ` Andrea Arcangeli
2010-01-26 23:07               ` Rik van Riel
2010-01-27 18:33                 ` Christoph Lameter
2010-01-26 11:24 ` Mel Gorman
