linux-mm.kvack.org archive mirror
* [PATCH 00 of 28] Transparent Hugepage support #2
@ 2009-12-17 19:00 Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 01 of 28] compound_lock Andrea Arcangeli
                   ` (29 more replies)
  0 siblings, 30 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

Hello,

This is an update on the status of the transparent hugepage patchset. Quite a
few changes happened in the last two weeks as I handled all the feedback
provided so far (notably from Avi, Andi, Nick and others) and continued working
through the original todo list.

On the "brainer" side, perhaps the most notable change worth review is an idea
suggested by Avi: during the hugepage split, userland can still access the
memory, and even write to it when it's not shared. This works because the split
happens in place and only mangles the kernel page-metadata structures, not the
page data. So I replaced the not-present bit with a _SPLITTING bit, using a
reserved pmd bit (one that is never used in the pmd; it was only used by two
other features in the pte). I renamed it "splitting" rather than "frozen" at
Andi's suggestion. collapse_huge_page will not be able to use the same
splitting bit, as collapse_huge_page does not work in place, so it has to at
least wrprotect the page during the copy.
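
To make the scheme concrete, here is a minimal sketch (not code from the
patchset; it reuses the pmd_trans_huge/pmd_trans_splitting/wait_split_huge_page
helpers introduced by the patches below) of how a page-table walker is expected
to react to the splitting bit:

-----------
	if (pmd_trans_huge(*pmd)) {
		if (unlikely(pmd_trans_splitting(*pmd)))
			/* split in progress: sleep until it completes */
			wait_split_huge_page(vma->anon_vma, pmd);
		else {
			/* safe to operate on the huge pmd directly */
		}
	}
-----------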

With madvise(MADV_HUGEPAGE) already functional (try running the program below
with and without the madvise call after "echo madvise >
/sys/kernel/mm/transparent_hugepage/enabled" to notice the speed difference),
the only notable missing piece that I'm still working on (on top of this
patchset) is the khugepaged daemon (and later, with lower priority, possibly
the removal of split_huge_page_mm from some places). In the meantime further
review of the patchset is very welcome.

-----------
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>

#define SIZE (3UL*1024*1024*1024)

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* advice value used by this patchset */
#endif

int main(void)
{
	char *p;
	if (posix_memalign((void **)&p, 4096, SIZE))
		perror("memalign"), exit(1);
	/* ask for transparent hugepages on the whole region */
	madvise(p, SIZE, MADV_HUGEPAGE);
	/* touching the memory triggers the (huge) page faults being timed */
	memset(p, 0, SIZE);

	return 0;
}
-----------
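
As a companion to the benchmark, here is a hedged sketch of a small helper
(not part of the patchset) that reads back the sysfs file named above, so each
run can record which transparent hugepage mode was active:

-----------
#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
	if (!f) {
		perror("transparent_hugepage/enabled");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		/* prints the mode string exposed by the patchset's sysfs file */
		printf("THP mode: %s", buf);
	fclose(f);
	return 0;
}
-----------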

I've also received reports that the last patchset doesn't boot on some huge
systems with EFI, so I recommend trying again with this latest patchset; if it
still doesn't boot, it's a good idea to try with transparent_hugepage=2 as a
boot parameter (which enables transparent hugepages only under madvise).

The updated quilt tree is here:

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.32-843b53823beb/transparent_hugepage-2/

Thanks,
Andrea


* [PATCH 01 of 28] compound_lock
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:46   ` Christoph Lameter
  2009-12-17 19:00 ` [PATCH 02 of 28] alter compound get_page/put_page Andrea Arcangeli
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/bit_spinlock.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -294,6 +295,16 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+	bit_spin_lock(PG_compound_lock, &page->flags);
+}
+
+static inline void compound_unlock(struct page *page)
+{
+	bit_spin_unlock(PG_compound_lock, &page->flags);
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,7 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+	PG_compound_lock,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
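
A minimal sketch (not part of the patch) of the pairing this lock provides:
both the splitting side and the tail-page release side in the next patch take
the head page's bit spinlock, so the refcount transfer done during the split
cannot race with a concurrent put_page() on a tail page:

-----------
	/* sketch: the split side runs under the head page's compound lock */
	compound_lock(page_head);
	/* ... redistribute tail-page refcounts, clear compound state ... */
	compound_unlock(page_head);
-----------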


* [PATCH 02 of 28] alter compound get_page/put_page
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 01 of 28] compound_lock Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:50   ` Christoph Lameter
  2009-12-17 19:00 ` [PATCH 03 of 28] clear compound mapping Andrea Arcangeli
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Alter compound get_page/put_page to keep references on subpages too, in order
to allow __split_huge_page_refcount to split a hugepage even while subpages
have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -43,6 +43,14 @@ static noinline int gup_pte_range(pmd_t 
 		page = pte_page(pte);
 		if (!page_cache_get_speculative(page))
 			return 0;
+		if (PageTail(page)) {
+			/*
+			 * __split_huge_page_refcount() cannot run
+			 * from under us.
+			 */
+			VM_BUG_ON(atomic_read(&page->_count) < 0);
+			atomic_inc(&page->_count);
+		}
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(page);
 			return 0;
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -128,6 +128,14 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page)) {
+			/*
+			 * __split_huge_page_refcount() cannot run
+			 * from under us.
+			 */
+			VM_BUG_ON(atomic_read(&page->_count) < 0);
+			atomic_inc(&page->_count);
+		}
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -319,9 +319,14 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+		/* __split_huge_page_refcount can't run under get_page */
+		VM_BUG_ON(!PageTail(page));
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -409,7 +409,8 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 1 << PG_compound_lock)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,17 +55,80 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_page(page);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock(page_head);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock(page_head);
+			out_put_head:
+				put_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+			else
+				compound_unlock(page_head);
+			return;
+		} else
+			/* page_head is a dangling pointer */
+			goto out_put_single;
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -74,7 +137,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 

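A hedged sketch (not part of the patch) of the invariant the new refcounting
establishes: pinning a tail page also pins its head, and releasing the pin
drops both counts again, so __split_huge_page_refcount can redistribute
references at any point in between:

-----------
	get_page(tail_page);	/* bumps tail->_count and head->_count */
	/* ... gup user works on the pinned subpage ... */
	put_page(tail_page);	/* put_compound_page() drops both again */
-----------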

* [PATCH 03 of 28] clear compound mapping
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 01 of 28] compound_lock Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 02 of 28] alter compound get_page/put_page Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 04 of 28] add native_set_pmd_at Andrea Arcangeli
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Clear the compound mapping for anonymous compound pages, as already happens
for regular anonymous pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -583,6 +583,8 @@ static void __free_pages_ok(struct page 
 
 	kmemcheck_free_shadow(page, order);
 
+	if (PageAnon(page))
+		page->mapping = NULL;
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
 	if (bad)


* [PATCH 04 of 28] add native_set_pmd_at
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 03 of 28] clear compound mapping Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 05 of 28] add pmd paravirt ops Andrea Arcangeli
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Used by both the paravirt and non-paravirt implementations of set_pmd_at.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -528,6 +528,12 @@ static inline void native_set_pte_at(str
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp , pmd_t pmd)
+{
+	native_set_pmd(pmdp, pmd);
+}
+
 #ifndef CONFIG_PARAVIRT
 /*
  * Rules for using pte_update - it must be called after any PTE update which


* [PATCH 05 of 28] add pmd paravirt ops
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 04 of 28] add native_set_pmd_at Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 06 of 28] no paravirt version of pmd ops Andrea Arcangeli
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add the paravirt ops pmd_update/pmd_update_defer/set_pmd_at. Not all of them
might be necessary (VMware needs pmd_update, Xen needs set_pmd_at, nobody needs
pmd_update_defer), but this keeps full symmetry with the pte paravirt ops,
which looks cleaner and simpler from a common-code POV.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -449,6 +449,11 @@ static inline void pte_update(struct mm_
 {
 	PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
 }
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
 
 static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
@@ -456,6 +461,12 @@ static inline void pte_update_defer(stru
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
 static inline pte_t __pte(pteval_t val)
 {
 	pteval_t ret;
@@ -557,6 +568,16 @@ static inline void set_pte_at(struct mm_
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp, pmd_t pmd)
+{
+	if (sizeof(pmdval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+	else
+		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -266,10 +266,16 @@ struct pv_mmu_ops {
 	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+	void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+	void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp);
+	void (*pmd_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pmd_t *pmdp);
 
 	pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
+	.set_pmd_at = native_set_pmd_at,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+	.pmd_update = paravirt_nop,
+	.pmd_update_defer = paravirt_nop,
 
 	.ptep_modify_prot_start = __ptep_modify_prot_start,
 	.ptep_modify_prot_commit = __ptep_modify_prot_commit,


* [PATCH 06 of 28] no paravirt version of pmd ops
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 05 of 28] add pmd paravirt ops Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 07 of 28] export maybe_mkwrite Andrea Arcangeli
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -33,6 +33,7 @@ extern struct list_head pgd_list;
 #else  /* !CONFIG_PARAVIRT */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 #define set_pte_at(mm, addr, ptep, pte)	native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd)	native_set_pmd_at(mm, addr, pmdp, pmd)
 
 #define set_pte_atomic(ptep, pte)					\
 	native_set_pte_atomic(ptep, pte)
@@ -57,6 +58,8 @@ extern struct list_head pgd_list;
 
 #define pte_update(mm, addr, ptep)              do { } while (0)
 #define pte_update_defer(mm, addr, ptep)        do { } while (0)
+#define pmd_update(mm, addr, ptep)              do { } while (0)
+#define pmd_update_defer(mm, addr, ptep)        do { } while (0)
 
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)


* [PATCH 07 of 28] export maybe_mkwrite
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 06 of 28] no paravirt version of pmd ops Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 08 of 28] comment reminder in destroy_compound_page Andrea Arcangeli
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, when it falls back to copying hugepages into
regular fragmented pages if hugepage allocation fails during COW.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -380,6 +380,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1943,19 +1943,6 @@ static inline int pte_unmap_same(struct 
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*


* [PATCH 08 of 28] comment reminder in destroy_compound_page
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 07 of 28] export maybe_mkwrite Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 09 of 28] config_transparent_hugepage Andrea Arcangeli
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add a comment to destroy_compound_page warning that __split_huge_page_refcount
depends heavily on its internal behavior.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -310,6 +310,7 @@ void prep_compound_page(struct page *pag
 	}
 }
 
+/* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;


* [PATCH 09 of 28] config_transparent_hugepage
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 08 of 28] comment reminder in destroy_compound_page Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 10 of 28] add pmd mangling functions to x86 Andrea Arcangeli
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add config option.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -282,3 +282,17 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config TRANSPARENT_HUGEPAGE
+	bool "Transparent Hugepage support" if EMBEDDED
+	depends on X86_64
+	default y
+	help
+	  Transparent Hugepages allows the kernel to use huge pages and
+	  huge TLB entries transparently for applications whenever
+	  possible.  This feature can improve the performance of certain
+	  applications by speeding up page faults during memory
+	  allocation, by reducing the number of TLB misses and by
+	  speeding up pagetable walks.
+
+	  If memory is constrained on an embedded system, you may want to say N.


* [PATCH 10 of 28] add pmd mangling functions to x86
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 09 of 28] config_transparent_hugepage Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18 18:56   ` Mel Gorman
  2009-12-17 19:00 ` [PATCH 11 of 28] add pmd mangling generic functions Andrea Arcangeli
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add the needed pmd mangling functions with symmetry to their pte counterparts.
pmdp_splitting_flush is the only exception present only on the pmd side: it is
needed to serialize the VM against split_huge_page. It atomically sets the
splitting bit, in the same way pmdp_clear_flush_young atomically clears the
accessed bit (and both need to flush the TLB for the change to take effect,
which must happen synchronously for pmdp_splitting_flush).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -95,11 +95,21 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
 static inline int pte_file(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_FILE;
@@ -150,6 +160,13 @@ static inline pte_t pte_set_flags(pte_t 
 	return native_make_pte(v | set);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
 static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 {
 	pteval_t v = native_pte_val(pte);
@@ -157,6 +174,13 @@ static inline pte_t pte_clear_flags(pte_
 	return native_make_pte(v & ~clear);
 }
 
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -167,11 +191,21 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkexec(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_NX);
@@ -182,16 +216,36 @@ static inline pte_t pte_mkdirty(pte_t pt
 	return pte_set_flags(pte, _PAGE_DIRTY);
 }
 
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
 static inline pte_t pte_mkyoung(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
 static inline pte_t pte_mkwrite(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_PSE);
@@ -320,6 +374,11 @@ static inline int pte_same(pte_t a, pte_
 	return a.pte == b.pte;
 }
 
+static inline int pmd_same(pmd_t a, pmd_t b)
+{
+	return a.pmd == b.pmd;
+}
+
 static inline int pte_present(pte_t a)
 {
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
@@ -351,7 +410,7 @@ static inline unsigned long pmd_page_vad
  * Currently stuck as a macro due to indirect forward reference to
  * linux/mmzone.h's __section_mem_map_addr() definition:
  */
-#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
 
 /*
  * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
@@ -372,6 +431,7 @@ static inline unsigned long pmd_index(un
  * to linux/mm.h:page_to_nid())
  */
 #define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
 /*
  * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE]
@@ -568,14 +628,21 @@ struct vm_area_struct;
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 				     unsigned long addr, pte_t *ptep);
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
@@ -586,6 +653,14 @@ static inline pte_t ptep_get_and_clear(s
 	return pte;
 }
 
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep,
@@ -612,6 +687,16 @@ static inline void ptep_set_wrprotect(st
 	pte_update(mm, addr, ptep);
 }
 
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+	pmd_update(mm, addr, pmdp);
+}
+
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -71,6 +71,18 @@ static inline pte_t native_ptep_get_and_
 	return ret;
 #endif
 }
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+#ifdef CONFIG_SMP
+	return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+	/* native_local_pmdp_get_and_clear,
+	   but duplicated because of cyclic dependency */
+	pmd_t ret = *xp;
+	native_pmd_clear(NULL, 0, xp);
+	return ret;
+#endif
+}
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -288,6 +288,23 @@ int ptep_set_access_flags(struct vm_area
 	return changed;
 }
 
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+	int changed = !pmd_same(*pmdp, entry);
+
+	VM_BUG_ON(address & ~HPAGE_MASK);
+
+	if (changed && dirty) {
+		*pmdp = entry;
+		pmd_update_defer(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	}
+
+	return changed;
+}
+
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *ptep)
 {
@@ -303,6 +320,21 @@ int ptep_test_and_clear_young(struct vm_
 	return ret;
 }
 
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp)
+{
+	int ret = 0;
+
+	if (pmd_young(*pmdp))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					 (unsigned long *) &pmdp->pmd);
+
+	if (ret)
+		pmd_update(vma->vm_mm, addr, pmdp);
+
+	return ret;
+}
+
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
@@ -315,6 +347,34 @@ int ptep_clear_flush_young(struct vm_are
 	return young;
 }
 
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_MASK);
+
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+
+	return young;
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp)
+{
+	int set;
+	VM_BUG_ON(address & ~HPAGE_MASK);
+	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
+				(unsigned long *)&pmdp->pmd);
+	if (set) {
+		pmd_update(vma->vm_mm, address, pmdp);
+		/* need tlb flush only to serialize against gup-fast */
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	}
+}
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve


* [PATCH 11 of 28] add pmd mangling generic functions
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 10 of 28] add pmd mangling functions to x86 Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 12 of 28] special pmd_trans_* functions Andrea Arcangeli
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Some of these are needed only to build, and are not actually used, on archs
that don't support transparent hugepages; others, like pmdp_clear_flush, are
used by x86 too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -23,6 +23,19 @@
 	}								  \
 	__changed;							  \
 })
+
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({								\
+		int __changed = !pmd_same(*(__pmdp), __entry);		\
+		VM_BUG_ON((__address) & ~HPAGE_MASK);			\
+		if (__changed) {					\
+			set_pmd_at((__vma)->vm_mm, __address, __pmdp,	\
+				   __entry);				\
+			flush_tlb_range(__vma, __address,		\
+					(__address) + HPAGE_SIZE);	\
+		}							\
+		__changed;						\
+	})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
@@ -37,6 +50,17 @@
 			   (__ptep), pte_mkold(__pte));			\
 	r;								\
 })
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	int r = 1;							\
+	if (!pmd_young(__pmd))						\
+		r = 0;							\
+	else								\
+		set_pmd_at((__vma)->vm_mm, (__address),			\
+			   (__pmdp), pmd_mkold(__pmd));			\
+	r;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -48,6 +72,16 @@
 		flush_tlb_page(__vma, __address);			\
 	__young;							\
 })
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	__young = pmdp_test_and_clear_young(__vma, __address, __pmdp);	\
+	if (__young)							\
+		flush_tlb_range(__vma, __address,			\
+				(__address) + HPAGE_SIZE);		\
+	__young;							\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -57,6 +91,13 @@
 	pte_clear((__mm), (__address), (__ptep));			\
 	__pte;								\
 })
+
+#define pmdp_get_and_clear(__mm, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	pmd_clear((__mm), (__address), (__pmdp));			\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
@@ -88,6 +129,15 @@ do {									\
 	flush_tlb_page(__vma, __address);				\
 	__pte;								\
 })
+
+#define pmdp_clear_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd;							\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	__pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp);	\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_SIZE);	\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -97,10 +147,26 @@ static inline void ptep_set_wrprotect(st
 	pte_t old_pte = *ptep;
 	set_pte_at(mm, address, ptep, pte_wrprotect(old_pte));
 }
+
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+
+#define pmdp_splitting_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = pmd_mksplitting(*(__pmdp));			\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd);		\
+	/* tlb flush only to serialize against gup-fast */		\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_SIZE);	\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)	(pte_val(A) == pte_val(B))
+#define pmd_same(A,B)	(pmd_val(A) == pmd_val(B))
 #endif
 
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY


* [PATCH 12 of 28] special pmd_trans_* functions
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 11 of 28] add pmd mangling generic functions Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 13 of 28] bail out gup_fast on freezed pmd Andrea Arcangeli
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, to allow
gcc to eliminate the transparent hugepage function calls at compile time
without additional #ifdefs (only the declarations of those functions have to
be visible to gcc; they won't be required at link time, so huge_memory.o need
not be built at all).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -394,6 +394,24 @@ static inline int pmd_present(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_PRESENT;
 }
 
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+#else
+	return 0;
+#endif
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_val(pmd) & _PAGE_PSE;
+#else
+	return 0;
+#endif
+}
+
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
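
A brief illustration (not from the patch; do_huge_pmd_fault is a hypothetical
callee used only for the example) of the compile-time elimination described
above: with CONFIG_TRANSPARENT_HUGEPAGE disabled, pmd_trans_huge() is a
constant 0, so gcc drops the whole branch and the callee never has to be
linked in:

-----------
	if (pmd_trans_huge(*pmd))
		/* dead code when the config option is off */
		return do_huge_pmd_fault(mm, vma, address, pmd, flags);
-----------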


* [PATCH 13 of 28] bail out gup_fast on freezed pmd
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 12 of 28] special pmd_trans_* functions Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18 18:59   ` Mel Gorman
  2009-12-17 19:00 ` [PATCH 14 of 28] pte alloc trans splitting Andrea Arcangeli
                   ` (16 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Force gup_fast to take the slow path and block if the pmd is being split
(pmd_trans_splitting), not only if it is none.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -156,7 +156,7 @@ static int gup_pmd_range(pud_t pud, unsi
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))


* [PATCH 14 of 28] pte alloc trans splitting
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 13 of 28] bail out gup_fast on freezed pmd Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18 19:03   ` Mel Gorman
  2009-12-17 19:00 ` [PATCH 15 of 28] add pmd mmu_notifier helpers Andrea Arcangeli
                   ` (15 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

pte alloc routines must wait for split_huge_page if the pmd is not
present and not null (i.e. pmd_trans_splitting). The additional
branches are optimized away at compile time by pmd_trans_splitting if
the config option is off. However we must pass the vma down in order
to know the anon_vma lock to wait for.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -948,7 +948,8 @@ static inline int __pmd_alloc(struct mm_
 int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address);
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 /*
@@ -1017,12 +1018,14 @@ static inline void pgtable_page_dtor(str
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc_map(mm, pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
-		NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address)				\
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma,	\
+							pmd, address))?	\
+	 NULL: pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL,	\
+							pmd, address))?	\
 		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -324,9 +324,11 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
+	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -346,14 +348,18 @@ int __pte_alloc(struct mm_struct *mm, pm
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	wait_split_huge_page = 0;
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	}
+	} else if (unlikely(pmd_trans_splitting(*pmd)))
+		wait_split_huge_page = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
+	if (wait_split_huge_page)
+		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -366,10 +372,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	}
+	} else
+		VM_BUG_ON(pmd_trans_splitting(*pmd));
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3020,7 +3027,7 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
 
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru
 	return pmd;
 }
 
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	if (!pmd)
 		return NULL;
 
-	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+	if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
 		return NULL;
 
 	return pmd;
@@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
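
The wait_split_huge_page() helper used above is defined later in the series; a
plausible sketch (an assumption, not the actual implementation) is to simply
take and release the anon_vma lock that split_huge_page holds for the whole
duration of the split:

-----------
	/* sketch only: block until any split under this anon_vma completes */
	spin_lock(&anon_vma->lock);
	spin_unlock(&anon_vma->lock);
-----------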


* [PATCH 15 of 28] add pmd mmu_notifier helpers
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 14 of 28] pte alloc trans splitting Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 16 of 28] clear page compound Andrea Arcangeli
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add mmu notifier helpers to handle pmd huge operations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
 	__pte;								\
 })
 
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_SIZE);	\
+	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_SIZE);	\
+	__pmd;								\
+})
+
+#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_SIZE);	\
+	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_SIZE);	\
+})
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
 	__young;							\
 })
 
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
 #define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
 ({									\
 	struct mm_struct *___mm = __mm;					\
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
 }
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */


* [PATCH 16 of 28] clear page compound
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 15 of 28] add pmd mmu_notifier helpers Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 17 of 28] add pmd_huge_pte to mm_struct Andrea Arcangeli
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page must transform a compound page to a regular page and needs
ClearPageCompound.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -347,7 +347,7 @@ static inline void set_page_writeback(st
  * tests can be used in performance sensitive paths. PageCompound is
  * generally not used in hot code paths.
  */
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
 
 static inline int PageCompound(struct page *page)
@@ -355,6 +355,13 @@ static inline int PageCompound(struct pa
 	return page->flags & ((1L << PG_head) | (1L << PG_tail));
 
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
+}
+#endif
 #else
 /*
  * Reduce page flag use as much as possible by overlapping
@@ -392,6 +399,14 @@ static inline void __ClearPageTail(struc
 	page->flags &= ~PG_head_tail_mask;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON((page->flags & PG_head_tail_mask) != (1L << PG_compound));
+	clear_bit(PG_compound, &page->flags);
+}
+#endif
+
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_MMU


* [PATCH 17 of 28] add pmd_huge_pte to mm_struct
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 16 of 28] clear page compound Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 18 of 28] ensure mapcount is taken on head pages Andrea Arcangeli
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

This increases the size of the mm struct a bit, but it is needed to
preallocate one pte for each hugepage so that split_huge_page will not
require a fail path. Guaranteed success is a fundamental property of
split_huge_page: it avoids decreasing swapping reliability and it avoids
adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
VM code to learn how to roll back in the middle of its pte mangling
operations (if anything, we want it to learn to handle pmd_trans_huge
natively rather than to become capable of rollback). When split_huge_page
runs, a pte page is needed for the split to succeed, to map the newly
split regular pages with regular ptes. This way all existing VM code
remains backwards compatible by just adding a split_huge_page* one-liner.
The memory waste of those preallocated ptes is negligible and so it is
worth it.
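
To illustrate the contract (a sketch only, not code from this patch:
thp_fault_prepare/thp_split_populate are made-up names, and the
deposit/withdraw helpers prepare_pmd_huge_pte/get_pmd_huge_pte come from
the "transparent hugepage core" patch later in the series), the fault
path is the only place allowed to fail, while the split path only
consumes the deposit:

static int thp_fault_prepare(struct mm_struct *mm, unsigned long address)
{
	/* allocate the pte page up front, where -ENOMEM is still easy */
	pgtable_t pgtable = pte_alloc_one(mm, address);

	if (!pgtable)
		return -ENOMEM;
	spin_lock(&mm->page_table_lock);
	prepare_pmd_huge_pte(pgtable, mm);	/* deposit in mm->pmd_huge_pte */
	spin_unlock(&mm->page_table_lock);
	return 0;
}

static void thp_split_populate(struct mm_struct *mm, pmd_t *pmd)
{
	pgtable_t pgtable;

	spin_lock(&mm->page_table_lock);
	pgtable = get_pmd_huge_pte(mm);	/* withdraw: guaranteed non-NULL */
	pmd_populate(mm, pmd, pgtable);	/* map the 512 regular ptes */
	spin_unlock(&mm->page_table_lock);
}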

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,6 +287,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -498,6 +498,9 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON(mm->pmd_huge_pte);
+#endif
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -638,6 +641,10 @@ struct mm_struct *dup_mm(struct task_str
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	mm->pmd_huge_pte = NULL;
+#endif
+
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 


* [PATCH 18 of 28] ensure mapcount is taken on head pages
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 17 of 28] add pmd_huge_pte to mm_struct Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 19 of 28] split_huge_page_mm/vma Andrea Arcangeli
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Unlike the page count, the page mapcount cannot be taken on PageTail
compound pages; it must always be taken on the head page.
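
For illustration only (this helper is hypothetical, not in the patch):
code that holds a possibly-tail page and wants rmap accounting has to
resolve it to the head page first, since _mapcount only lives there.

static void page_dup_rmap_head(struct page *page)
{
	struct page *head = compound_head(page);	/* tail -> head */

	VM_BUG_ON(PageTail(head));
	page_dup_rmap(head);	/* the patched helper BUGs on tail pages */
}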

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -105,6 +105,7 @@ void page_remove_rmap(struct page *);
 
 static inline void page_dup_rmap(struct page *page)
 {
+	VM_BUG_ON(PageTail(page));
 	atomic_inc(&page->_mapcount);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -733,6 +733,7 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
+	VM_BUG_ON(PageTail(page));
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
@@ -1281,6 +1282,7 @@ static int rmap_walk_file(struct page *p
 int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
 {
+	VM_BUG_ON(PageTail(page));
 	VM_BUG_ON(!PageLocked(page));
 
 	if (unlikely(PageKsm(page)))


* [PATCH 19 of 28] split_huge_page_mm/vma
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 18 of 28] ensure mapcount is taken on head pages Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 20 of 28] split_huge_page paging Andrea Arcangeli
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page_mm/vma compat code. Each one of these call sites would
need to be expanded into hundreds of lines of complex code without a
fully reliable split_huge_page_mm/vma functionality.
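
The pattern is the same in every hunk below: one line right before
pmd_none_or_clear_bad() so the walker only ever sees regular ptes. A
generic pmd-range walker would look roughly like this (sketch mirroring
the hunks; the function name is made up):

static void walk_pmd_range_sketch(struct mm_struct *mm, pud_t *pud,
				  unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		/* the compat one-liner: demote a huge pmd to regular ptes */
		split_huge_page_mm(mm, addr, pmd);
		if (pmd_none_or_clear_bad(pmd))
			continue;
		/* ... walk the pte level exactly as before ... */
	} while (pmd++, addr = next, addr != end);
}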

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
+	split_huge_page_mm(mm, 0xA0000, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,6 +446,7 @@ static inline int check_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_vma(vma, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -132,6 +132,7 @@ static long do_mincore(unsigned long add
 	if (pud_none_or_clear_bad(pud))
 		goto none_mapped;
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_vma(vma, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto none_mapped;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -89,6 +89,7 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,6 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_mm(mm, addr, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(walk->mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);


* [PATCH 20 of 28] split_huge_page paging
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 19 of 28] split_huge_page_mm/vma Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 21 of 28] pmd_trans_huge migrate bugcheck Andrea Arcangeli
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Paging logic that splits the page before it is unmapped and added to
swap, to ensure backwards compatibility with the legacy swap code.
Eventually swap should natively page out hugepages to increase
performance and decrease seeking and fragmentation of swap space.
swapoff can just skip over huge pmds as they cannot be part of swap yet.
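
Both hunks below follow the same pattern (excerpt-style sketch): split
first, and if the split cannot be done right now report the page as
"try again later" (SWAP_AGAIN in try_to_unmap, 0 in add_to_swap) rather
than erroring out:

	if (unlikely(PageCompound(page))) {
		if (unlikely(split_huge_page(page)))
			return SWAP_AGAIN;	/* or 0 in add_to_swap() */
		/* now a regular 4k page, the legacy swap code can proceed */
	}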

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1176,6 +1176,10 @@ int try_to_unmap(struct page *page, enum
 
 	BUG_ON(!PageLocked(page));
 
+	if (unlikely(PageCompound(page)))
+		if (unlikely(split_huge_page(page)))
+			return SWAP_AGAIN;
+
 	if (unlikely(PageKsm(page)))
 		ret = try_to_unmap_ksm(page, flags);
 	else if (PageAnon(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -152,6 +152,10 @@ int add_to_swap(struct page *page)
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageUptodate(page));
 
+	if (unlikely(PageCompound(page)))
+		if (unlikely(split_huge_page(page)))
+			return 0;
+
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -905,6 +905,8 @@ static inline int unuse_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (unlikely(pmd_trans_huge(*pmd)))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);


* [PATCH 21 of 28] pmd_trans_huge migrate bugcheck
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 20 of 28] split_huge_page paging Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 22 of 28] clear_huge_page fix Andrea Arcangeli
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in migration pte areas, because
try_to_unmap will split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -99,6 +99,7 @@ static int remove_migration_pte(struct p
 		goto out;
 
 	pmd = pmd_offset(pud, addr);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (!pmd_present(*pmd))
 		goto out;
 


* [PATCH 22 of 28] clear_huge_page fix
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 21 of 28] pmd_trans_huge migrate bugcheck Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18 19:16   ` Mel Gorman
  2009-12-17 19:00 ` [PATCH 23 of 28] clear_copy_huge_page Andrea Arcangeli
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

sz is in bytes, MAX_ORDER_NR_PAGES is in pages.
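
A quick check of the units with example numbers (assuming 4k base pages
and the default x86 MAX_ORDER of 11, so MAX_ORDER_NR_PAGES is 1024):

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;		/* assumed */
	const unsigned long max_order_nr_pages = 1024;	/* assumed */
	const unsigned long sz = 2UL * 1024 * 1024;	/* one 2M hugepage */

	/* old test: bytes vs pages, 2097152 > 1024 is always true */
	printf("old: %d\n", sz > max_order_nr_pages);
	/* new test: pages vs pages, 512 > 1024 is false as it should be */
	printf("new: %d\n", sz / page_size > max_order_nr_pages);
	return 0;
}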

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -402,7 +402,7 @@ static void clear_huge_page(struct page 
 {
 	int i;
 
-	if (unlikely(sz > MAX_ORDER_NR_PAGES)) {
+	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, sz);
 		return;
 	}


* [PATCH 23 of 28] clear_copy_huge_page
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 22 of 28] clear_huge_page fix Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 24 of 28] kvm mmu transparent hugepage support Andrea Arcangeli
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Move the copy/clear_huge_page functions to common code so they can be
shared between hugetlb.c and huge_memory.c.
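
After the move the two callers only differ in how they compute the new
pages_per_huge_page argument (taken from the call sites in this series;
HPAGE_NR is introduced by the "transparent hugepage core" patch):

	/* hugetlb.c: the size comes from the hstate of the vma */
	copy_huge_page(new_page, old_page, address, vma,
		       pages_per_huge_page(h));

	/* huge_memory.c: the size is the fixed pmd-sized hugepage */
	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);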

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1375,5 +1375,14 @@ extern void shake_page(struct page *p, i
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_huge_page(struct page *page,
+			    unsigned long addr,
+			    unsigned int pages_per_huge_page);
+extern void copy_huge_page(struct page *dst, struct page *src,
+			   unsigned long addr, struct vm_area_struct *vma,
+			   unsigned int pages_per_huge_page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,70 +385,6 @@ static int vma_has_reserves(struct vm_ar
 	return 0;
 }
 
-static void clear_gigantic_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-	struct page *p = page;
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) {
-		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
-	}
-}
-static void clear_huge_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-
-	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, sz);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++) {
-		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
-	}
-}
-
-static void copy_gigantic_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); ) {
-		cond_resched();
-		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-static void copy_huge_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-
-	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		cond_resched();
-		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
-	}
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2334,7 +2270,8 @@ retry_avoidcopy:
 		return -PTR_ERR(new_page);
 	}
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_huge_page(new_page, old_page, address, vma,
+		       pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3396,3 +3396,73 @@ void might_fault(void)
 }
 EXPORT_SYMBOL(might_fault);
 #endif
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+static void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *p = page;
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page;
+	     i++, p = mem_map_next(p, page, i)) {
+		cond_resched();
+		clear_user_highpage(p, addr + i * PAGE_SIZE);
+	}
+}
+void clear_huge_page(struct page *page,
+		     unsigned long addr, unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		clear_gigantic_page(page, addr, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+	}
+}
+
+static void copy_gigantic_page(struct page *dst, struct page *src,
+			       unsigned long addr,
+			       struct vm_area_struct *vma,
+			       unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; ) {
+		cond_resched();
+		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+void copy_huge_page(struct page *dst, struct page *src,
+		    unsigned long addr, struct vm_area_struct *vma,
+		    unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src, addr, vma, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE,
+				   vma);
+	}
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */


* [PATCH 24 of 28] kvm mmu transparent hugepage support
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 23 of 28] clear_copy_huge_page Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 25 of 28] transparent hugepage core Andrea Arcangeli
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Marcelo Tosatti <mtosatti@redhat.com>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -489,6 +489,15 @@ static int host_mapping_level(struct kvm
 out:
 	up_read(&current->mm->mmap_sem);
 
+	/* check for transparent hugepages */
+	if (page_size == PAGE_SIZE) {
+		struct page *page = gfn_to_page(kvm, gfn);
+
+		if (!is_error_page(page) && PageHead(page))
+			page_size = KVM_HPAGE_SIZE(2);
+		kvm_release_page_clean(page);
+	}
+
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
 		if (page_size >= KVM_HPAGE_SIZE(i))


* [PATCH 25 of 28] transparent hugepage core
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 24 of 28] kvm mmu transparent hugepage support Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18 20:03   ` Mel Gorman
  2010-01-04  6:16   ` Daisuke Nishimura
  2009-12-17 19:00 ` [PATCH 26 of 28] madvise(MADV_HUGEPAGE) Andrea Arcangeli
                   ` (4 subsequent siblings)
  29 siblings, 2 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated onto hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
   kernel daemon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes
   non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible with a virtualization
   hypervisor usage like RHEV-H that runs an unknown number of virtual machines
   with an unknown size of each virtual machine with an unknown amount of
   pagecache that could be potentially useful in the host for guest not using
   O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor
and the linux guest use this patch (though the guest will limit the
additional speedup to anonymous regions only for now...).  Even more
important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).

The most important design choice is: always fall back to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
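
Condensed from do_huge_anonymous_page() further down in this patch (a
sketch, not extra code; the defrag gfp tweak is omitted): the huge fault
path simply degrades to the regular 4k fault path whenever the
high-order allocation fails.

	page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_NOWARN,
			   HPAGE_ORDER);
	if (likely(page))
		return __do_huge_anonymous_page(mm, vma, address, pmd,
						page, haddr);
	/* fallback: no hugepage right now, keep going with regular ptes */
	pte = pte_alloc_map(mm, vma, pmd, address);
	if (!pte)
		return VM_FAULT_OOM;
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);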

Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page would return -ENOMEM (instead of the current void)
we'd need to roll back the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separate
and incremental and it'll just be a "harmless" addition later if this
initial part is agreed upon. It should also be noted that, locking-wise,
replacing regular pages with hugepages is going to be very easy
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) and we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similar to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never" which
mean respectively that hugepages are always used, or only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".
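
For programs that prefer C over echoing into the files, flipping the
two knobs described above is just a write of the keyword into the sysfs
paths named in this patch (illustrative helper only, not part of the
patch; error handling kept minimal):

#include <stdio.h>

static int set_thp(const char *knob, const char *mode)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/transparent_hugepage/%s", knob);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(mode, f);
	return fclose(f);
}

int main(void)
{
	/* e.g. restrict hugepages (and defrag) to MADV_HUGEPAGE regions */
	if (set_thp("enabled", "madvise") || set_thp("defrag", "madvise"))
		perror("transparent_hugepage");
	return 0;
}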

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is currently no way to serialize the
O_DIRECT final put_page against split_huge_page_refcount, so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by
the time gup_fast returns...). And I didn't want to impact all
gup/gup_fast users for now, maybe if we change the gup interface
substantially we can avoid this locking, I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, which didn't have a range check, but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.

NOTE: in some cases, if the L2 cache is small, this may cause a slowdown
and waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch copy_huge_page (and clear_huge_page
too) to non-temporal stores. I also extensively researched ways to avoid
this cache thrashing with a full prefault logic that would cow in
8k/16k/32k/64k up to 1M chunks (I can send those patches that fully
implemented prefault) but I concluded they're not worth it: they add huge
additional complexity and they remove all tlb benefits until the full
hugepage has been faulted in, to save a little bit of memory and some
cache during app startup, but they still don't substantially improve the
cache thrashing during startup if the prefault happens in >4k chunks. One
reason is that those copied 4k pte entries are still mapped on a perfectly
cache-colored hugepage, so the thrashing is the worst one can generate in
those copies (cow of 4k page copies aren't so well colored so they thrash
less, but again this results in software running faster after the page
fault). Those prefault patches allowed things like a pte where the
post-cow pages were local 4k regular anon pages and the not-yet-cowed pte
entries were pointing into the middle of some hugepage mapped read-only.
If it doesn't pay off substantially with today's hardware it will pay off
even less in the future with larger l2 caches, and the prefault logic
would bloat the VM a lot. On embedded systems transparent_hugepage can be
disabled through sysfs or with the boot command line parameter
transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages to
madvise regions), which will ensure not a single hugepage is allocated at
boot time. It is simple enough to just disable transparent hugepages
globally and let them be allocated selectively by applications in
MADV_HUGEPAGE regions (both at page fault time and, if enabled, with
collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevent mixing different page sizes in
the same regions, so they will not fit in this framework, which requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot
realistically be expected to remain unfragmented after a certain system
uptime and that would be very expensive to defragment with relocation, so
requiring reservation. hugetlbfs is the "reservation way", the point of
transparent hugepages is not to have any reservation at all and to
maximize the use of cache and hugepages at all times automatically.

Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,110 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_anonymous_page(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_pmd_trans_huge(struct mmu_gather *tlb,
+			      struct vm_area_struct *vma,
+			      pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+#define transparent_hugepage_enabled(__vma)				\
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
+	 (transparent_hugepage_flags &				\
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&		\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma)			       \
+	(transparent_hugepage_flags &				       \
+	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) ||		       \
+	 (transparent_hugepage_flags &				       \
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&	       \
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow()				\
+	(transparent_hugepage_flags &					\
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+				 pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_mm(__mm, __addr, __pmd);	\
+	}  while (0)
+#define split_huge_page_vma(__vma, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_vma(__vma, __pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		smp_mb();						\
+		spin_unlock_wait(&(__anon_vma)->lock);			\
+		smp_mb();						\
+		VM_BUG_ON(pmd_trans_splitting(*(__pmd)) ||		\
+			  pmd_trans_huge(*(__pmd)));			\
+	} while (0)
+#define HPAGE_ORDER (HPAGE_SHIFT-PAGE_SHIFT)
+#define HPAGE_NR (1<<HPAGE_ORDER)
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd)	\
+	do { }  while (0)
+#define split_huge_page_vma(__vma, __pmd)	\
+	do { }  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -234,6 +234,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,792 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag enabled,
+				enum transparent_hugepage_flag req_madv)
+{
+	if (test_bit(enabled, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(req_madv, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag enabled,
+				 enum transparent_hugepage_flag req_madv)
+{
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		set_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+static ssize_t defrag_show(struct kobject *kobj,
+			   struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+	__ATTR(defrag, 0644, defrag_show, defrag_store);
+
+static ssize_t single_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag flag)
+{
+	if (test_bit(flag, &transparent_hugepage_flags))
+		return sprintf(buf, "[yes] no\n");
+	else
+		return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag flag)
+{
+	if (!memcmp("yes", buf,
+		    min(sizeof("yes")-1, count))) {
+		set_bit(flag, &transparent_hugepage_flags);
+	} else if (!memcmp("no", buf,
+			   min(sizeof("no")-1, count))) {
+		clear_bit(flag, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t debug_cow_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+	&enabled_attr.attr,
+	&defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&debug_cow_attr.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+	.attrs = hugepage_attr,
+	.name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init hugepage_init(void)
+{
+#ifdef CONFIG_SYSFS
+	int err;
+
+	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	if (err)
+		printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+	return 0;
+}
+module_init(hugepage_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+	if (!str)
+		return 0;
+	transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_anonymous_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
+				    unsigned long address, pmd_t *pmd,
+				    struct page *page,
+				    unsigned long haddr)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, address);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr, HPAGE_NR);
+
+	__SetPageUptodate(page);
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+	}
+	spin_unlock(&mm->page_table_lock);
+	
+	return ret;
+}
+
+int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+				   (transparent_hugepage_defrag(vma) ?
+				    __GFP_REPEAT : 0)|__GFP_NOWARN,
+				   HPAGE_ORDER);
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_anonymous_page(mm, vma,
+						address, pmd,
+						page, haddr);
+	}
+out:
+	pte = pte_alloc_map(mm, vma, pmd, address);
+	if (!pte)
+		return VM_FAULT_OOM;
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd)))
+		goto out_unlock;
+	if (unlikely(pmd_trans_splitting(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_pgtable(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, anon_rss, HPAGE_NR);
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL; /* debug */
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0, i;
+	struct page *page, *new_page;
+	unsigned long haddr;
+	struct page **pages;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_pgtable(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+			       (transparent_hugepage_defrag(vma) ?
+				__GFP_REPEAT : 0)|__GFP_NOWARN,
+			       HPAGE_ORDER);
+	if (transparent_hugepage_debug_cow() && new_page) {
+		put_page(new_page);
+		new_page = NULL;
+	}
+	if (unlikely(!new_page)) {
+		pgtable_t pgtable;
+		pmd_t _pmd;
+
+		pages = kzalloc(sizeof(struct page *) * HPAGE_NR,
+				GFP_KERNEL);
+		if (unlikely(!pages)) {
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+		
+		for (i = 0; i < HPAGE_NR; i++) {
+			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+						  vma, address);
+			if (unlikely(!pages[i])) {
+				while (--i >= 0)
+					put_page(pages[i]);
+				kfree(pages);
+				ret |= VM_FAULT_OOM;
+				goto out;
+			}
+		}
+
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd)))
+			goto out_free_pages;
+		else
+			get_page(page);
+		spin_unlock(&mm->page_table_lock);
+
+		might_sleep();
+		for (i = 0; i < HPAGE_NR; i++) {
+			copy_user_highpage(pages[i], page + i,
+					   haddr + PAGE_SIZE*i, vma);
+			__SetPageUptodate(pages[i]);
+			cond_resched();
+		}
+
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd)))
+			goto out_free_pages;
+		else
+			put_page(page);
+
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		/* leave pmd empty until pte is filled */
+
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0; i < HPAGE_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			entry = mk_pte(pages[i], vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			page_add_new_anon_rmap(pages[i], vma, haddr);
+			pte = pte_offset_map(&_pmd, haddr);
+			VM_BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+		kfree(pages);
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pgtable);
+		spin_unlock(&mm->page_table_lock);
+
+		ret |= VM_FAULT_WRITE;
+		page_remove_rmap(page);
+		put_page(page);
+		goto out;
+	}
+
+	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);
+	__SetPageUptodate(new_page);
+
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+
+out_free_pages:
+	for (i = 0; i < HPAGE_NR; i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out_unlock;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_pgtable(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * And if the dirty bit will become meaningful and
+		 * we'll only set it with FOLL_WRITE, an atomic
+		 * set_bit will be required on the pmd to set the
+		 * young bit, instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
+int zap_pmd_trans_huge(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		       pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pfn_to_page(pmd_pfn(*pmd));
+			VM_BUG_ON(!PageCompound(page));
+			pmd_clear(pmd);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			spin_unlock(&tlb->mm->page_table_lock);
+			add_mm_counter(tlb->mm, anon_rss, -HPAGE_NR);
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+			      struct mm_struct *mm,
+			      unsigned long address,
+			      enum page_check_address_pmd_flag flag)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+		  pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd) && pmd_pgtable(*pmd) == page) {
+		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+			  !pmd_trans_splitting(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+				       struct vm_area_struct *vma,
+				       unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to split it, pmd_huge must remain on at all times.
+		 */
+		pmdp_splitting_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+
+	compound_lock(page);
+
+	for (i = 1; i < HPAGE_NR; i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+		page_tail->index = ++head_index;
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		if (page_evictable(page_tail, NULL))
+			lru_cache_add_lru(page_tail, LRU_ACTIVE_ANON);
+		else
+			add_page_to_unevictable_list(page_tail);
+		put_page(page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < HPAGE_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pgtable);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+/* must be called with the anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct vm_area_struct *vma;
+
+	BUG_ON(!PageHead(page));
+
+	mapcount = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+	BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct page *page;
+	struct anon_vma *anon_vma;
+	struct mm_struct *mm;
+
+	BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+	mm = vma->vm_mm;
+	BUG_ON(down_write_trylock(&mm->mmap_sem));
+
+	anon_vma = vma->anon_vma;
+
+	spin_lock(&anon_vma->lock);
+	BUG_ON(pmd_trans_splitting(*pmd));
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&anon_vma->lock);
+		return;
+	}
+	page = pmd_pgtable(*pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	__split_huge_page(page, anon_vma);
+
+	spin_unlock(&anon_vma->lock);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_mm(struct mm_struct *mm,
+			  unsigned long address,
+			  pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	BUG_ON(vma->vm_start > address);
+	BUG_ON(vma->vm_mm != mm);
+
+	__split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -647,9 +647,9 @@ out_set_pte:
 	return 0;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			if (next - addr != HPAGE_SIZE)
+				split_huge_page_vma(vma, pmd);
+			else if (zap_pmd_trans_huge(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct 
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
@@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
+	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		if (!vma->vm_ops)
+			return do_huge_anonymous_page(mm, vma, address,
+						      pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_splitting(orig_pmd))
+				return do_huge_wp_page(mm, vma, address,
+						       pmd, orig_pmd);
+			return 0;
+		}
+	}
 	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
@@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -380,9 +363,43 @@ int page_referenced_one(struct page *pag
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageCompound(page))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+		if (pmd && !pmd_trans_splitting(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+		VM_BUG_ON(1);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 
 	if (referenced)
 		*vm_flags |= vma->vm_flags;


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 26 of 28] madvise(MADV_HUGEPAGE)
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 25 of 28] transparent hugepage core Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-17 19:00 ` [PATCH 27 of 28] memcg compound Andrea Arcangeli
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add madvise MADV_HUGEPAGE to mark regions that are important to be hugepage
backed. Return -EINVAL if the vma is not of an anonymous type, or the feature
isn't built into the kernel. Never silently return success.
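
For illustration (not part of the patch itself), a minimal user-space check of the
return-value contract described above; MADV_HUGEPAGE is assumed to be 14, matching
the mman-common.h hunk below:

#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif

int main(void)
{
	size_t len = 16UL * 1024 * 1024;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* anonymous vma: should succeed with this patch applied,
	 * and fail with EINVAL otherwise */
	if (madvise(p, len, MADV_HUGEPAGE))
		printf("MADV_HUGEPAGE: %s\n", strerror(errno));
	else
		printf("MADV_HUGEPAGE accepted\n");
	memset(p, 0, len);
	return 0;
}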

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,6 +45,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -93,6 +93,7 @@ extern pmd_t *page_check_address_pmd(str
 				     unsigned long address,
 				     enum page_check_address_pmd_flag flag);
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+extern int hugepage_madvise(unsigned long *vm_flags);
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define transparent_hugepage_flags 0UL
 static inline int split_huge_page(struct page *page)
@@ -105,6 +106,11 @@ static inline int split_huge_page(struct
 	do { }  while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
+static inline int hugepage_madvise(unsigned long *vm_flags)
+{
+	BUG();
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -106,6 +106,9 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
+#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -790,3 +790,19 @@ out_unlock:
 out:
 	return ret;
 }
+
+int hugepage_madvise(unsigned long *vm_flags)
+{
+	/*
+	 * Be somewhat over-protective like KSM for now!
+	 */
+	if (*vm_flags & (VM_HUGEPAGE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP   | VM_IO      | VM_DONTEXPAND |
+			 VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
+			 VM_MIXEDMAP | VM_SAO))
+		return -EINVAL;
+
+	*vm_flags |= VM_HUGEPAGE;
+
+	return 0;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
 		if (error)
 			goto out;
 		break;
+	case MADV_HUGEPAGE:
+		error = hugepage_madvise(&new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+#endif
 		return 1;
 
 	default:


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 27 of 28] memcg compound
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 26 of 28] madvise(MADV_HUGEPAGE) Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18  1:27   ` KAMEZAWA Hiroyuki
  2009-12-17 19:00 ` [PATCH 28 of 28] memcg huge memory Andrea Arcangeli
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Teach memcg to charge/uncharge compound pages.
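
For scale (not part of the patch; assuming 2 MB huge pages, i.e. order-9 compound
pages as on x86-64), a mapped huge page is now accounted in one go instead of being
skipped by mem_cgroup_newpage_charge():

	page_size = PAGE_SIZE << compound_order(page);	/* 4096 << 9 = 2 MiB */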

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1288,15 +1288,20 @@ static atomic_t memcg_drain_count;
  * cgroup which is not current target, returns false. This stock will be
  * refilled.
  */
-static bool consume_stock(struct mem_cgroup *mem)
+static bool consume_stock(struct mem_cgroup *mem, int *page_size)
 {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (mem == stock->cached && stock->charge)
-		stock->charge -= PAGE_SIZE;
-	else /* need to call res_counter_charge */
+	if (mem == stock->cached && stock->charge) {
+		if (*page_size > stock->charge) {
+			*page_size -= stock->charge;
+			stock->charge = 0;
+			ret = false;
+		} else
+			stock->charge -= *page_size;
+	} else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
 	return ret;
@@ -1401,13 +1406,13 @@ static int __cpuinit memcg_stock_cpu_cal
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg,
-			bool oom, struct page *page)
+				   gfp_t gfp_mask, struct mem_cgroup **memcg,
+				   bool oom, struct page *page, int page_size)
 {
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct res_counter *fail_res;
-	int csize = CHARGE_SIZE;
+	int csize = max(page_size, (int) CHARGE_SIZE);
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -1439,7 +1444,7 @@ static int __mem_cgroup_try_charge(struc
 		int ret = 0;
 		unsigned long flags = 0;
 
-		if (consume_stock(mem))
+		if (consume_stock(mem, &page_size))
 			goto charged;
 
 		ret = res_counter_charge(&mem->res, csize, &fail_res);
@@ -1460,8 +1465,8 @@ static int __mem_cgroup_try_charge(struc
 									res);
 
 		/* reduce request size and retry */
-		if (csize > PAGE_SIZE) {
-			csize = PAGE_SIZE;
+		if (csize > page_size) {
+			csize = page_size;
 			continue;
 		}
 		if (!(gfp_mask & __GFP_WAIT))
@@ -1491,8 +1496,8 @@ static int __mem_cgroup_try_charge(struc
 			goto nomem;
 		}
 	}
-	if (csize > PAGE_SIZE)
-		refill_stock(mem, csize - PAGE_SIZE);
+	if (csize > page_size)
+		refill_stock(mem, csize - page_size);
 charged:
 	/*
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1512,12 +1517,12 @@ nomem:
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
  */
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
 {
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge(&mem->res, page_size);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge(&mem->memsw, page_size);
 	}
 	css_put(&mem->css);
 }
@@ -1575,8 +1580,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  */
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				     struct page_cgroup *pc,
-				     enum charge_type ctype)
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
@@ -1585,7 +1591,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem);
+		mem_cgroup_cancel_charge(mem, page_size);
 		return;
 	}
 
@@ -1722,7 +1728,8 @@ static int mem_cgroup_move_parent(struct
 		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
+		PAGE_SIZE);
 	if (ret || !parent)
 		goto put_back;
 
@@ -1730,7 +1737,7 @@ static int mem_cgroup_move_parent(struct
 	if (!ret)
 		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
 	else
-		mem_cgroup_cancel_charge(parent);	/* does css_put */
+		mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
 put_back:
 	putback_lru_page(page);
 put:
@@ -1752,6 +1759,11 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	int ret;
+	int page_size = PAGE_SIZE;
+
+	VM_BUG_ON(PageTail(page));
+	if (PageHead(page))
+		page_size <<= compound_order(page);
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -1760,11 +1772,12 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
+				      page_size);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
 	return 0;
 }
 
@@ -1773,8 +1786,6 @@ int mem_cgroup_newpage_charge(struct pag
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -1787,7 +1798,7 @@ int mem_cgroup_newpage_charge(struct pag
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 static void
@@ -1880,14 +1891,14 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
 	/* drop extra refcnt from tryget */
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
 }
 
 static void
@@ -1903,7 +1914,7 @@ __mem_cgroup_commit_charge_swapin(struct
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, ctype);
+	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -1952,11 +1963,12 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem);
+	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
 }
 
 static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+	      int page_size)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
@@ -1989,14 +2001,14 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += PAGE_SIZE;
+	batch->bytes += page_size;
 	if (uncharge_memsw)
-		batch->memsw_bytes += PAGE_SIZE;
+		batch->memsw_bytes += page_size;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, page_size);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, page_size);
 	return;
 }
 
@@ -2009,6 +2021,11 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	int page_size = PAGE_SIZE;
+
+	VM_BUG_ON(PageTail(page));
+	if (PageHead(page))
+		page_size <<= compound_order(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2016,6 +2033,8 @@ __mem_cgroup_uncharge_common(struct page
 	if (PageSwapCache(page))
 		return NULL;
 
+	VM_BUG_ON(PageTail(page));
+
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2048,7 +2067,7 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype);
+		__do_uncharge(mem, ctype, page_size);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
 	mem_cgroup_charge_statistics(mem, pc, false);
@@ -2217,7 +2236,7 @@ int mem_cgroup_prepare_migration(struct 
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
-						page);
+					      page, PAGE_SIZE);
 		css_put(&mem->css);
 	}
 	*ptr = mem;
@@ -2260,7 +2279,7 @@ void mem_cgroup_end_migration(struct mem
 	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
 	 * So, double-counting is effectively avoided.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
 
 	/*
 	 * Both of oldpage and newpage are still under lock_page().


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 28 of 28] memcg huge memory
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 27 of 28] memcg compound Andrea Arcangeli
@ 2009-12-17 19:00 ` Andrea Arcangeli
  2009-12-18  1:33   ` KAMEZAWA Hiroyuki
  2009-12-24 10:00   ` Balbir Singh
  2009-12-17 19:54 ` [PATCH 00 of 28] Transparent Hugepage support #2 Christoph Lameter
  2009-12-18 18:47 ` Dave Hansen
  29 siblings, 2 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-17 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright

From: Andrea Arcangeli <aarcange@redhat.com>

Add memcg charge/uncharge to hugepage faults in huge_memory.c.
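
The rule the hunks below follow (a summary sketch, not code added by the patch;
"later_step_failed" is a hypothetical placeholder): once mem_cgroup_newpage_charge()
has succeeded for a freshly allocated huge page, every failure path that drops the
page must uncharge it first:

	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
		put_page(page);			/* charge failed: just free the page */
		return VM_FAULT_OOM;
	}
	/* ... */
	if (later_step_failed) {
		mem_cgroup_uncharge_page(page);	/* undo the charge before dropping the page */
		put_page(page);
		return VM_FAULT_OOM;
	}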

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -207,6 +207,7 @@ static int __do_huge_anonymous_page(stru
 	VM_BUG_ON(!PageCompound(page));
 	pgtable = pte_alloc_one(mm, address);
 	if (unlikely(!pgtable)) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		return VM_FAULT_OOM;
 	}
@@ -218,6 +219,7 @@ static int __do_huge_anonymous_page(stru
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -251,6 +253,10 @@ int do_huge_anonymous_page(struct mm_str
 				   HPAGE_ORDER);
 		if (unlikely(!page))
 			goto out;
+		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+			put_page(page);
+			goto out;
+		}
 
 		return __do_huge_anonymous_page(mm, vma,
 						address, pmd,
@@ -379,9 +385,16 @@ int do_huge_wp_page(struct mm_struct *mm
 		for (i = 0; i < HPAGE_NR; i++) {
 			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						  vma, address);
-			if (unlikely(!pages[i])) {
-				while (--i >= 0)
+			if (unlikely(!pages[i] ||
+				     mem_cgroup_newpage_charge(pages[i],
+							       mm,
+							       GFP_KERNEL))) {
+				if (pages[i])
 					put_page(pages[i]);
+				while (--i >= 0) {
+					mem_cgroup_uncharge_page(pages[i]);
+					put_page(pages[i]);
+				}
 				kfree(pages);
 				ret |= VM_FAULT_OOM;
 				goto out;
@@ -439,15 +452,21 @@ int do_huge_wp_page(struct mm_struct *mm
 		goto out;
 	}
 
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		put_page(new_page);
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
 	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);
 	__SetPageUptodate(new_page);
 
 	smp_wmb();
 
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-	else {
+	} else {
 		pmd_t entry;
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -466,8 +485,10 @@ out:
 	return ret;
 
 out_free_pages:
-	for (i = 0; i < HPAGE_NR; i++)
+	for (i = 0; i < HPAGE_NR; i++) {
+		mem_cgroup_uncharge_page(pages[i]);
 		put_page(pages[i]);
+	}
 	kfree(pages);
 	goto out_unlock;
 }


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 01 of 28] compound_lock
  2009-12-17 19:00 ` [PATCH 01 of 28] compound_lock Andrea Arcangeli
@ 2009-12-17 19:46   ` Christoph Lameter
  2009-12-18 14:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2009-12-17 19:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright

On Thu, 17 Dec 2009, Andrea Arcangeli wrote:

>  	if (unlikely(PageTail(page)))
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -108,6 +108,7 @@ enum pageflags {
>  #ifdef CONFIG_MEMORY_FAILURE
>  	PG_hwpoison,		/* hardware poisoned page. Don't touch */
>  #endif
> +	PG_compound_lock,
>  	__NR_PAGEFLAGS,

Eats up a rare page bit.

#ifdef CONFIG_TRANSP_HUGE?
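
That is, something like (a sketch of the suggestion; the config symbol in this
series is spelled CONFIG_TRANSPARENT_HUGEPAGE):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	PG_compound_lock,	/* only consume the page flag when THP is built in */
#endif
	__NR_PAGEFLAGS,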


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 02 of 28] alter compound get_page/put_page
  2009-12-17 19:00 ` [PATCH 02 of 28] alter compound get_page/put_page Andrea Arcangeli
@ 2009-12-17 19:50   ` Christoph Lameter
  2009-12-18 14:30     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2009-12-17 19:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright


Additional cachelines are now dirtied in performance-critical VM primitives.
This increases the cache footprint, etc.




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2009-12-17 19:00 ` [PATCH 28 of 28] memcg huge memory Andrea Arcangeli
@ 2009-12-17 19:54 ` Christoph Lameter
  2009-12-17 19:58   ` Rik van Riel
  2009-12-18 18:47 ` Dave Hansen
  29 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2009-12-17 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright

Would it be possible to start out with a version of huge page support that
does not require the complex splitting and joining of huge pages?

Without that we would not need additional refcounts.

Maybe a patch to allow simply the use of anonymous huge pages without a
hugetlbfs mmap in the middle? IMHO its useful even if we cannot swap it
out.




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:54 ` [PATCH 00 of 28] Transparent Hugepage support #2 Christoph Lameter
@ 2009-12-17 19:58   ` Rik van Riel
  2009-12-17 20:09     ` Christoph Lameter
                       ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Rik van Riel @ 2009-12-17 19:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

Christoph Lameter wrote:
> Would it be possible to start out with a version of huge page support that
> does not require the complex splitting and joining of huge pages?
> 
> Without that we would not need additional refcounts.
> 
> Maybe a patch to allow simply the use of anonymous huge pages without a
> hugetlbfs mmap in the middle? IMHO its useful even if we cannot swap it
> out.

Christoph, we need a way to swap these anonymous huge
pages.  You make it look as if you just want the
anonymous huge pages and a way to then veto any attempts
to make them swappable (on account of added overhead).

I believe it will be more useful if we figure out a way
forward together.  Do you have any ideas on how to solve
the hugepage swapping problem?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:58   ` Rik van Riel
@ 2009-12-17 20:09     ` Christoph Lameter
  2009-12-18  5:12       ` Ingo Molnar
  2009-12-18 14:05       ` [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
  2009-12-17 20:47     ` Mike Travis
  2009-12-18 12:52     ` Avi Kivity
  2 siblings, 2 replies; 89+ messages in thread
From: Christoph Lameter @ 2009-12-17 20:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, 17 Dec 2009, Rik van Riel wrote:

> Christoph Lameter wrote:
> > Would it be possible to start out with a version of huge page support that
> > does not require the complex splitting and joining of huge pages?
> >
> > Without that we would not need additional refcounts.
> >
> > Maybe a patch to allow simply the use of anonymous huge pages without a
> > hugetlbfs mmap in the middle? IMHO its useful even if we cannot swap it
> > out.
>
> Christoph, we need a way to swap these anonymous huge
> pages.  You make it look as if you just want the
> anonymous huge pages and a way to then veto any attempts
> to make them swappable (on account of added overhead).

Can we do this step by step? This splitting thing and its
associated overhead causes me concern.

> I believe it will be more useful if we figure out a way
> forward together.  Do you have any ideas on how to solve
> the hugepage swapping problem?

Frankly I am not sure that there is a problem. The word swap is mostly
synonymous with "problem". Huge pages are good. I dont think one
needs to necessarily associate something good (huge page) with a known
problem (swap) otherwise the whole may not improve.







^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:58   ` Rik van Riel
  2009-12-17 20:09     ` Christoph Lameter
@ 2009-12-17 20:47     ` Mike Travis
  2009-12-18  3:28       ` Rik van Riel
  2009-12-18 14:12       ` Andrea Arcangeli
  2009-12-18 12:52     ` Avi Kivity
  2 siblings, 2 replies; 89+ messages in thread
From: Mike Travis @ 2009-12-17 20:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton



Rik van Riel wrote:
> Christoph Lameter wrote:
>> Would it be possible to start out with a version of huge page support 
>> that
>> does not require the complex splitting and joining of huge pages?
>>
>> Without that we would not need additional refcounts.
>>
>> Maybe a patch to allow simply the use of anonymous huge pages without a
>> hugetlbfs mmap in the middle? IMHO its useful even if we cannot swap it
>> out.
> 
> Christoph, we need a way to swap these anonymous huge
> pages.  You make it look as if you just want the
> anonymous huge pages and a way to then veto any attempts
> to make them swappable (on account of added overhead).

On very large SMP systems with huge amounts of memory, the
gains from huge pages will be significant.  And swapping
will not be an issue.  I agree that the two should be
split up and perhaps even make swapping an option?

> 
> I believe it will be more useful if we figure out a way
> forward together.  Do you have any ideas on how to solve
> the hugepage swapping problem?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 27 of 28] memcg compound
  2009-12-17 19:00 ` [PATCH 27 of 28] memcg compound Andrea Arcangeli
@ 2009-12-18  1:27   ` KAMEZAWA Hiroyuki
  2009-12-18 16:02     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-18  1:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 17 Dec 2009 19:00:30 -0000
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Teach memcg to charge/uncharge compound pages.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Hmm.

> ---
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1288,15 +1288,20 @@ static atomic_t memcg_drain_count;
>   * cgroup which is not current target, returns false. This stock will be
>   * refilled.
>   */
> -static bool consume_stock(struct mem_cgroup *mem)
> +static bool consume_stock(struct mem_cgroup *mem, int *page_size)
>  {
>  	struct memcg_stock_pcp *stock;
>  	bool ret = true;
>  
>  	stock = &get_cpu_var(memcg_stock);
> -	if (mem == stock->cached && stock->charge)
> -		stock->charge -= PAGE_SIZE;
> -	else /* need to call res_counter_charge */
> +	if (mem == stock->cached && stock->charge) {
> +		if (*page_size > stock->charge) {
> +			*page_size -= stock->charge;
> +			stock->charge = 0;
> +			ret = false;
> +		} else
> +			stock->charge -= *page_size;
> +	} else /* need to call res_counter_charge */
>  		ret = false;

I feel we should skip this per-cpu caching method here, because the counter
overflow rate is the key to this workaround.
Then,
	if (size == PAGESIZE)
		consume_stock()
seems better to me.
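
A minimal sketch of that alternative (not code from the patch), keeping
consume_stock() in its original single-argument form and bypassing the per-cpu
stock for anything larger than a base page:

	/* only use the per-cpu stock for ordinary PAGE_SIZE charges */
	if (page_size == PAGE_SIZE && consume_stock(mem))
		goto charged;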



>  	put_cpu_var(memcg_stock);
>  	return ret;
> @@ -1401,13 +1406,13 @@ static int __cpuinit memcg_stock_cpu_cal
>   * oom-killer can be invoked.
>   */
>  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> -			gfp_t gfp_mask, struct mem_cgroup **memcg,
> -			bool oom, struct page *page)
> +				   gfp_t gfp_mask, struct mem_cgroup **memcg,
> +				   bool oom, struct page *page, int page_size)
>  {
>  	struct mem_cgroup *mem, *mem_over_limit;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct res_counter *fail_res;
> -	int csize = CHARGE_SIZE;
> +	int csize = max(page_size, (int) CHARGE_SIZE);
>  
Do we need max() here?

>  	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
>  		/* Don't account this! */
> @@ -1439,7 +1444,7 @@ static int __mem_cgroup_try_charge(struc
>  		int ret = 0;
>  		unsigned long flags = 0;
>  
> -		if (consume_stock(mem))
> +		if (consume_stock(mem, &page_size))
>  			goto charged;
>  
I think we should skip this.

>  		ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1460,8 +1465,8 @@ static int __mem_cgroup_try_charge(struc
>  									res);
>  
>  		/* reduce request size and retry */
> -		if (csize > PAGE_SIZE) {
> -			csize = PAGE_SIZE;
> +		if (csize > page_size) {
> +			csize = page_size;
>  			continue;
>  		}
>  		if (!(gfp_mask & __GFP_WAIT))
> @@ -1491,8 +1496,8 @@ static int __mem_cgroup_try_charge(struc
>  			goto nomem;
>  		}
>  	}
> -	if (csize > PAGE_SIZE)
> -		refill_stock(mem, csize - PAGE_SIZE);
> +	if (csize > page_size)
> +		refill_stock(mem, csize - page_size);

And skip this.

>  charged:
>  	/*
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> @@ -1512,12 +1517,12 @@ nomem:
>   * This function is for that and do uncharge, put css's refcnt.
>   * gotten by try_charge().
>   */
> -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem, int page_size)
>  {
>  	if (!mem_cgroup_is_root(mem)) {
> -		res_counter_uncharge(&mem->res, PAGE_SIZE);
> +		res_counter_uncharge(&mem->res, page_size);
>  		if (do_swap_account)
> -			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +			res_counter_uncharge(&mem->memsw, page_size);
>  	}
>  	css_put(&mem->css);
>  }
> @@ -1575,8 +1580,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>   */
>  
>  static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> -				     struct page_cgroup *pc,
> -				     enum charge_type ctype)
> +				       struct page_cgroup *pc,
> +				       enum charge_type ctype,
> +				       int page_size)
>  {
>  	/* try_charge() can return NULL to *memcg, taking care of it. */
>  	if (!mem)
> @@ -1585,7 +1591,7 @@ static void __mem_cgroup_commit_charge(s
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
> -		mem_cgroup_cancel_charge(mem);
> +		mem_cgroup_cancel_charge(mem, page_size);
>  		return;
>  	}
>  
> @@ -1722,7 +1728,8 @@ static int mem_cgroup_move_parent(struct
>  		goto put;
>  
>  	parent = mem_cgroup_from_cont(pcg);
> -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
> +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
> +		PAGE_SIZE);
>  	if (ret || !parent)
>  		goto put_back;
>  
> @@ -1730,7 +1737,7 @@ static int mem_cgroup_move_parent(struct
>  	if (!ret)
>  		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
>  	else
> -		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +		mem_cgroup_cancel_charge(parent, PAGE_SIZE); /* does css_put */
>  put_back:
>  	putback_lru_page(page);
>  put:

Ah.. Hmm... this will be much more complicated after Nishimura's "task move"
method is merged. But it's OK for this patch itself.


> @@ -1752,6 +1759,11 @@ static int mem_cgroup_charge_common(stru
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  	int ret;
> +	int page_size = PAGE_SIZE;
> +
> +	VM_BUG_ON(PageTail(page));
> +	if (PageHead(page))
> +		page_size <<= compound_order(page);
>  
>  	pc = lookup_page_cgroup(page);
>  	/* can happen at boot */
> @@ -1760,11 +1772,12 @@ static int mem_cgroup_charge_common(stru
>  	prefetchw(pc);
>  
>  	mem = memcg;
> -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
> +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page,
> +				      page_size);
>  	if (ret || !mem)
>  		return ret;
>  
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
>  	return 0;
>  }
>  
> @@ -1773,8 +1786,6 @@ int mem_cgroup_newpage_charge(struct pag
>  {
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	if (PageCompound(page))
> -		return 0;
>  	/*
>  	 * If already mapped, we don't have to account.
>  	 * If page cache, page->mapping has address_space.
> @@ -1787,7 +1798,7 @@ int mem_cgroup_newpage_charge(struct pag
>  	if (unlikely(!mm))
>  		mm = &init_mm;
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
>  static void
> @@ -1880,14 +1891,14 @@ int mem_cgroup_try_charge_swapin(struct 
>  	if (!mem)
>  		goto charge_cur_mm;
>  	*ptr = mem;
> -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page);
> +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, page, PAGE_SIZE);
>  	/* drop extra refcnt from tryget */
>  	css_put(&mem->css);
>  	return ret;
>  charge_cur_mm:
>  	if (unlikely(!mm))
>  		mm = &init_mm;
> -	return __mem_cgroup_try_charge(mm, mask, ptr, true, page);
> +	return __mem_cgroup_try_charge(mm, mask, ptr, true, page, PAGE_SIZE);
>  }
>  
>  static void
> @@ -1903,7 +1914,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  	cgroup_exclude_rmdir(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
> -	__mem_cgroup_commit_charge(ptr, pc, ctype);
> +	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
>  	mem_cgroup_lru_add_after_commit_swapcache(page);
>  	/*
>  	 * Now swap is on-memory. This means this page may be
> @@ -1952,11 +1963,12 @@ void mem_cgroup_cancel_charge_swapin(str
>  		return;
>  	if (!mem)
>  		return;
> -	mem_cgroup_cancel_charge(mem);
> +	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
>  }
>  
>  static void
> -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
> +	      int page_size)
>  {
>  	struct memcg_batch_info *batch = NULL;
>  	bool uncharge_memsw = true;
> @@ -1989,14 +2001,14 @@ __do_uncharge(struct mem_cgroup *mem, co
>  	if (batch->memcg != mem)
>  		goto direct_uncharge;
>  	/* remember freed charge and uncharge it later */
> -	batch->bytes += PAGE_SIZE;
> +	batch->bytes += page_size;
>  	if (uncharge_memsw)
> -		batch->memsw_bytes += PAGE_SIZE;
> +		batch->memsw_bytes += page_size;
>  	return;
>  direct_uncharge:
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->res, page_size);
>  	if (uncharge_memsw)
> -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&mem->memsw, page_size);
>  	return;
>  }
>  
> @@ -2009,6 +2021,11 @@ __mem_cgroup_uncharge_common(struct page
>  	struct page_cgroup *pc;
>  	struct mem_cgroup *mem = NULL;
>  	struct mem_cgroup_per_zone *mz;
> +	int page_size = PAGE_SIZE;
> +
> +	VM_BUG_ON(PageTail(page));
> +	if (PageHead(page))
> +		page_size <<= compound_order(page);
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -2016,6 +2033,8 @@ __mem_cgroup_uncharge_common(struct page
>  	if (PageSwapCache(page))
>  		return NULL;
>  
> +	VM_BUG_ON(PageTail(page));
> +
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> @@ -2048,7 +2067,7 @@ __mem_cgroup_uncharge_common(struct page
>  	}
>  
>  	if (!mem_cgroup_is_root(mem))
> -		__do_uncharge(mem, ctype);
> +		__do_uncharge(mem, ctype, page_size);
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>  		mem_cgroup_swap_statistics(mem, true);
>  	mem_cgroup_charge_statistics(mem, pc, false);
> @@ -2217,7 +2236,7 @@ int mem_cgroup_prepare_migration(struct 
>  
>  	if (mem) {
>  		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> -						page);
> +					      page, PAGE_SIZE);
>  		css_put(&mem->css);
>  	}
>  	*ptr = mem;
> @@ -2260,7 +2279,7 @@ void mem_cgroup_end_migration(struct mem
>  	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
>  	 * So, double-counting is effectively avoided.
>  	 */
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
>  
>  	/*
>  	 * Both of oldpage and newpage are still under lock_page().
> 

Thank you! Seems simpler than expected!

Regards,
-Kame



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-17 19:00 ` [PATCH 28 of 28] memcg huge memory Andrea Arcangeli
@ 2009-12-18  1:33   ` KAMEZAWA Hiroyuki
  2009-12-18 16:04     ` Andrea Arcangeli
  2009-12-24 10:00   ` Balbir Singh
  1 sibling, 1 reply; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-18  1:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Thu, 17 Dec 2009 19:00:31 -0000
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add memcg charge/uncharge to hugepage faults in huge_memory.c.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Seems nice.

Then, maybe we (I?) should cut this part (and some from 27/28) out and
merge it into memcg. That would be helpful to all of your work.

But I don't like a situation in which memcg's charge is filled with _locked_ memory.
(Especially, a badly configured softlimit plus hugepages would add a lot of regression.)
A new counter for "usage of huge pages" will be required for memcg, at least.

Thanks,
-Kame

> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -207,6 +207,7 @@ static int __do_huge_anonymous_page(stru
>  	VM_BUG_ON(!PageCompound(page));
>  	pgtable = pte_alloc_one(mm, address);
>  	if (unlikely(!pgtable)) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		return VM_FAULT_OOM;
>  	}
> @@ -218,6 +219,7 @@ static int __do_huge_anonymous_page(stru
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> @@ -251,6 +253,10 @@ int do_huge_anonymous_page(struct mm_str
>  				   HPAGE_ORDER);
>  		if (unlikely(!page))
>  			goto out;
> +		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
> +			put_page(page);
> +			goto out;
> +		}
>  
>  		return __do_huge_anonymous_page(mm, vma,
>  						address, pmd,
> @@ -379,9 +385,16 @@ int do_huge_wp_page(struct mm_struct *mm
>  		for (i = 0; i < HPAGE_NR; i++) {
>  			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
>  						  vma, address);
> -			if (unlikely(!pages[i])) {
> -				while (--i >= 0)
> +			if (unlikely(!pages[i] ||
> +				     mem_cgroup_newpage_charge(pages[i],
> +							       mm,
> +							       GFP_KERNEL))) {
> +				if (pages[i])
>  					put_page(pages[i]);
> +				while (--i >= 0) {
> +					mem_cgroup_uncharge_page(pages[i]);
> +					put_page(pages[i]);
> +				}
>  				kfree(pages);
>  				ret |= VM_FAULT_OOM;
>  				goto out;
> @@ -439,15 +452,21 @@ int do_huge_wp_page(struct mm_struct *mm
>  		goto out;
>  	}
>  
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		put_page(new_page);
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
>  	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);
>  	__SetPageUptodate(new_page);
>  
>  	smp_wmb();
>  
>  	spin_lock(&mm->page_table_lock);
> -	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
> +		mem_cgroup_uncharge_page(new_page);
>  		put_page(new_page);
> -	else {
> +	} else {
>  		pmd_t entry;
>  		entry = mk_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -466,8 +485,10 @@ out:
>  	return ret;
>  
>  out_free_pages:
> -	for (i = 0; i < HPAGE_NR; i++)
> +	for (i = 0; i < HPAGE_NR; i++) {
> +		mem_cgroup_uncharge_page(pages[i]);
>  		put_page(pages[i]);
> +	}
>  	kfree(pages);
>  	goto out_unlock;
>  }
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 20:47     ` Mike Travis
@ 2009-12-18  3:28       ` Rik van Riel
  2009-12-18 14:12       ` Andrea Arcangeli
  1 sibling, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2009-12-18  3:28 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On 12/17/2009 03:47 PM, Mike Travis wrote:
> Rik van Riel wrote:
>> Christoph Lameter wrote:

>> Christoph, we need a way to swap these anonymous huge
>> pages. You make it look as if you just want the
>> anonymous huge pages and a way to then veto any attempts
>> to make them swappable (on account of added overhead).
>
> On very large SMP systems with huge amounts of memory, the
> gains from huge pages will be significant. And swapping
> will not be an issue. I agree that the two should be
> split up and perhaps even make swapping an option?

With virtualization, people generally want to oversubscribe
their systems a little bit.  This makes swapping pretty much
a mandatory feature.

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 20:09     ` Christoph Lameter
@ 2009-12-18  5:12       ` Ingo Molnar
  2009-12-18  6:18         ` KOSAKI Motohiro
  2009-12-18 18:28         ` Christoph Lameter
  2009-12-18 14:05       ` [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
  1 sibling, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-12-18  5:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton,
	Stephen C. Tweedie


* Christoph Lameter <cl@linux-foundation.org> wrote:

> On Thu, 17 Dec 2009, Rik van Riel wrote:
> 
> > I believe it will be more useful if we figure out a way forward together.  
> > Do you have any ideas on how to solve the hugepage swapping problem?
> 
> Frankly I am not sure that there is a problem. The word swap is mostly 
> synonymous with "problem". Huge pages are good. I dont think one needs to 
> necessarily associate something good (huge page) with a known problem (swap) 
> otherwise the whole may not improve.

Swapping in the VM is 'reality', not some fringe feature. Almost every big 
enterprise shop cares about it.

Note that it became more relevant in the past few years due to the arrival of 
low-latency, lots-of-iops and cheap SSDs. Even on a low end server you can buy 
a good 160 GB SSD for emergency swap with fantastic latency and for a lot less 
money than 160 GB of real RAM. (which RAM won't even fit physically on typical
mainboards, is much more expensive and uses up more power and is less
serviceable)

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18  5:12       ` Ingo Molnar
@ 2009-12-18  6:18         ` KOSAKI Motohiro
  2009-12-18 18:28         ` Christoph Lameter
  1 sibling, 0 replies; 89+ messages in thread
From: KOSAKI Motohiro @ 2009-12-18  6:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Christoph Lameter, Rik van Riel,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton,
	Stephen C. Tweedie

> 
> * Christoph Lameter <cl@linux-foundation.org> wrote:
> 
> > On Thu, 17 Dec 2009, Rik van Riel wrote:
> > 
> > > I believe it will be more useful if we figure out a way forward together.  
> > > Do you have any ideas on how to solve the hugepage swapping problem?
> > 
> > Frankly I am not sure that there is a problem. The word swap is mostly 
> > synonymous with "problem". Huge pages are good. I dont think one needs to 
> > necessarily associate something good (huge page) with a known problem (swap) 
> > otherwise the whole may not improve.
> 
> Swapping in the VM is 'reality', not some fringe feature. Almost every big 
> enterprise shop cares about it.
> 
> Note that it became more relevant in the past few years due to the arrival of 
> low-latency, lots-of-iops and cheap SSDs. Even on a low end server you can buy 
> a good 160 GB SSD for emergency swap with fantastic latency and for a lot less 
> money than 160 GB of real RAM. (which RAM wont even fit physically on typical 
> mainboards, is much more expensive and uses up more power and is less 
> servicable)

Agreed. This isn't an artificial example. Recently I heard about a use
case from a major web service company in Japan. They use lots of cheap
servers (about $700 per server), and they said that a small amount of
RAM plus an SSD is a good choice for them.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:58   ` Rik van Riel
  2009-12-17 20:09     ` Christoph Lameter
  2009-12-17 20:47     ` Mike Travis
@ 2009-12-18 12:52     ` Avi Kivity
  2 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2009-12-18 12:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Lameter, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On 12/17/2009 09:58 PM, Rik van Riel wrote:
>> Maybe a patch to allow simply the use of anonymous huge pages without a
>> hugetlbfs mmap in the middle? IMHO its useful even if we cannot swap it
>> out.
>
>
> Christoph, we need a way to swap these anonymous huge
> pages.  You make it look as if you just want the
> anonymous huge pages and a way to then veto any attempts
> to make them swappable (on account of added overhead).

On top of swap, we want ballooning and samepage merging to work with 
large pages.

As others have noted, swap may come back into fashion using SSDs
(assuming SSDs are significantly cheaper than RAM).

There is also ramzswap which is plenty fast.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 20:09     ` Christoph Lameter
  2009-12-18  5:12       ` Ingo Molnar
@ 2009-12-18 14:05       ` Andrea Arcangeli
  2009-12-18 18:33         ` Christoph Lameter
  1 sibling, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 14:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, Dec 17, 2009 at 02:09:47PM -0600, Christoph Lameter wrote:
> Can we do this step by step? This splitting thing and its
> associated overhead causes me concerns.

The whole point of the split_huge_page* functionality is exactly to do
things step by step. Removing it would mean doing it all at once.

This is like the big kernel lock when SMP was initially
introduced. Surely the kernel would have been a little faster if the big
kernel lock had never been introduced, but over time split_huge_page can
be removed just like the big kernel lock has been removed. Then
PG_compound_lock can go away too.

> Frankly I am not sure that there is a problem. The word swap is mostly
> synonymous with "problem". Huge pages are good. I dont think one
> needs to necessarily associate something good (huge page) with a known
> problem (swap) otherwise the whole may not improve.

Others already answered extensively on why it is needed. Also look at
Hugh's effort to make KSM pages swappable.

Plus, locking the huge pages in RAM wouldn't actually remove the need
for split_huge_page in all the other places in the VM that aren't
hugepage aware yet and where there is no urgency to make them swap
aware either.

NOTE: especially after "echo madvise >
/sys/kernel/mm/transparent_hugepage/enabled" the risk of overhead
caused by split_huge_page is exactly zero (well, unless you swap, but
at that point you're generally I/O bound, and the contention on the
anon_vma lock is surely a bigger scalability concern than the CPU cost
of splitting, with or without split_huge_page).

Also, for hugetlbfs the overhead caused by the PG_compound_lock taken
on tail pages is zero for anything but O_DIRECT; O_DIRECT is the only
thing that can call put_page on tail pages. Everything else only works
with head pages, and with head pages there is zero slowdown caused by
the PG_compound_lock. This is true for transparent hugepages too in
fact, and O_DIRECT is I/O bound so the PG_compound_lock shouldn't be a
big issue given it is a per-compound-page lock and so fully scalable.
In the future mmu notifier users that call gup will stop using
FOLL_GET and in turn they will stop calling put_page, eliminating any
need to take the PG_compound_lock in all KVM fast paths.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 20:47     ` Mike Travis
  2009-12-18  3:28       ` Rik van Riel
@ 2009-12-18 14:12       ` Andrea Arcangeli
  1 sibling, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 14:12 UTC (permalink / raw)
  To: Mike Travis
  Cc: Rik van Riel, Christoph Lameter, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, Dec 17, 2009 at 12:47:34PM -0800, Mike Travis wrote:
> On very large SMP systems with huge amounts of memory, the
> gains from huge pages will be significant.  And swapping
> will not be an issue.  I agree that the two should be
> split up and perhaps even make swapping an option?

I think swapoff -a will already give you what you want without any
need to change the code. Especially using echo madvise > enabled, the
only overhead you can complain about is the need of PG_compound_lock
in put_page called by O_DIRECT I/O completion handlers, everything
else will gain nothing by disabling swap or removing split_huge_page.

Thinking about swap running full, let's add this too, just in case swap
gets full and we would otherwise split for no gain... If add_to_swap
fails, try_to_unmap isn't called, hence the page will never be split.

diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -152,14 +152,16 @@ int add_to_swap(struct page *page)
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageUptodate(page));
 
-	if (unlikely(PageCompound(page)))
-		if (unlikely(split_huge_page(page)))
-			return 0;
-
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
 
+	if (unlikely(PageCompound(page)))
+		if (unlikely(split_huge_page(page))) {
+			swapcache_free(entry, NULL);
+			return 0;
+		}
+
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 01 of 28] compound_lock
  2009-12-17 19:46   ` Christoph Lameter
@ 2009-12-18 14:27     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 14:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, Dec 17, 2009 at 01:46:50PM -0600, Christoph Lameter wrote:
> On Thu, 17 Dec 2009, Andrea Arcangeli wrote:
> 
> >  	if (unlikely(PageTail(page)))
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -108,6 +108,7 @@ enum pageflags {
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  	PG_hwpoison,		/* hardware poisoned page. Don't touch */
> >  #endif
> > +	PG_compound_lock,
> >  	__NR_PAGEFLAGS,
> 
> Eats up a rare page bit.
> 
> #ifdef CONFIG_TRANSP_HUGE?

It can't go under #ifdef unless I also put the whole of the compound
page refcounting changes of patch 2/28 under #ifdef. Let me know if
this is what you're asking: it would be very feasible to not have the
PG_compound_lock logic in the compound get_page/put_page when
CONFIG_TRANSPARENT_HUGEPAGE=n. I just thought it wasn't worth it
because the only slowdown introduced in get_page/put_page by 2/28 for
hugetlbfs happens in O_DIRECT completion handlers. The reason
hugetlbfs can implement a backwards-compatible get_page/put_page
without the need for PG_compound_lock and without the need to refcount
how many pins there are on each tail page is that a hugepage managed
by hugetlbfs can't be split and swapped out. So I can optimize away
that PG_compound_lock with CONFIG_TRANSPARENT_HUGEPAGE=n if you want.
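
Something along these lines would do it (only a sketch to show the
shape of it, assuming the compound_lock()/compound_unlock() helpers
from 01/28 are built on bit_spin_lock of PG_compound_lock; this is not
the actual patch):

static inline void compound_lock(struct page *page)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	bit_spin_lock(PG_compound_lock, &page->flags);
#endif
}

static inline void compound_unlock(struct page *page)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	bit_spin_unlock(PG_compound_lock, &page->flags);
#endif
}

With CONFIG_TRANSPARENT_HUGEPAGE=n both helpers compile away, so a
hugetlbfs-only O_DIRECT workload keeps today's behaviour, and
PG_compound_lock itself could be defined under the same #ifdef so the
rare page flag bit isn't consumed either.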

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 02 of 28] alter compound get_page/put_page
  2009-12-17 19:50   ` Christoph Lameter
@ 2009-12-18 14:30     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 14:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Thu, Dec 17, 2009 at 01:50:10PM -0600, Christoph Lameter wrote:
> 
> Additional cachelines are dirtied in performance critical VM primitives
> now. Increases cache footprint etc.

The only slowdown added is to put_page called on compound _tail_
pages. Everything runs as fast as always on regular pages, hugetlbfs
head pages and transparent hugepage head pages alike. The only thing
that ever calls put_page on a compound tail page is the O_DIRECT I/O
completion handler, which is anything but performance critical given
it is I/O dominated.
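
To make it concrete, the structure is roughly this (a sketch only, not
the actual 2/28 code; the refcount juggling and the compound destructor
handling are elided):

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* in practice only the O_DIRECT completion handler gets here */
		struct page *head = compound_head(page);

		compound_lock(head);	/* serialize against split_huge_page */
		/* ... drop the tail pin and the head reference it holds ... */
		compound_unlock(head);
		return;
	}
	/* compound head pages (hugetlbfs included): no lock taken, the
	   release path stays what it has always been */
	if (put_page_testzero(page))
		; /* ... run the compound destructor as usual ... */
}

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
	else if (put_page_testzero(page))
		__page_cache_release(page);	/* regular pages: unchanged fast path */
}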

The users that aren't I/O dominated and that don't deal with I/O DMA
(like the KVM minor fault and the GRU tlb miss handler) must start
using mmu notifiers, stop calling gup with FOLL_GET and never need to
call put_page at all, so they will run faster with or without 2/28
(and they won't interfere with KSM merging [KSM can't merge pages that
are pinned, to avoid corrupting in-flight DMA], and they will be
pageable).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 27 of 28] memcg compound
  2009-12-18  1:27   ` KAMEZAWA Hiroyuki
@ 2009-12-18 16:02     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 16:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Fri, Dec 18, 2009 at 10:27:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1288,15 +1288,20 @@ static atomic_t memcg_drain_count;
> >   * cgroup which is not current target, returns false. This stock will be
> >   * refilled.
> >   */
> > -static bool consume_stock(struct mem_cgroup *mem)
> > +static bool consume_stock(struct mem_cgroup *mem, int *page_size)
> >  {
> >  	struct memcg_stock_pcp *stock;
> >  	bool ret = true;
> >  
> >  	stock = &get_cpu_var(memcg_stock);
> > -	if (mem == stock->cached && stock->charge)
> > -		stock->charge -= PAGE_SIZE;
> > -	else /* need to call res_counter_charge */
> > +	if (mem == stock->cached && stock->charge) {
> > +		if (*page_size > stock->charge) {
> > +			*page_size -= stock->charge;
> > +			stock->charge = 0;
> > +			ret = false;
> > +		} else
> > +			stock->charge -= *page_size;
> > +	} else /* need to call res_counter_charge */
> >  		ret = false;
> 
> I feel we should we skip this per-cpu caching method because counter overflow
> rate is the key for this workaround.
> Then,
> 	if (size == PAGESIZE)
> 		consume_stock()
> seems better to me.

Ok, I did it the way I did it to be sure to never underestimate the
still-available reserved space. Wasting 128k per cgroup seems no big
deal to me, so I can skip it. Clearly, performance-wise, including the
per-cpu reservation was worthless for 2M pages (the reservation is only
128k...); it was only there to keep the accounting as strict as it
should be, because the other code there really went all the way down to
csize = page_size in case of failure and tried again. But then it
didn't send an IPI to the other cpus to release their stock. So
basically you should also remove that "retry" event, which looks pretty
worthless with the other cpu queues not drained before retrying and
with hugepages bypassing the cache entirely. Just assume the cache is
an error of 128k*nr_cpus.
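
In __mem_cgroup_try_charge() that would look something like this
(sketch of the suggestion only, using the signatures from 27/28; the
label and the exact placement are illustrative, it's not the real
code):

	/* the per-cpu stock only ever serves order-0 charges, huge
	   pages always go straight to res_counter_charge() with the
	   full page_size */
	if (page_size == PAGE_SIZE && consume_stock(mem))
		goto done;	/* label name is illustrative */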

> >  	put_cpu_var(memcg_stock);
> >  	return ret;
> > @@ -1401,13 +1406,13 @@ static int __cpuinit memcg_stock_cpu_cal
> >   * oom-killer can be invoked.
> >   */
> >  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > -			gfp_t gfp_mask, struct mem_cgroup **memcg,
> > -			bool oom, struct page *page)
> > +				   gfp_t gfp_mask, struct mem_cgroup **memcg,
> > +				   bool oom, struct page *page, int page_size)
> >  {
> >  	struct mem_cgroup *mem, *mem_over_limit;
> >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct res_counter *fail_res;
> > -	int csize = CHARGE_SIZE;
> > +	int csize = max(page_size, (int) CHARGE_SIZE);
> >  
> we need max() ?

Not sure I understand the question. max(2M, 128k) looks ok.

> I think should skip this.
> And skip this.

as per above, ok.

> Ah..Hmm...this will be much complicated after Nishimura's "task move" method
> is merged. But ok, for this patch itself.

So my hope is that my patch goes in first so I don't have to make the
much more complicated fix ahahaa ;). Just kidding... Well, it's up to
you how you want to handle this.

> Thank you! Seems simpler than expected!

You're welcome. Thanks for the review. So how do we want to go from
here: will you incorporate those changes yourself, so I only have to
maintain the huge_memory.c part that depends on the above? The above is
transparent hugepage agnostic. For the time being I guess I am forced
to also keep it in my patchset, otherwise the kernel would fail if
somebody uses the mem cgroup, but the ideal is to keep this patch in
sync and drop it as soon as it goes in.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-18  1:33   ` KAMEZAWA Hiroyuki
@ 2009-12-18 16:04     ` Andrea Arcangeli
  2009-12-18 23:06       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-18 16:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Fri, Dec 18, 2009 at 10:33:12AM +0900, KAMEZAWA Hiroyuki wrote:
> Then, maybe we (I?) should cut this part (and some from 27/28) out and
> merge into memcg. It will be helpful to all your work.

You can't merge this part, as huge_memory.c isn't there yet. But you
should merge 27/28 instead; that one is self-contained.

> But I don't like a situation which memcg's charge are filled with _locked_ memory.

There's no locked memory here. It's all swappable.

> (Especially, bad-configured softlimit+hugepage will adds much regression.)
> New counter as "usage of huge page" will be required for memcg, at least.

No, hugepages are fully transparent and userland can't possibly know
whether it's running on hugepages or regular pages. The only difference
is userland going faster; everything else is identical, so there's no
need for any new memcg counter.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18  5:12       ` Ingo Molnar
  2009-12-18  6:18         ` KOSAKI Motohiro
@ 2009-12-18 18:28         ` Christoph Lameter
  2009-12-18 18:41           ` Dave Hansen
  1 sibling, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2009-12-18 18:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton,
	Stephen C. Tweedie

On Fri, 18 Dec 2009, Ingo Molnar wrote:

> Note that it became more relevant in the past few years due to the arrival of
> low-latency, lots-of-iops and cheap SSDs. Even on a low end server you can buy
> a good 160 GB SSD for emergency swap with fantastic latency and for a lot less
> money than 160 GB of real RAM. (which RAM wont even fit physically on typical
> mainboards, is much more expensive and uses up more power and is less
> servicable)

Swap occurs in page-size chunks. SSDs may help, but it's still a disaster
area. You can only realistically use swap in a batch environment. It kills
desktop performance etc etc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18 14:05       ` [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
@ 2009-12-18 18:33         ` Christoph Lameter
  2009-12-19 15:09           ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2009-12-18 18:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, 18 Dec 2009, Andrea Arcangeli wrote:

> On Thu, Dec 17, 2009 at 02:09:47PM -0600, Christoph Lameter wrote:
> > Can we do this step by step? This splitting thing and its
> > associated overhead causes me concerns.
>
> The split_huge_page* functionality whole point is exactly to do things
> step by step. Removing it would mean doing it all at once.

The split huge page thing involved introducing new refcounting and locking
features into the VM. Not a first step thing. And certainly difficult to
verify if it is correct.

> This is like the big kernel lock when SMP initially was
> introduced. Surely kernel would have been a little faster if the big
> kernel lock was never introduced but over time the split_huge_page can
> be removed just like the big kernel lock has been removed. Then the
> PG_compound_lock can go away too.

That is a pretty strange comparison. Split huge page is like introducing
the split pte lock after removing the bkl. You first want to solve the
simpler issues (anon huge) and then see if there is a way to avoid
introducing new locking methods.

> scalable. In the future mmu notifier users that calls gup will stop
> using FOLL_GET and in turn they will stop calling put_page, so
> eliminating any need to take the PG_compound_lock in all KVM fast paths.

Maybe do that first then and never introduce the lock in the first place?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18 18:28         ` Christoph Lameter
@ 2009-12-18 18:41           ` Dave Hansen
  2009-12-18 19:17             ` Mike Travis
  0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2009-12-18 18:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Rik van Riel, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Chris Wright, Andrew Morton, Stephen C. Tweedie

On Fri, 2009-12-18 at 12:28 -0600, Christoph Lameter wrote:
> On Fri, 18 Dec 2009, Ingo Molnar wrote:
> > Note that it became more relevant in the past few years due to the arrival of
> > low-latency, lots-of-iops and cheap SSDs. Even on a low end server you can buy
> > a good 160 GB SSD for emergency swap with fantastic latency and for a lot less
> > money than 160 GB of real RAM. (which RAM wont even fit physically on typical
> > mainboards, is much more expensive and uses up more power and is less
> > servicable)
> 
> Swap occurs in page size chunks. SSDs may help but its still a desaster
> area. You can only realistically use swap in a batch environment. It kills
> desktop performance etc etc.

True...  Let's say it takes you down to 20% of native performance.
There are plenty of cases where people are selling Xen or KVM slices
where 20% of native performance is more than *fine*.  It may also let
you have VMs that are 3x more dense than they would be able to be
otherwise.  Yes, it kills performance, but performance isn't everything.

For many people price/performance is much more important, and swapping
really helps the price side of that equation.

We *do* need to work on making swap more useful in a wide range of
workloads, especially since SSDs have changed some of our assumptions
about swap.  I just got a laptop SSD this week, and tuned swappiness so
that I'd get some more swap activity.  Things really bogged down, so I
*know* there's work to do there.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2009-12-17 19:54 ` [PATCH 00 of 28] Transparent Hugepage support #2 Christoph Lameter
@ 2009-12-18 18:47 ` Dave Hansen
  2009-12-19 15:20   ` Andrea Arcangeli
  29 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2009-12-18 18:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, 2009-12-17 at 19:00 +0000, Andrea Arcangeli wrote:
> This is an update of my status on the transparent hugepage patchset. Quite
> some changes happened in the last two weeks as I handled all feedback
> provided so far (notably from Avi, Andi, Nick and others), and continuted on
> the original todo list.

For what it's worth, I tried doing some of this a few months ago
to see how feasible it was.  I ended up doing a bunch of the same stuff,
like having the preallocated pte_page() hanging off the mm.  I think I
tied directly into the pte_offset_*() functions instead of introducing
new ones, but the concept was the same: as much as possible, *don't*
teach the VM about huge pages, split them instead.

I ended up getting hung up on some of the PMD locking, and I think using
the PMD bit like that is a fine solution.  The way these are split up
also looks good to me.

Except for some of the stuff in put_compound_page(), these look pretty
sane to me in general.  I'll go through them in more detail after the
holidays.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 10 of 28] add pmd mangling functions to x86
  2009-12-17 19:00 ` [PATCH 10 of 28] add pmd mangling functions to x86 Andrea Arcangeli
@ 2009-12-18 18:56   ` Mel Gorman
  2009-12-19 15:27     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Mel Gorman @ 2009-12-18 18:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

(As a side-note, I am going off-line until after the new year fairly soon.
I'm not doing a proper review at the moment, just taking a first read to
see what's here. Sorry I didn't get a chance to read V1.)

On Thu, Dec 17, 2009 at 07:00:13PM -0000, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add needed pmd mangling functions with simmetry with their pte counterparts.

Silly question: this assumes the bits used in the PTE are not being used in
the PMD for something else, right? Is that guaranteed to be safe? According
to the AMD manual it's fine, but is it typically true on other architectures?

> pmdp_freeze_flush is the only exception only present on the pmd side and it's
> needed to serialize the VM against split_huge_page, it simply atomically clears
> the present bit in the same way pmdp_clear_flush_young atomically clears the
> accessed bit (and both need to flush the tlb to make it effective, which is
> mandatory to happen synchronously for pmdp_freeze_flush).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

One minorish nit below.

> ---
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -95,11 +95,21 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>  
> +static inline int pmd_young(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_ACCESSED;
> +}
> +
>  static inline int pte_write(pte_t pte)
>  {
>  	return pte_flags(pte) & _PAGE_RW;
>  }
>  
> +static inline int pmd_write(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_RW;
> +}
> +
>  static inline int pte_file(pte_t pte)
>  {
>  	return pte_flags(pte) & _PAGE_FILE;
> @@ -150,6 +160,13 @@ static inline pte_t pte_set_flags(pte_t 
>  	return native_make_pte(v | set);
>  }
>  
> +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
> +{
> +	pmdval_t v = native_pmd_val(pmd);
> +
> +	return native_make_pmd(v | set);
> +}
> +
>  static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  {
>  	pteval_t v = native_pte_val(pte);
> @@ -157,6 +174,13 @@ static inline pte_t pte_clear_flags(pte_
>  	return native_make_pte(v & ~clear);
>  }
>  
> +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
> +{
> +	pmdval_t v = native_pmd_val(pmd);
> +
> +	return native_make_pmd(v & ~clear);
> +}
> +
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_DIRTY);
> @@ -167,11 +191,21 @@ static inline pte_t pte_mkold(pte_t pte)
>  	return pte_clear_flags(pte, _PAGE_ACCESSED);
>  }
>  
> +static inline pmd_t pmd_mkold(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
> +}
> +
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_RW);
> +}
> +
>  static inline pte_t pte_mkexec(pte_t pte)
>  {
>  	return pte_clear_flags(pte, _PAGE_NX);
> @@ -182,16 +216,36 @@ static inline pte_t pte_mkdirty(pte_t pt
>  	return pte_set_flags(pte, _PAGE_DIRTY);
>  }
>  
> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_DIRTY);
> +}
> +
> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_PSE);
> +}
> +
>  static inline pte_t pte_mkyoung(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_ACCESSED);
>  }
>  
> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_ACCESSED);
> +}
> +
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_RW);
> +}
> +
>  static inline pte_t pte_mkhuge(pte_t pte)
>  {
>  	return pte_set_flags(pte, _PAGE_PSE);
> @@ -320,6 +374,11 @@ static inline int pte_same(pte_t a, pte_
>  	return a.pte == b.pte;
>  }
>  
> +static inline int pmd_same(pmd_t a, pmd_t b)
> +{
> +	return a.pmd == b.pmd;
> +}
> +
>  static inline int pte_present(pte_t a)
>  {
>  	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> @@ -351,7 +410,7 @@ static inline unsigned long pmd_page_vad
>   * Currently stuck as a macro due to indirect forward reference to
>   * linux/mmzone.h's __section_mem_map_addr() definition:
>   */
> -#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> +#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
>  

Why is the masking with PTE_PFN_MASK now necessary?

>  /*
>   * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
> @@ -372,6 +431,7 @@ static inline unsigned long pmd_index(un
>   * to linux/mm.h:page_to_nid())
>   */
>  #define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
> +#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
>  
>  /*
>   * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE]
> @@ -568,14 +628,21 @@ struct vm_area_struct;
>  extern int ptep_set_access_flags(struct vm_area_struct *vma,
>  				 unsigned long address, pte_t *ptep,
>  				 pte_t entry, int dirty);
> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp,
> +				 pmd_t entry, int dirty);
>  
>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				     unsigned long addr, pte_t *ptep);
> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +				     unsigned long addr, pmd_t *pmdp);
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>  extern int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				  unsigned long address, pte_t *ptep);
> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp);
>  
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
> @@ -586,6 +653,14 @@ static inline pte_t ptep_get_and_clear(s
>  	return pte;
>  }
>  
> +static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
> +				       pmd_t *pmdp)
> +{
> +	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
> +	pmd_update(mm, addr, pmdp);
> +	return pmd;
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>  static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  					    unsigned long addr, pte_t *ptep,
> @@ -612,6 +687,16 @@ static inline void ptep_set_wrprotect(st
>  	pte_update(mm, addr, ptep);
>  }
>  
> +static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> +				      unsigned long addr, pmd_t *pmdp)
> +{
> +	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
> +	pmd_update(mm, addr, pmdp);
> +}
> +
> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> +				 unsigned long addr, pmd_t *pmdp);
> +
>  /*
>   * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
>   *
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -71,6 +71,18 @@ static inline pte_t native_ptep_get_and_
>  	return ret;
>  #endif
>  }
> +static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
> +{
> +#ifdef CONFIG_SMP
> +	return native_make_pmd(xchg(&xp->pmd, 0));
> +#else
> +	/* native_local_pmdp_get_and_clear,
> +	   but duplicated because of cyclic dependency */
> +	pmd_t ret = *xp;
> +	native_pmd_clear(NULL, 0, xp);
> +	return ret;
> +#endif
> +}
>  
>  static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -288,6 +288,23 @@ int ptep_set_access_flags(struct vm_area
>  	return changed;
>  }
>  
> +int pmdp_set_access_flags(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp,
> +			  pmd_t entry, int dirty)
> +{
> +	int changed = !pmd_same(*pmdp, entry);
> +
> +	VM_BUG_ON(address & ~HPAGE_MASK);
> +

On the use of HPAGE_MASK, did you intend to use the PMD mask? Granted,
it works out as being the same thing in this context but if there is
ever support for 1GB pages at the next page table level, it could get
confusing.

> +	if (changed && dirty) {
> +		*pmdp = entry;
> +		pmd_update_defer(vma->vm_mm, address, pmdp);
> +		flush_tlb_range(vma, address, address + HPAGE_SIZE);
> +	}
> +
> +	return changed;
> +}
> +
>  int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  			      unsigned long addr, pte_t *ptep)
>  {
> @@ -303,6 +320,21 @@ int ptep_test_and_clear_young(struct vm_
>  	return ret;
>  }
>  
> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +			      unsigned long addr, pmd_t *pmdp)
> +{
> +	int ret = 0;
> +
> +	if (pmd_young(*pmdp))
> +		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
> +					 (unsigned long *) &pmdp->pmd);
> +
> +	if (ret)
> +		pmd_update(vma->vm_mm, addr, pmdp);
> +
> +	return ret;
> +}
> +
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> @@ -315,6 +347,34 @@ int ptep_clear_flush_young(struct vm_are
>  	return young;
>  }
>  
> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +			   unsigned long address, pmd_t *pmdp)
> +{
> +	int young;
> +
> +	VM_BUG_ON(address & ~HPAGE_MASK);
> +
> +	young = pmdp_test_and_clear_young(vma, address, pmdp);
> +	if (young)
> +		flush_tlb_range(vma, address, address + HPAGE_SIZE);
> +
> +	return young;
> +}
> +
> +void pmdp_splitting_flush(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp)
> +{
> +	int set;
> +	VM_BUG_ON(address & ~HPAGE_MASK);
> +	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
> +				(unsigned long *)&pmdp->pmd);
> +	if (set) {
> +		pmd_update(vma->vm_mm, address, pmdp);
> +		/* need tlb flush only to serialize against gup-fast */
> +		flush_tlb_range(vma, address, address + HPAGE_SIZE);
> +	}
> +}
> +
>  /**
>   * reserve_top_address - reserves a hole in the top of kernel address space
>   * @reserve - size of hole to reserve
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 13 of 28] bail out gup_fast on freezed pmd
  2009-12-17 19:00 ` [PATCH 13 of 28] bail out gup_fast on freezed pmd Andrea Arcangeli
@ 2009-12-18 18:59   ` Mel Gorman
  2009-12-19 15:48     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Mel Gorman @ 2009-12-18 18:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Dec 17, 2009 at 07:00:16PM -0000, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Force gup_fast to take the slow path and block if the pmd is freezed, not only
> if it's none.
> 

What does the slow path do when the same PMD is encountered? I assume it
becomes clear later, but the set at the moment kinda requires you to
understand the entire series all at once.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -156,7 +156,7 @@ static int gup_pmd_range(pud_t pud, unsi
>  		pmd_t pmd = *pmdp;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
>  			return 0;
>  		if (unlikely(pmd_large(pmd))) {
>  			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 14 of 28] pte alloc trans splitting
  2009-12-17 19:00 ` [PATCH 14 of 28] pte alloc trans splitting Andrea Arcangeli
@ 2009-12-18 19:03   ` Mel Gorman
  2009-12-19 15:59     ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Mel Gorman @ 2009-12-18 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Dec 17, 2009 at 07:00:17PM -0000, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> pte alloc routines must wait for split_huge_page if the pmd is not
> present and not null (i.e. pmd_trans_splitting).

More stupid questions. When a large page is about to be split, you clear the
present bit to cause faults and hold those accesses until the split completes?
Again, no doubt this is obvious later, but a description in the leader of
the basic approach to splitting huge pages wouldn't hurt.

> The additional
> branches are optimized away at compile time by pmd_trans_splitting if
> the config option is off. However we must pass the vma down in order
> to know the anon_vma lock to wait for.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -948,7 +948,8 @@ static inline int __pmd_alloc(struct mm_
>  int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
>  #endif
>  
> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
> +int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> +		pmd_t *pmd, unsigned long address);
>  int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
>  
>  /*
> @@ -1017,12 +1018,14 @@ static inline void pgtable_page_dtor(str
>  	pte_unmap(pte);					\
>  } while (0)
>  
> -#define pte_alloc_map(mm, pmd, address)			\
> -	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
> -		NULL: pte_offset_map(pmd, address))
> +#define pte_alloc_map(mm, vma, pmd, address)				\
> +	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma,	\
> +							pmd, address))?	\
> +	 NULL: pte_offset_map(pmd, address))
>  
>  #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
> -	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
> +	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL,	\
> +							pmd, address))?	\
>  		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
>  
>  #define pte_alloc_kernel(pmd, address)			\
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -324,9 +324,11 @@ void free_pgtables(struct mmu_gather *tl
>  	}
>  }
>  
> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
> +int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> +		pmd_t *pmd, unsigned long address)
>  {
>  	pgtable_t new = pte_alloc_one(mm, address);
> +	int wait_split_huge_page;
>  	if (!new)
>  		return -ENOMEM;
>  
> @@ -346,14 +348,18 @@ int __pte_alloc(struct mm_struct *mm, pm
>  	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>  
>  	spin_lock(&mm->page_table_lock);
> -	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
> +	wait_split_huge_page = 0;
> +	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>  		mm->nr_ptes++;
>  		pmd_populate(mm, pmd, new);
>  		new = NULL;
> -	}
> +	} else if (unlikely(pmd_trans_splitting(*pmd)))
> +		wait_split_huge_page = 1;
>  	spin_unlock(&mm->page_table_lock);
>  	if (new)
>  		pte_free(mm, new);
> +	if (wait_split_huge_page)
> +		wait_split_huge_page(vma->anon_vma, pmd);
>  	return 0;
>  }
>  
> @@ -366,10 +372,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
>  	smp_wmb(); /* See comment in __pte_alloc */
>  
>  	spin_lock(&init_mm.page_table_lock);
> -	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
> +	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>  		pmd_populate_kernel(&init_mm, pmd, new);
>  		new = NULL;
> -	}
> +	} else
> +		VM_BUG_ON(pmd_trans_splitting(*pmd));
>  	spin_unlock(&init_mm.page_table_lock);
>  	if (new)
>  		pte_free_kernel(&init_mm, new);
> @@ -3020,7 +3027,7 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> -	pte = pte_alloc_map(mm, pmd, address);
> +	pte = pte_alloc_map(mm, vma, pmd, address);
>  	if (!pte)
>  		return VM_FAULT_OOM;
>  
> diff --git a/mm/mremap.c b/mm/mremap.c
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru
>  	return pmd;
>  }
>  
> -static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
> +static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> +			    unsigned long addr)
>  {
>  	pgd_t *pgd;
>  	pud_t *pud;
> @@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
>  	if (!pmd)
>  		return NULL;
>  
> -	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
> +	if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
>  		return NULL;
>  
>  	return pmd;
> @@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm
>  		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
>  		if (!old_pmd)
>  			continue;
> -		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
> +		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
>  		next = (new_addr + PMD_SIZE) & PMD_MASK;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 22 of 28] clear_huge_page fix
  2009-12-17 19:00 ` [PATCH 22 of 28] clear_huge_page fix Andrea Arcangeli
@ 2009-12-18 19:16   ` Mel Gorman
  0 siblings, 0 replies; 89+ messages in thread
From: Mel Gorman @ 2009-12-18 19:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Dec 17, 2009 at 07:00:25PM -0000, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> sz is in bytes, MAX_ORDER_NR_PAGES is in pages.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

While accurate, it doesn't seem to have anything to do with the set.
It should be sent separately.

> ---
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -402,7 +402,7 @@ static void clear_huge_page(struct page 
>  {
>  	int i;
>  
> -	if (unlikely(sz > MAX_ORDER_NR_PAGES)) {
> +	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
>  		clear_gigantic_page(page, addr, sz);
>  		return;
>  	}
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18 18:41           ` Dave Hansen
@ 2009-12-18 19:17             ` Mike Travis
  2009-12-18 19:28               ` Swap on flash SSDs Dave Hansen
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Travis @ 2009-12-18 19:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Ingo Molnar, Rik van Riel, Andrea Arcangeli,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton, Stephen C. Tweedie



Dave Hansen wrote:
> On Fri, 2009-12-18 at 12:28 -0600, Christoph Lameter wrote:
>> On Fri, 18 Dec 2009, Ingo Molnar wrote:
>>> Note that it became more relevant in the past few years due to the arrival of
>>> low-latency, lots-of-iops and cheap SSDs. Even on a low end server you can buy
>>> a good 160 GB SSD for emergency swap with fantastic latency and for a lot less
>>> money than 160 GB of real RAM. (which RAM wont even fit physically on typical
>>> mainboards, is much more expensive and uses up more power and is less
>>> servicable)
>> Swap occurs in page size chunks. SSDs may help but its still a desaster
>> area. You can only realistically use swap in a batch environment. It kills
>> desktop performance etc etc.
> 
> True...  Let's say it takes you down to 20% of native performance.
> There are plenty of cases where people are selling Xen or KVM slices
> where 20% of native performance is more than *fine*.  It may also let
> you have VMs that are 3x more dense than they would be able to be
> otherwise.  Yes, it kills performance, but performance isn't everything.
> 
> For many people price/performance is much more important, and swapping
> really helps the price side of that equation.
> 
> We *do* need to work on making swap more useful in a wide range of
> workloads, especially since SSDs have changed some of our assumptions
> about swap.  I just got a laptop SSD this week, and tuned swappiness so
> that I'd get some more swap activity.  Things really bogged down, so I
> *know* there's work to do there.
> 
> -- Dave

Interesting discussion about SSDs.  I was under the impression that, with
the finite number of write cycles to an SSD, unnecessary writes were
to be avoided?

	Mike

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Swap on flash SSDs
  2009-12-18 19:17             ` Mike Travis
@ 2009-12-18 19:28               ` Dave Hansen
  2009-12-18 19:38                 ` Andi Kleen
  2009-12-18 19:39                 ` Ingo Molnar
  0 siblings, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2009-12-18 19:28 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Ingo Molnar, Rik van Riel, Andrea Arcangeli,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton, Stephen C. Tweedie

On Fri, 2009-12-18 at 11:17 -0800, Mike Travis wrote:
> Interesting discussion about SSD's.  I was under the impression that with
> the finite number of write cycles to an SSD, that unnecessary writes were
> to be avoided?

I'm no expert, but my impression was that this was a problem with other
devices and with "bare" flash, and mostly when writing to the same place
over and over.

Modern, well-made flash SSDs and other flash devices have wear-leveling
built in so that they wear all of the flash cells evenly.  There's still
a discrete number of writes that they can handle over their life, but it
should be high enough that you don't notice.

http://en.wikipedia.org/wiki/Solid-state_drive

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Swap on flash SSDs
  2009-12-18 19:28               ` Swap on flash SSDs Dave Hansen
@ 2009-12-18 19:38                 ` Andi Kleen
  2009-12-18 19:39                 ` Ingo Molnar
  1 sibling, 0 replies; 89+ messages in thread
From: Andi Kleen @ 2009-12-18 19:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Travis, Christoph Lameter, Ingo Molnar, Rik van Riel,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Benjamin Herrenschmidt, KAMEZAWA Hiroyuki,
	Chris Wright, Andrew Morton, Stephen C. Tweedie

> Modern, well-made flash SSDs and other flash devices have wear-leveling
> built in so that they wear all of the flash cells evenly.  There's still
> a discrete number of writes that they can handle over their life, but it
> should be high enough that you don't notice.

The keyword is "well-made"

It depends on how much you pay for it. Don't expect that from
super cheap USB sticks. But I believe it to be true for higher
end flash, with a continuum there 
(server > highend consumer > lowend consumer >> cheap junk)

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Swap on flash SSDs
  2009-12-18 19:28               ` Swap on flash SSDs Dave Hansen
  2009-12-18 19:38                 ` Andi Kleen
@ 2009-12-18 19:39                 ` Ingo Molnar
  2009-12-18 20:13                   ` Linus Torvalds
  2009-12-19 18:38                   ` Jörn Engel
  1 sibling, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-12-18 19:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Travis, Christoph Lameter, Rik van Riel, Andrea Arcangeli,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, KAMEZAWA Hiroyuki, Chris Wright,
	Andrew Morton, Stephen C. Tweedie, Linus Torvalds


* Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Fri, 2009-12-18 at 11:17 -0800, Mike Travis wrote:
>
> > Interesting discussion about SSDs.  I was under the impression that, with
> > the finite number of write cycles to an SSD, unnecessary writes were
> > to be avoided?
> 
> I'm no expert, but my impression was that this was a problem with other 
> devices and with "bare" flash, and mostly when writing to the same place 
> over and over.
> 
> Modern, well-made flash SSDs and other flash devices have wear-leveling 
> built in so that they wear all of the flash cells evenly.  There's still a 
> discrete number of writes that they can handle over their life, but it 
> should be high enough that you don't notice.
> 
> http://en.wikipedia.org/wiki/Solid-state_drive

A quality SSD is supposed to wear out under continuous, non-stop write traffic
only after its Mean Time Between Failures. (Obviously it will take a few years
for drives to gather that kind of true physical track record - right now what
we have is the claims of manufacturers and 1-2 years of a track record.)

And even when a cell does go bad and all the spares are gone, the failure mode 
is not catastrophic like with a hard disk, but that particular cell goes 
read-only and you can still recover the info and use the remaining cells.

Sidenote: i think we should make the Linux swap code resilient against write 
IO errors of that fashion and reallocate the swap entry to a free slot. Right 
now in mm/page_io.c's end_swap_bio_write() we do this:

                /*
                 * We failed to write the page out to swap-space.
                 * Re-dirty the page in order to avoid it being reclaimed.
                 * Also print a dire warning that things will go BAD (tm)
                 * very quickly.
                 *
                 * Also clear PG_reclaim to avoid rotate_reclaimable_page()
                 */
                set_page_dirty(page);
                printk(KERN_ALERT "Write-error on swap-device (%u:%u:%Lu)\n",
                                imajor(bio->bi_bdev->bd_inode),
                                iminor(bio->bi_bdev->bd_inode),
                                (unsigned long long)bio->bi_sector);
                ClearPageReclaim(page);

We could be more intelligent than printing a scary error: we could clear that 
page from the swap map [permanently] and retry. It will still have a long-term 
failure mode when all swap pages are depleted - but that's still quite a slow 
failure mode and it is actionable via servicing.
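
Something like the following minimal, untested sketch (the
swap_slot_mark_bad() helper is hypothetical and would need to flag the
slot in the swap map so it is never handed out again; locking and
atomic-context issues in the bio completion path are ignored here):

static void retire_bad_swap_slot(struct page *page)
{
	/* the swap slot this page was being written to */
	swp_entry_t entry = { .val = page_private(page) };

	/*
	 * Permanently retire the failing slot and keep the data in RAM,
	 * so the next reclaim pass writes it to a different slot instead
	 * of just warning and re-dirtying against the same bad slot.
	 */
	swap_slot_mark_bad(entry);		/* hypothetical helper */
	delete_from_swap_cache(page);		/* drop the poisoned mapping */
	set_page_dirty(page);			/* retry via a fresh slot */
	ClearPageReclaim(page);
}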

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-17 19:00 ` [PATCH 25 of 28] transparent hugepage core Andrea Arcangeli
@ 2009-12-18 20:03   ` Mel Gorman
  2009-12-19 16:41     ` Andrea Arcangeli
  2010-01-04  6:16   ` Daisuke Nishimura
  1 sibling, 1 reply; 89+ messages in thread
From: Mel Gorman @ 2009-12-18 20:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing
> 
> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not
>    null)
> 
> 4) avoidance of reservation and maximization of use of hugepages whenever
>    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
>    1 machine with 1 database with 1 database cache with 1 database cache size
>    known at boot time. It's definitely not feasible with a virtualization
>    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
>    with an unknown size of each virtual machine with an unknown amount of
>    pagecache that could be potentially useful in the host for guest not using
>    O_DIRECT (aka cache=off).
> 
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
> 
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
> 
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split a
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
> 
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be a "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similar to madvise(MADV_MERGEABLE).
> 
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
> 
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would result always in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
> 
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage so we can't avoid it to
> solve the race of put_page against split_huge_page_refcount to achieve
> a complete hugepage feature for KVM.
> 
> Swap and oom work fine (well, just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care about a range and there's just the pmd young bit to
> check in that case.
> 
> NOTE: in some cases, if the L2 cache is small, this may slow down and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to non-temporal stores. I also extensively
> researched ways to avoid this cache trashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it and they add huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache-trashing during startup if the prefault happens in >4k
> chunks.  One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the trashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they trash
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't payoff
> substantially with todays hardware it will payoff even less in the future with
> larger l2 caches, and the prefault logic would bloat the VM a lot. If one is
> embedded, transparent_hugepage can be disabled during boot with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
> 
> This patch supports only hugepages mapped in the pmd, archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevent mixing different page sizes in the same
> regions so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be hoped to be found not fragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so requiring reservation. hugetlbfs is the "reservation way", the
> point of transparent hugepages is not to have any reservation at all and
> maximizing the use of cache and hugepages at all times automatically.
> 
> Some performance result:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
> ages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,110 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_anonymous_page(struct mm_struct *mm,
> +				  struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmd,
> +				  unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +			 struct vm_area_struct *vma);
> +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   unsigned long address, pmd_t *pmd,
> +			   pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);

The naming of "huge" might bite in the ass later if/when transparent
support is applied to multiple page sizes. Granted, it's not happening
any time soon.

> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +					  unsigned long addr,
> +					  pmd_t *pmd,
> +					  unsigned int flags);
> +extern int zap_pmd_trans_huge(struct mmu_gather *tlb,
> +			      struct vm_area_struct *vma,
> +			      pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> +	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,

Defrag is misleading. Glancing through the rest of the patch, "try harder"
would be a more appropriate term because it uses __GFP_REPEAT.

> +	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> +	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +#define transparent_hugepage_enabled(__vma)				\
> +	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
> +	 (transparent_hugepage_flags &				\
> +	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&		\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma)			       \
> +	(transparent_hugepage_flags &				       \
> +	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) ||		       \
> +	 (transparent_hugepage_flags &				       \
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&	       \
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow()				\
> +	(transparent_hugepage_flags &					\
> +	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			  pmd_t *dst_pmd, pmd_t *src_pmd,
> +			  struct vm_area_struct *vma,
> +			  unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> +			    struct vm_area_struct *vma, unsigned long address,
> +			    pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
> +				 pmd_t *pmd);
> +extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
> +extern int split_huge_page(struct page *page);
> +#define split_huge_page_mm(__mm, __addr, __pmd)				\
> +	do {								\
> +		if (unlikely(pmd_trans_huge(*(__pmd))))			\
> +			__split_huge_page_mm(__mm, __addr, __pmd);	\
> +	}  while (0)
> +#define split_huge_page_vma(__vma, __pmd)				\
> +	do {								\
> +		if (unlikely(pmd_trans_huge(*(__pmd))))			\
> +			__split_huge_page_vma(__vma, __pmd);		\
> +	}  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)				\
> +	do {								\
> +		smp_mb();						\
> +		spin_unlock_wait(&(__anon_vma)->lock);			\
> +		smp_mb();						\
> +		VM_BUG_ON(pmd_trans_splitting(*(__pmd)) ||		\
> +			  pmd_trans_huge(*(__pmd)));			\
> +	} while (0)
> +#define HPAGE_ORDER (HPAGE_SHIFT-PAGE_SHIFT)
> +#define HPAGE_NR (1<<HPAGE_ORDER)
> +
> +enum page_check_address_pmd_flag {
> +	PAGE_CHECK_ADDRESS_PMD_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> +				     struct mm_struct *mm,
> +				     unsigned long address,
> +				     enum page_check_address_pmd_flag flag);
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> +	return 0;
> +}
> +#define split_huge_page_mm(__mm, __addr, __pmd)	\
> +	do { }  while (0)
> +#define split_huge_page_vma(__vma, __pmd)	\
> +	do { }  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)	\
> +	do { } while (0)
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -234,6 +234,7 @@ struct inode;
>   * files which need it (119 of them)
>   */
>  #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>  
>  /*
>   * Methods to modify the page usage count.
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c

Similar on naming. Later someone will get confused as to why there is
hugetlbfs and huge_memory.

> @@ -0,0 +1,792 @@
> +/*
> + *  Copyright (C) 2009  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> +	(1<<TRANSPARENT_HUGEPAGE_FLAG)|(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag enabled,
> +				enum transparent_hugepage_flag req_madv)
> +{
> +	if (test_bit(enabled, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(req_madv, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag enabled,
> +				 enum transparent_hugepage_flag req_madv)
> +{
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		set_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_FLAG,
> +				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_FLAG,
> +				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t defrag_show(struct kobject *kobj,
> +			   struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> +	__ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag flag)
> +{
> +	if (test_bit(flag, &transparent_hugepage_flags))
> +		return sprintf(buf, "[yes] no\n");
> +	else
> +		return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag flag)
> +{
> +	if (!memcmp("yes", buf,
> +		    min(sizeof("yes")-1, count))) {
> +		set_bit(flag, &transparent_hugepage_flags);
> +	} else if (!memcmp("no", buf,
> +			   min(sizeof("no")-1, count))) {
> +		clear_bit(flag, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return single_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> +			       struct kobj_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	return single_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> +	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> +	&enabled_attr.attr,
> +	&defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> +	&debug_cow_attr.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> +	.attrs = hugepage_attr,
> +	.name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init ksm_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int err;
> +
> +	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> +	if (err)
> +		printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> +	return 0;
> +}
> +module_init(ksm_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> +	if (!str)
> +		return 0;
> +	transparent_hugepage_flags = simple_strtoul(str, &str, 0);
> +	return 1;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> +				 struct mm_struct *mm)
> +{
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	if (!mm->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> +	mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	return pmd;
> +}
> +
> +static int __do_huge_anonymous_page(struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
> +				    unsigned long address, pmd_t *pmd,
> +				    struct page *page,

Maybe this should be do_pmd_anonymous_page and match what do_anonymous_page
does as much as possible. This might head off any future problems related to
transparently handling pages at other page table levels.

> +				    unsigned long haddr)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, address);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_NR);
> +

Ideally, instead of defining things like HPAGE_NR, the existing functions for
multiple huge page sizes would be extended to return the "huge page size
corresponding to a PMD".

> +	__SetPageUptodate(page);
> +	smp_wmb();
> +

Need to explain why smp_wmb() is needed there. It doesn't look like
you're protecting the bit set itself. More likely you are making sure
the writes in clear_huge_page() have finished but that's a guess.
Comment.
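
For example, something along these lines (only a guess at the intended
ordering, matching the interpretation above):

	__SetPageUptodate(page);
	/*
	 * Make the stores done by clear_huge_page() and the Uptodate bit
	 * visible before the pmd is populated below, so another CPU
	 * cannot observe the mapping while the contents are still in
	 * flight.
	 */
	smp_wmb();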

> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		put_page(page);
> +		pte_free(mm, pgtable);

Racing fault already filled in the PTE? If so, comment please. Again,
matching how do_anonymous_page() does a similar job would help
comprehension.

> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +	
> +	return ret;
> +}
> +
> +int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   unsigned long address, pmd_t *pmd,
> +			   unsigned int flags)
> +{
> +	struct page *page;
> +	unsigned long haddr = address & HPAGE_MASK;
> +	pte_t *pte;
> +
> +	if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
> +				   (transparent_hugepage_defrag(vma) ?

GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable
is set in /proc/sys/vm. This should be GFP_HIGHUSER only.
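
Something like this, perhaps (illustrative only; it assumes the
hugepages_treat_as_movable sysctl variable is visible from this file,
and the helper name is made up):

static inline gfp_t transparent_hugepage_gfpmask(struct vm_area_struct *vma)
{
	/* base mask: highmem ok, compound page, no allocation-failure spam */
	gfp_t gfp_mask = GFP_HIGHUSER | __GFP_COMP | __GFP_NOWARN;

	/* only claim movability when the admin opted in, as hugetlbfs does */
	if (hugepages_treat_as_movable)
		gfp_mask |= __GFP_MOVABLE;
	/* "defrag" currently just means: retry harder */
	if (transparent_hugepage_defrag(vma))
		gfp_mask |= __GFP_REPEAT;
	return gfp_mask;
}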

> +				    __GFP_REPEAT : 0)|__GFP_NOWARN,
> +				   HPAGE_ORDER);
> +		if (unlikely(!page))
> +			goto out;
> +
> +		return __do_huge_anonymous_page(mm, vma,
> +						address, pmd,
> +						page, haddr);
> +	}
> +out:
> +	pte = pte_alloc_map(mm, vma, pmd, address);
> +	if (!pte)
> +		return VM_FAULT_OOM;
> +	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct page *src_page;
> +	pmd_t pmd;
> +	pgtable_t pgtable;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	pgtable = pte_alloc_one(dst_mm, addr);
> +	if (unlikely(!pgtable))
> +		goto out;
> +
> +	spin_lock(&dst_mm->page_table_lock);
> +	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> +	ret = -EAGAIN;
> +	pmd = *src_pmd;
> +	if (unlikely(!pmd_trans_huge(pmd)))
> +		goto out_unlock;
> +	if (unlikely(pmd_trans_splitting(pmd))) {
> +		/* split huge page running from under us */
> +		spin_unlock(&src_mm->page_table_lock);
> +		spin_unlock(&dst_mm->page_table_lock);
> +
> +		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> +		goto out;
> +	}
> +	src_page = pmd_pgtable(pmd);
> +	VM_BUG_ON(!PageHead(src_page));
> +	get_page(src_page);
> +	page_dup_rmap(src_page);
> +	add_mm_counter(dst_mm, anon_rss, HPAGE_NR);
> +
> +	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +	prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> +	ret = 0;
> +out_unlock:
> +	spin_unlock(&src_mm->page_table_lock);
> +	spin_unlock(&dst_mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	pgtable = mm->pmd_huge_pte;
> +	if (list_empty(&pgtable->lru))
> +		mm->pmd_huge_pte = NULL; /* debug */
> +	else {
> +		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> +					      struct page, lru);
> +		list_del(&pgtable->lru);
> +	}
> +	return pgtable;
> +}
> +
> +int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +		    unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> +	int ret = 0, i;
> +	struct page *page, *new_page;
> +	unsigned long haddr;
> +	struct page **pages;
> +
> +	VM_BUG_ON(!vma->anon_vma);
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_unlock;
> +
> +	page = pmd_pgtable(orig_pmd);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	haddr = address & HPAGE_MASK;
> +	if (page_mapcount(page) == 1) {
> +		pmd_t entry;
> +		entry = pmd_mkyoung(orig_pmd);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
> +			update_mmu_cache(vma, address, entry);
> +		ret |= VM_FAULT_WRITE;
> +		goto out_unlock;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	new_page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
> +			       (transparent_hugepage_defrag(vma) ?
> +				__GFP_REPEAT : 0)|__GFP_NOWARN,
> +			       HPAGE_ORDER);
> +	if (transparent_hugepage_debug_cow() && new_page) {
> +		put_page(new_page);
> +		new_page = NULL;
> +	}
> +	if (unlikely(!new_page)) {

This entire block needs to be in a demote_pmd_page() or something similar.
It's on the hefty side for being in the main function. That said, I
didn't spot anything wrong in there either.

> +		pgtable_t pgtable;
> +		pmd_t _pmd;
> +
> +		pages = kzalloc(sizeof(struct page *) * HPAGE_NR,
> +				GFP_KERNEL);
> +		if (unlikely(!pages)) {
> +			ret |= VM_FAULT_OOM;
> +			goto out;
> +		}
> +		
> +		for (i = 0; i < HPAGE_NR; i++) {
> +			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> +						  vma, address);
> +			if (unlikely(!pages[i])) {
> +				while (--i >= 0)
> +					put_page(pages[i]);
> +				kfree(pages);
> +				ret |= VM_FAULT_OOM;
> +				goto out;
> +			}
> +		}
> +
> +		spin_lock(&mm->page_table_lock);
> +		if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +			goto out_free_pages;
> +		else
> +			get_page(page);
> +		spin_unlock(&mm->page_table_lock);
> +
> +		might_sleep();

Is this check really necessary? We could already have gone to sleep more
easily when allocating the pages above.

> +		for (i = 0; i < HPAGE_NR; i++) {
> +			copy_user_highpage(pages[i], page + i,

More nasty naming there. It needs to be clearer that "pages" holds your
demoted base pages and "page" is the existing compound page.

> +					   haddr + PAGE_SHIFT*i, vma);
> +			__SetPageUptodate(pages[i]);
> +			cond_resched();
> +		}
> +
> +		spin_lock(&mm->page_table_lock);
> +		if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +			goto out_free_pages;
> +		else
> +			put_page(page);
> +
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		/* leave pmd empty until pte is filled */
> +
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0; i < HPAGE_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			entry = mk_pte(pages[i], vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			page_add_new_anon_rmap(pages[i], vma, haddr);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			VM_BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +		kfree(pages);
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pgtable);
> +		spin_unlock(&mm->page_table_lock);
> +
> +		ret |= VM_FAULT_WRITE;
> +		page_remove_rmap(page);
> +		put_page(page);
> +		goto out;
> +	}
> +
> +	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);
> +	__SetPageUptodate(new_page);
> +
> +	smp_wmb();
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		put_page(new_page);
> +	else {
> +		pmd_t entry;
> +		entry = mk_pmd(new_page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		page_add_new_anon_rmap(new_page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		update_mmu_cache(vma, address, entry);
> +		page_remove_rmap(page);
> +		put_page(page);
> +		ret |= VM_FAULT_WRITE;
> +	}
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +out:
> +	return ret;
> +
> +out_free_pages:
> +	for (i = 0; i < HPAGE_NR; i++)
> +		put_page(pages[i]);
> +	kfree(pages);
> +	goto out_unlock;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +				   unsigned long addr,
> +				   pmd_t *pmd,
> +				   unsigned int flags)
> +{
> +	struct page *page = NULL;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	if (flags & FOLL_WRITE && !pmd_write(*pmd))
> +		goto out;
> +
> +	page = pmd_pgtable(*pmd);
> +	VM_BUG_ON(!PageHead(page));
> +	if (flags & FOLL_TOUCH) {
> +		pmd_t _pmd;
> +		/*
> +		 * We should set the dirty bit only for FOLL_WRITE but
> +		 * for now the dirty bit in the pmd is meaningless.
> +		 * And if the dirty bit will become meaningful and
> +		 * we'll only set it with FOLL_WRITE, an atomic
> +		 * set_bit will be required on the pmd to set the
> +		 * young bit, instead of the current set_pmd_at.
> +		 */
> +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> +		set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd);
> +	}
> +	page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT;

More HPAGE vs PMD here.
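
For instance (purely illustrative naming, along the same lines as the
HPAGE_NR remark earlier):

#define HPAGE_PMD_SHIFT	PMD_SHIFT
#define HPAGE_PMD_SIZE	(1UL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
#define HPAGE_PMD_NR	(1UL << (HPAGE_PMD_SHIFT - PAGE_SHIFT))

so that the pmd-specific paths read "addr & ~HPAGE_PMD_MASK" and leave
HPAGE_* free for whatever hugetlbfs means by it on a given arch.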

> +	VM_BUG_ON(!PageCompound(page));
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +
> +out:
> +	return page;
> +}
> +
> +int zap_pmd_trans_huge(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		       pmd_t *pmd)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&tlb->mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma,
> +					     pmd);
> +		} else {
> +			struct page *page;
> +			pgtable_t pgtable;
> +			pgtable = get_pmd_huge_pte(tlb->mm);
> +			page = pfn_to_page(pmd_pfn(*pmd));
> +			VM_BUG_ON(!PageCompound(page));
> +			pmd_clear(pmd);
> +			page_remove_rmap(page);
> +			VM_BUG_ON(page_mapcount(page) < 0);
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			add_mm_counter(tlb->mm, anon_rss, -HPAGE_NR);
> +			tlb_remove_page(tlb, page);
> +			pte_free(tlb->mm, pgtable);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&tlb->mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> +			      struct mm_struct *mm,
> +			      unsigned long address,
> +			      enum page_check_address_pmd_flag flag)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd, *ret = NULL;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> +		  pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd) && pmd_pgtable(*pmd) == page) {
> +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> +			  !pmd_trans_splitting(*pmd));
> +		ret = pmd;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> +				       struct vm_area_struct *vma,
> +				       unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd;
> +	int ret = 0;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> +	if (pmd) {
> +		/*
> +		 * We can't temporarily set the pmd to null in order
> +		 * to split it, pmd_huge must remain on at all times.
> +		 */

Why, to avoid a double fault? Or to avoid a case where the huge page is
being split, another fault occurs and zero-filled pages get faulted in?

I'm afraid I ran out of time at this point. It'll be after the holidays
before I get time for a proper go at it. Sorry.

> +		pmdp_splitting_flush_notify(vma, address, pmd);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> +	int i;
> +	unsigned long head_index = page->index;
> +
> +	compound_lock(page);
> +
> +	for (i = 1; i < HPAGE_NR; i++) {
> +		struct page *page_tail = page + i;
> +
> +		/* tail_page->_count cannot change */
> +		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> +		BUG_ON(page_count(page) <= 0);
> +		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> +		BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> +		/* after clearing PageTail the gup refcount can be released */
> +		smp_mb();
> +
> +		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		page_tail->flags |= (page->flags &
> +				     ((1L << PG_referenced) |
> +				      (1L << PG_swapbacked) |
> +				      (1L << PG_mlocked) |
> +				      (1L << PG_uptodate)));
> +		page_tail->flags |= (1L << PG_dirty);
> +
> +		/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> +
> +		BUG_ON(page_mapcount(page_tail));
> +		page_tail->_mapcount = page->_mapcount;
> +		BUG_ON(page_tail->mapping);
> +		page_tail->mapping = page->mapping;
> +		page_tail->index = ++head_index;
> +		BUG_ON(!PageAnon(page_tail));
> +		BUG_ON(!PageUptodate(page_tail));
> +		BUG_ON(!PageDirty(page_tail));
> +		BUG_ON(!PageSwapBacked(page_tail));
> +
> +		if (page_evictable(page_tail, NULL))
> +			lru_cache_add_lru(page_tail, LRU_ACTIVE_ANON);
> +		else
> +			add_page_to_unevictable_list(page_tail);
> +		put_page(page_tail);
> +	}
> +
> +	ClearPageCompound(page);
> +	compound_unlock(page);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> +				 struct vm_area_struct *vma,
> +				 unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd, _pmd;
> +	int ret = 0, i;
> +	pgtable_t pgtable;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> +	if (pmd) {
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0, haddr = address; i < HPAGE_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!pmd_write(*pmd))
> +				entry = pte_wrprotect(entry);
> +			else
> +				BUG_ON(page_mapcount(page) != 1);
> +			if (!pmd_young(*pmd))
> +				entry = pte_mkold(entry);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pgtable);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +/* must be called with anon_vma->lock hold */
> +static void __split_huge_page(struct page *page,
> +			      struct anon_vma *anon_vma)
> +{
> +	int mapcount, mapcount2;
> +	struct vm_area_struct *vma;
> +
> +	BUG_ON(!PageHead(page));
> +
> +	mapcount = 0;
> +	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount += __split_huge_page_splitting(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != page_mapcount(page));
> +
> +	__split_huge_page_refcount(page);
> +
> +	mapcount2 = 0;
> +	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount2 += __split_huge_page_map(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != mapcount2);
> +}
> +
> +/* must run with mmap_sem to prevent vma to go away */
> +void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
> +{
> +	struct page *page;
> +	struct anon_vma *anon_vma;
> +	struct mm_struct *mm;
> +
> +	BUG_ON(vma->vm_flags & VM_HUGETLB);
> +
> +	mm = vma->vm_mm;
> +	BUG_ON(down_write_trylock(&mm->mmap_sem));
> +
> +	anon_vma = vma->anon_vma;
> +
> +	spin_lock(&anon_vma->lock);
> +	BUG_ON(pmd_trans_splitting(*pmd));
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		spin_unlock(&anon_vma->lock);
> +		return;
> +	}
> +	page = pmd_pgtable(*pmd);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	__split_huge_page(page, anon_vma);
> +
> +	spin_unlock(&anon_vma->lock);
> +	BUG_ON(pmd_trans_huge(*pmd));
> +}
> +
> +/* must run with mmap_sem to prevent vma to go away */
> +void __split_huge_page_mm(struct mm_struct *mm,
> +			  unsigned long address,
> +			  pmd_t *pmd)
> +{
> +	struct vm_area_struct *vma;
> +
> +	vma = find_vma(mm, address);
> +	BUG_ON(vma->vm_start > address);
> +	BUG_ON(vma->vm_mm != mm);
> +
> +	__split_huge_page_vma(vma, pmd);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	int ret = 1;
> +
> +	BUG_ON(!PageAnon(page));
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		goto out;
> +	ret = 0;
> +	if (!PageCompound(page))
> +		goto out_unlock;
> +
> + 	BUG_ON(!PageSwapBacked(page));
> +	__split_huge_page(page, anon_vma);
> +
> +	BUG_ON(PageCompound(page));
> +out_unlock:
> +	page_unlock_anon_vma(anon_vma);
> +out:
> +	return ret;
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -647,9 +647,9 @@ out_set_pte:
>  	return 0;
>  }
>  
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> -		unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> +		   unsigned long addr, unsigned long end)
>  {
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
> @@ -722,6 +722,16 @@ static inline int copy_pmd_range(struct 
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*src_pmd)) {
> +			int err;
> +			err = copy_huge_pmd(dst_mm, src_mm,
> +					    dst_pmd, src_pmd, addr, vma);
> +			if (err == -ENOMEM)
> +				return -ENOMEM;
> +			if (!err)
> +				continue;
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -918,6 +928,15 @@ static inline unsigned long zap_pmd_rang
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*pmd)) {
> +			if (next-addr != HPAGE_SIZE)
> +				split_huge_page_vma(vma, pmd);
> +			else if (zap_pmd_trans_huge(tlb, vma, pmd)) {
> +				(*zap_work)--;
> +				continue;
> +			}
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(pmd)) {
>  			(*zap_work)--;
>  			continue;
> @@ -1185,11 +1204,27 @@ struct page *follow_page(struct vm_area_
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
> -	if (pmd_huge(*pmd)) {
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
>  		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>  		goto out;
>  	}
> +	if (pmd_trans_huge(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		if (likely(pmd_trans_huge(*pmd))) {
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				page = follow_trans_huge_pmd(mm, address,
> +							     pmd, flags);
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +		/* fall through */
> +	}
>  	if (unlikely(pmd_bad(*pmd)))
>  		goto no_page_table;
>  
> @@ -1298,6 +1333,7 @@ int __get_user_pages(struct task_struct 
>  			pmd = pmd_offset(pud, pg);
>  			if (pmd_none(*pmd))
>  				return i ? : -EFAULT;
> +			VM_BUG_ON(pmd_trans_huge(*pmd));
>  			pte = pte_offset_map(pmd, pg);
>  			if (pte_none(*pte)) {
>  				pte_unmap(pte);
> @@ -2949,9 +2985,9 @@ static int do_nonlinear_fault(struct mm_
>   * but allow concurrent faults), and pte mapped but not yet locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> +		     struct vm_area_struct *vma, unsigned long address,
> +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
>  {
>  	pte_t entry;
>  	spinlock_t *ptl;
> @@ -3027,6 +3063,22 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		if (!vma->vm_ops)
> +			return do_huge_anonymous_page(mm, vma, address,
> +						      pmd, flags);
> +	} else {
> +		pmd_t orig_pmd = *pmd;
> +		barrier();
> +		if (pmd_trans_huge(orig_pmd)) {
> +			if (flags & FAULT_FLAG_WRITE &&
> +			    !pmd_write(orig_pmd) &&
> +			    !pmd_trans_splitting(orig_pmd))
> +				return do_huge_wp_page(mm, vma, address,
> +						       pmd, orig_pmd);
> +			return 0;
> +		}
> +	}
>  	pte = pte_alloc_map(mm, vma, pmd, address);
>  	if (!pte)
>  		return VM_FAULT_OOM;
> @@ -3167,6 +3219,7 @@ static int follow_pte(struct mm_struct *
>  		goto out;
>  
>  	pmd = pmd_offset(pud, address);
> +	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>  		goto out;
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -56,6 +56,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/migrate.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -229,7 +230,7 @@ void page_unlock_anon_vma(struct anon_vm
>   * Returns virtual address or -EFAULT if page's index/offset is not
>   * within the range mapped the @vma.
>   */
> -static inline unsigned long
> +inline unsigned long
>  vma_address(struct page *page, struct vm_area_struct *vma)
>  {
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -343,35 +344,17 @@ int page_referenced_one(struct page *pag
>  			unsigned long *vm_flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *pte;
> -	spinlock_t *ptl;
>  	int referenced = 0;
>  
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
>  	/*
>  	 * Don't want to elevate referenced for mlocked page that gets this far,
>  	 * in order that it progresses to try_to_unmap and is moved to the
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 0;	/* break early from loop */
>  		*vm_flags |= VM_LOCKED;
> -		goto out_unmap;
> -	}
> -
> -	if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -		/*
> -		 * Don't treat a reference through a sequentially read
> -		 * mapping as such.  If the page has been used in
> -		 * another mapping, we will catch it; if this other
> -		 * mapping is already gone, the unmap path will have
> -		 * set PG_referenced or activated the page.
> -		 */
> -		if (likely(!VM_SequentialReadHint(vma)))
> -			referenced++;
> +		goto out;
>  	}
>  
>  	/* Pretend the page is referenced if the task has the
> @@ -380,9 +363,43 @@ int page_referenced_one(struct page *pag
>  			rwsem_is_locked(&mm->mmap_sem))
>  		referenced++;
>  
> -out_unmap:
> +	if (unlikely(PageCompound(page))) {
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		pmd_t *pmd;
> +
> +		spin_lock(&mm->page_table_lock);
> +		pmd = page_check_address_pmd(page, mm, address,
> +					     PAGE_CHECK_ADDRESS_PMD_FLAG);
> +		if (pmd && !pmd_trans_splitting(*pmd) &&
> +		    pmdp_clear_flush_young_notify(vma, address, pmd))
> +			referenced++;
> +		spin_unlock(&mm->page_table_lock);
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +		VM_BUG_ON(1);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +	} else {
> +		pte_t *pte;
> +		spinlock_t *ptl;
> +
> +		pte = page_check_address(page, mm, address, &ptl, 0);
> +		if (!pte)
> +			goto out;
> +
> +		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> +			/*
> +			 * Don't treat a reference through a sequentially read
> +			 * mapping as such.  If the page has been used in
> +			 * another mapping, we will catch it; if this other
> +			 * mapping is already gone, the unmap path will have
> +			 * set PG_referenced or activated the page.
> +			 */
> +			if (likely(!VM_SequentialReadHint(vma)))
> +				referenced++;
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +
>  	(*mapcount)--;
> -	pte_unmap_unlock(pte, ptl);
>  
>  	if (referenced)
>  		*vm_flags |= vma->vm_flags;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Swap on flash SSDs
  2009-12-18 19:39                 ` Ingo Molnar
@ 2009-12-18 20:13                   ` Linus Torvalds
  2009-12-18 20:31                     ` Ingo Molnar
  2009-12-19 18:38                   ` Jörn Engel
  1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2009-12-18 20:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Mike Travis, Christoph Lameter, Rik van Riel,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Benjamin Herrenschmidt, KAMEZAWA Hiroyuki,
	Chris Wright, Andrew Morton, Stephen C. Tweedie



On Fri, 18 Dec 2009, Ingo Molnar wrote:
> 
> And even when a cell does go bad and all the spares are gone, the failure mode 
> is not catastrophic like with a hard disk, but that particular cell goes 
> read-only and you can still recover the info and use the remaining cells.

Maybe. The real issue is the flash firmware. You want to bet it hasn't 
been tested very well against wear-related failures in real life? 

Once the flash firmware gets confused due to some bug, the end result is 
usually a totally dead device.

So failure modes can easily be pretty damn catastrophic. Not that that is 
in any way specific to flash (the failures I've seen on rotational disks 
have been generally catastrophic too - people who malign flashes for some 
reason don't seem to admit that rotational media tends to have all the 
same problems and then some).

		Linus


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Swap on flash SSDs
  2009-12-18 20:13                   ` Linus Torvalds
@ 2009-12-18 20:31                     ` Ingo Molnar
  0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-12-18 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Mike Travis, Christoph Lameter, Rik van Riel,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Benjamin Herrenschmidt, KAMEZAWA Hiroyuki,
	Chris Wright, Andrew Morton, Stephen C. Tweedie


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 18 Dec 2009, Ingo Molnar wrote:
> > 
> > And even when a cell does go bad and all the spares are gone, the failure 
> > mode is not catastrophic like with a hard disk, but that particular cell 
> > goes read-only and you can still recover the info and use the remaining 
> > cells.
> 
> Maybe. The real issue is the flash firmware. You want to bet it hasn't been 
> tested very well against wear-related failures in real life?

I certainly don't want to bet anything on technology that is just a few years 
old :-) I have an SSD, and I keep backups.

( Okay, I have to admit that I have a weakness for certain types of unproven 
  technology, such as toy kernels that are just a hobby ;-)

> Once the flash firmware gets confused due to some bug, the end result is 
> usually a totally dead device.
> 
> So failure modes can easily be pretty damn catastrophic. Not that that is in 
> any way specific to flash (the failures I've seen on rotational disks have 
> been generally catastrophic too - people who malign flashes for some reason 
> don't seem to admit that rotational media tends to have all the same 
> problems and then some).

There's also electronics failure that could occur. Plus physical damage.

But at least data recovery does not need a clean room ;) [ If the cells are
still undamaged, if it wasn't a lightning strike, an earthquake or a two year
old that damaged them, then an identical model can be used for recovery. ]

( If the data matters. If it doesn't, then nobody will care about anything but
  everyday usability, a lifetime of at least a few months, plus price,
  performance/latency and maybe shock resistance. )

	Ingo


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-18 16:04     ` Andrea Arcangeli
@ 2009-12-18 23:06       ` KAMEZAWA Hiroyuki
  2009-12-20 18:39         ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-18 23:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

Andrea Arcangeli wrote:
> On Fri, Dec 18, 2009 at 10:33:12AM +0900, KAMEZAWA Hiroyuki wrote:
>> Then, maybe we (I?) should cut this part (and some from 27/28) out and
>> merge into memcg. It will be helpful to all your work.
>
> You can't merge this part, huge_memory.c is not there yet. But you
> should merge 27/28 instead, that one is self contained.
>
>> But I don't like a situation which memcg's charge are filled with
>> _locked_ memory.
>
> There's no locked memory here. It's all swappable.
>
Ok, I missed that.

My intention was that adding a patch which adds a "pagesize" parameter
to the charge/uncharge functions may be able to reduce the size of the changes.
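To make the idea concrete, a purely hypothetical sketch of such an interface
(the function name and the extra argument below are invented for illustration
and are not existing memcg API):

/*
 * Hypothetical charge entry point taking the size being charged, so a
 * transparent hugepage caller could charge HPAGE_SIZE in one call
 * instead of spreading memcg-specific changes through the hugepage
 * paths.
 */
int mem_cgroup_charge_sized(struct page *page, struct mm_struct *mm,
                            gfp_t gfp_mask, unsigned long charge_size);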

>> (Especially, bad-configured softlimit+hugepage will adds much
>> regression.)
>> New counter as "usage of huge page" will be required for memcg, at
>> least.
>
> no, hugepages are fully transparent and userland can't possibly know
> if it's running on hugepages or regular pages. The only difference is
> in userland going faster, everything else is identical so there's no
> need of any other memcg.
>
I read your patch again.
Thanks,
-Kame


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18 18:33         ` Christoph Lameter
@ 2009-12-19 15:09           ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 15:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rik van Riel, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton

On Fri, Dec 18, 2009 at 12:33:36PM -0600, Christoph Lameter wrote:
> On Fri, 18 Dec 2009, Andrea Arcangeli wrote:
> 
> > On Thu, Dec 17, 2009 at 02:09:47PM -0600, Christoph Lameter wrote:
> > > Can we do this step by step? This splitting thing and its
> > > associated overhead causes me concerns.
> >
> > The split_huge_page* functionality whole point is exactly to do things
> > step by step. Removing it would mean doing it all at once.
> 
> The split huge page thing involved introducing new refcounting and locking
> features into the VM. Not a first step thing. And certainly difficult to
> verify if it is correct.

I can explain how it works, no problem. I already did with Marcelo, who
also audited my change to put_page.

> > This is like the big kernel lock when SMP initially was
> > introduced. Surely kernel would have been a little faster if the big
> > kernel lock was never introduced but over time the split_huge_page can
> > be removed just like the big kernel lock has been removed. Then the
> > PG_compound_lock can go away too.
> 
> That is a pretty strange comparison. Split huge page is like introducing
> the split pte lock after removing the bkl. You first want to solve the
> simpler issues (anon huge) and then see if there is a way to avoid
> introducing new locking methods.

I can't follow your comparison... The reasoning behind my comparison is
very simple: we can't put spinlocks everywhere and expect the kernel
to become threaded as a whole overnight. But we can put in a BKL
(split_huge_page) that takes care of all the not-yet-threaded (i.e. not
yet hugepage-aware) code paths that can't be converted overnight (swap,
and all the rest of mm/*.c) while we start threading file by file. First
the scheduler (malloc/free) and then the rest... Removing split_huge_page
later, if needed, is simple.

> > scalable. In the future mmu notifier users that calls gup will stop
> > using FOLL_GET and in turn they will stop calling put_page, so
> > eliminating any need to take the PG_compound_lock in all KVM fast paths.
> 
> Maybe do that first then and never introduce the lock in the first place?

It's not feasible, as I documented in previous emails. Removing
FOLL_GET surely would remove the need for all the refcounting changes
in put_page for tail pages, and it would remove the need for
PG_compound_lock, but this only works for gup users that are
registered with the mmu notifier! O_DIRECT will never be able to use
the mmu notifier because it does DMA, and we can't interrupt DMA in
the middle from the mmu notifier invalidate handler.

To say it another way: the only way to remove the PG_compound_lock
used by put_page _only_ when called on PageTail pages, is to force
anybody calling gup to be registered with the mmu notifier and to
support interrupting any access to the physical page returned by gup
before returning from mmu_notifier_invalidate*.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 00 of 28] Transparent Hugepage support #2
  2009-12-18 18:47 ` Dave Hansen
@ 2009-12-19 15:20   ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 15:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Fri, Dec 18, 2009 at 10:47:29AM -0800, Dave Hansen wrote:
> For what it's worth, I went trying to do some of this a few months ago
> to see how feasible it was.  I ended up doing a bunch of the same stuff
> like having the preallocated pte_page() hanging off the mm.  I think I
> tied directly into the pte_offset_*() functions instead of introducing
> new ones, but the concept was the same: as much as possible *don't*
> teach the VM about huge pages, split them.

Obviously I agree ;). At the same time I also agree with Christoph
about the long term: in the future we want more and more code paths to
be hugepage aware, and even swap in 2M chunks, but I think those things
should happen incrementally over time, just like the kernel didn't
become multithreaded overnight. And if one uses "echo madvise
>enabled" one can already be 99% sure of never running into
split_huge_page (actually 100% sure after swapoff -a), so this greatly
simplified approach already provides 100% of the benefit, for example
to the KVM hypervisor, where NPT/EPT definitely require hugepages. On
the host, hugepages are a significant but not so mandatory improvement,
and in turn only very few apps go through the pain of using the
hugetlbfs API or libhugetlbfs, but NPT/EPT explodes the benefit and
makes it a requirement to use hugepages _always_ and to make sure all
guest physical pages are mapped with NPT/EPT pmds.

> I ended up getting hung up on some of the PMD locking, and I think using
> the PMD bit like that is a fine solution.  The way these are split up
> also looks good to me.

Yep, please review whether it's ok that the page remains mapped in
userland during the split (see __split_huge_page_splitting). In the
previous patchset I cleared the present bit in the pmd (which provided
the same information, as no pmd could ever be not-present and not-null
before). But that stopped userland accesses as well during the split,
which Avi said is not required, and I agreed.

> Except for some of the stuff in put_compound_page(), these look pretty
> sane to me in general.  I'll go through them in more detail after the
> holidays.

Thanks a lot!!
Andrea


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 10 of 28] add pmd mangling functions to x86
  2009-12-18 18:56   ` Mel Gorman
@ 2009-12-19 15:27     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Fri, Dec 18, 2009 at 06:56:02PM +0000, Mel Gorman wrote:
> (As a side-note, I am going off-line until after the new years fairly soon.
> I'm not doing a proper review at the moment, just taking a first read to
> see what's here. Sorry I didn't get a chance to read V1)

Not reading v1 means less wasted time for you, as I did a lot of
polishing as a result of the previous reviews, so no problem ;).

> On Thu, Dec 17, 2009 at 07:00:13PM -0000, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Add needed pmd mangling functions with simmetry with their pte counterparts.
> 
> Silly question, this assumes the bits used in the PTE are not being used in
> the PMD for something else, right? Is that guaranteed to be safe? According
> to the AMD manual, it's fine but is it typically true on other architectures?

I welcome people to double check against the Intel/AMD manuals, but
it's not like I added those functions blindly hoping they would work
;). Luckily there's no Intel/AMD difference here, because this stuff
even works on 32-bit dinosaurs, ever since PSE was added.

> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -288,6 +288,23 @@ int ptep_set_access_flags(struct vm_area
> >  	return changed;
> >  }
> >  
> > +int pmdp_set_access_flags(struct vm_area_struct *vma,
> > +			  unsigned long address, pmd_t *pmdp,
> > +			  pmd_t entry, int dirty)
> > +{
> > +	int changed = !pmd_same(*pmdp, entry);
> > +
> > +	VM_BUG_ON(address & ~HPAGE_MASK);
> > +
> 
> On the use of HPAGE_MASK, did you intend to use the PMD mask? Granted,
> it works out as being the same thing in this context but if there is
> ever support for 1GB pages at the next page table level, it could get
> confusing.

That's a very good question, but it's not only about the above. I've
always been undecided whether to use HPAGE_MASK or the pmd mask. I've
no clue which is better. I think as long as I use HPAGE_MASK all over
huge_memory.c, this also should be an HPAGE_MASK. If we were to support
more levels (something not feasible with 1G, as the whole point of this
feature is to be transparent, and transparently allocated 1G pages will
never come, even 2M is hard) I would expect all those HPAGE_MASK to be
replaced by something else.

If people think I should drop HPAGE_MASK as a whole and replace it
with PMD-based masks, let me know.. doing it only above and leaving
HPAGE_MASK elsewhere is a no-go IMHO. Personally I find HPAGE_MASK
more intuitive, but it clearly matches the pmd mask, as the hugepage
gets mapped by a pmd entry ;).
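For illustration only (values assumed for x86 with 2M huge pmds, not taken
from the patch), the two spellings denote the same mask, which is why this
really is a pure naming question:

#define EXAMPLE_PMD_SHIFT       21      /* 2M huge pmd on x86 */
#define EXAMPLE_HPAGE_SIZE      (1UL << EXAMPLE_PMD_SHIFT)
#define EXAMPLE_HPAGE_MASK      (~(EXAMPLE_HPAGE_SIZE - 1))     /* same value as PMD_MASK */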


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 13 of 28] bail out gup_fast on freezed pmd
  2009-12-18 18:59   ` Mel Gorman
@ 2009-12-19 15:48     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 15:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Fri, Dec 18, 2009 at 06:59:34PM +0000, Mel Gorman wrote:
> On Thu, Dec 17, 2009 at 07:00:16PM -0000, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Force gup_fast to take the slow path and block if the pmd is freezed, not only
> > if it's none.
> > 
> 
> What does the slow path do when the same PMD is encountered? Assume it's
> clear later but the set at the moment kinda requires you to understand
> the entire series all at once.

The only tricky part of gup-fast is the fast path itself. The moment
you return zero you know you're on the slow path, which is safe and
simple.

This check below is also why pmdp_splitting_flush has to flush the
TLB: to stop this gup-fast code from running while we set the
splitting bit in the pmd.

The slow path will simply call wait_split_huge_page; gup-fast can't,
because it runs with irqs disabled and wait_split_huge_page would
never return, as the IPI wouldn't run.

I will add a comment.
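A minimal sketch of that rule (simplified, not the exact patch hunk;
pmd_trans_splitting is the helper this patchset introduces): gup-fast cannot
sleep in wait_split_huge_page because it runs with irqs disabled, so it must
bail out to the slow path as soon as it sees the splitting bit, exactly as it
already does for a none pmd.

/*
 * Return 0 to force the caller onto the sleepable slow path, which can
 * serialize against split_huge_page; return 1 when the pmd is stable
 * enough for the lockless fast path.
 */
static inline int gup_fast_pmd_usable(pmd_t pmd)
{
        if (pmd_none(pmd) || pmd_trans_splitting(pmd))
                return 0;
        return 1;
}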


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 14 of 28] pte alloc trans splitting
  2009-12-18 19:03   ` Mel Gorman
@ 2009-12-19 15:59     ` Andrea Arcangeli
  2009-12-21 19:57       ` Mel Gorman
  0 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 15:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Fri, Dec 18, 2009 at 07:03:34PM +0000, Mel Gorman wrote:
> On Thu, Dec 17, 2009 at 07:00:17PM -0000, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > pte alloc routines must wait for split_huge_page if the pmd is not
> > present and not null (i.e. pmd_trans_splitting).
> 
> More stupid questions. When a large page is about to be split, you clear the
> present bit to cause faults and hold those accesses until the split completes?

That was the previous version. The new version doesn't clear the
present bit but sets its own reserved bit in the pmd. All we have to
protect against is kernel code, not userland. We have to protect
against anything that would change the mapcount. The mapcount is the
key here, as it is only accounted in the head page and it has to be
transferred to all the tail pages during the split. So during the
split the mapcount can't change. But that doesn't mean userland can't
keep changing and reading the page contents while we transfer the
mapcount.
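A rough sketch of the pte-alloc side of that rule (helper names follow the
patchset, the exact call signature is an assumption): an allocation path that
finds a pmd marked as splitting has to wait for the split to finish before
touching the page table under it.

/*
 * Sketch only: callers that would instantiate a pte page under a pmd
 * that split_huge_page is currently working on must wait, so the
 * mapcount transfer to the tail pages cannot race with them.
 */
static void pte_alloc_wait_split(struct vm_area_struct *vma, pmd_t *pmd)
{
        if (pmd_trans_splitting(*pmd))
                wait_split_huge_page(vma->anon_vma, pmd);
}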

> Again, no doubt this is obvious later but a description in the leader of
> the basic approach to splitting huge pages wouldn't kill.

Yes, sure, good idea. I added a comment at the most crucial point... not
in the header.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -628,11 +628,28 @@ static void __split_huge_page_refcount(s
 		 */
 		smp_wmb();
 
+		/*
+		 * __split_huge_page_splitting() already set the
+		 * splitting bit in all pmd that could map this
+		 * hugepage, that will ensure no CPU can alter the
+		 * mapcount on the head page. The mapcount is only
+		 * accounted in the head page and it has to be
+		 * transferred to all tail pages in the below code. So
+		 * for this code to be safe, during the split the mapcount
+		 * can't change. But that doesn't mean userland can't
+		 * keep changing and reading the page contents while
+		 * we transfer the mapcount, so the pmd splitting
+		 * status is achieved setting a reserved bit in the
+		 * pmd, not by clearing the present bit.
+		*/
 		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
+
 		BUG_ON(page_tail->mapping);
 		page_tail->mapping = page->mapping;
+
 		page_tail->index = ++head_index;
+
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-18 20:03   ` Mel Gorman
@ 2009-12-19 16:41     ` Andrea Arcangeli
  2009-12-21 20:31       ` Mel Gorman
  0 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-19 16:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

> On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote:
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > new file mode 100644
> > --- /dev/null
> > +++ b/include/linux/huge_mm.h
> > @@ -0,0 +1,110 @@
> > +#ifndef _LINUX_HUGE_MM_H
> > +#define _LINUX_HUGE_MM_H
> > +
> > +extern int do_huge_anonymous_page(struct mm_struct *mm,
> > +				  struct vm_area_struct *vma,
> > +				  unsigned long address, pmd_t *pmd,
> > +				  unsigned int flags);
> > +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > +			 struct vm_area_struct *vma);
> > +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > +			   unsigned long address, pmd_t *pmd,
> > +			   pmd_t orig_pmd);
> > +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> 
On Fri, Dec 18, 2009 at 08:03:46PM +0000, Mel Gorman wrote:
> The naming of "huge" might bite in the ass later if/when transparent
> support is applied to multiple page sizes. Granted, it's not happening
> any time soon.

Granted ;). But why not huge? I think you just want to add "pmd" there
maybe, like do_huge_pmd_anonymous_page and do_huge_pmd_wp_page. The
other two already look fine to me. Huge means it's part of the
hugepage support, so I would keep it; otherwise you'd end up with a
name like get_pmd_pte (which is less intuitive than get_pmd_huge_pte).
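For concreteness, the rename agreed on here would just be (prototypes copied
from the quoted header, only the names changed):

extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
                                      struct vm_area_struct *vma,
                                      unsigned long address, pmd_t *pmd,
                                      unsigned int flags);
extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                               unsigned long address, pmd_t *pmd,
                               pmd_t orig_pmd);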

> > +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> > +					  unsigned long addr,
> > +					  pmd_t *pmd,
> > +					  unsigned int flags);
> > +extern int zap_pmd_trans_huge(struct mmu_gather *tlb,
> > +			      struct vm_area_struct *vma,
> > +			      pmd_t *pmd);
> > +
> > +enum transparent_hugepage_flag {
> > +	TRANSPARENT_HUGEPAGE_FLAG,
> > +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> 
> Defrag is misleading. Glancing through the rest of the patch, "try harder"
> would be a more appropriate term because it uses __GFP_REPEAT.

No. Yes, open source has the defect that some people have nothing
better to do, so they break visible kernel APIs in /sys and /proc,
mark them obsoleted in make menuconfig, and then go fix up userland,
often with little practical gain, for purely aesthetic, purist
reasons. To try to avoid that, I just tried to make a visible kernel
API that has a chance to survive a year of development without
breaking userland.

That means calling it "defrag", because "defrag" eventually will
happen. Right now the best approximation is __GFP_REPEAT, so be it,
but the visible kernel API must be designed in a way that isn't tied
to the current internal implementation or cleverness of defrag. So
please help in fighting the constant API breakage in /sys and those
OBSOLETE marks in menuconfig (you may disable it if your userland is
up to date, etc...).

In fact I ask you to review from the entirely opposite angle, thinking
more long term. Still trying not to overdesign, though.

> 
> > diff --git a/mm/Makefile b/mm/Makefile
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
> >  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> >  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> >  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > new file mode 100644
> > --- /dev/null
> > +++ b/mm/huge_memory.c
> 
> Similar on naming. Later someone will get congused as to why there is
> hugetlbfs and huge_memory.

Why? Unless you want to change the HUGETLBFS name too and remove HUGE
from there as well, there's absolutely no reason to remove the huge
name from transparent hugepages. In fact, to the contrary, I used HUGE
exactly because that is what hugetlbfs calls them! Otherwise I would
have used transparent largepages. Whatever hugetlbfs uses, transparent
hugepage also has to use that naming, otherwise it's a mess. They're
identical features: one is transparently allocated, the other is not
and requires a pseudo-fs to allocate the pages. The result,
performance, and pagetable layout generated are identical too.

I've no clue what confusion you're worried about here; I didn't call
this hugetlbfs. This is transparent_hugepage, and it seems really
straightforward and obvious what that means (same thing, one through a
fs, the other transparent).

You can argue I should have called it transparent_hugetlb! That one we
can argue about, but arguing about the "huge" doesn't make sense to
me.

If you want me to rename this to transparent_hugetlb, I'll do it, so
it's even more clear what the only difference between hugetlbfs and
transparent_hugetlb is. But personally I think hugetlb is not the
appropriate name, for one reason only: later we may want to use
hugepages for pagecache too, and those pagecache pages will be
hugepages, but not mapped by any TLB entry if they're the result of a
read/write. In fact this is true for hugetlbfs too when you read/write:
no TLB involvement at all! Which is why hugetlbfs should also be
renamed to hugepagefs, if anything!

> > +static int __do_huge_anonymous_page(struct mm_struct *mm,
> > +				    struct vm_area_struct *vma,
> > +				    unsigned long address, pmd_t *pmd,
> > +				    struct page *page,
> 
> Maybe this should be do_pmd_anonymous page and match what do_anonymous_page
> does as much as possible. This might offset any future problems related to
> transparently handling pages at other page table levels.

Why not do_huge_pmd_anonymous_page? This is a huge pmd after all, and
as said above, removing huge from all places is going to screw over
some functions that are specific to huge pmds and not to regular pmds.

I'll make this change, it's fine with me.

> > +				    unsigned long haddr)
> > +{
> > +	int ret = 0;
> > +	pgtable_t pgtable;
> > +
> > +	VM_BUG_ON(!PageCompound(page));
> > +	pgtable = pte_alloc_one(mm, address);
> > +	if (unlikely(!pgtable)) {
> > +		put_page(page);
> > +		return VM_FAULT_OOM;
> > +	}
> > +
> > +	clear_huge_page(page, haddr, HPAGE_NR);
> > +
> 
> Ideally insead of defining things like HPAGE_NR, the existing functions for
> multiple huge page sizes would be extended to return the "huge page size
> corresponding to a PMD".

You mean PMD_SIZE? Again, this is the whole discussion of whether
HPAGE should be nuked entirely in favour of PMD_something.

I'm unsure whether favouring the PMD/PUD nomenclature is the way to
go, considering the main complaint one can have is about archs that
may have a mixed page size that doesn't match PMD/PUD at all! I'm open
to suggestions; it's just that worrying about a huge PUD seems not
realistic, while a mixed page size that won't ever match pmd or pud is
more realistic. power can't do it as it can't fall back, but maybe
ia64 or others can, I don't know. Surely anything realistic won't
match PUD, and this is my main reason for disliking binding the whole
patch to pmd sizes as if PUD sizes were relevant.

> > +	__SetPageUptodate(page);
> > +	smp_wmb();
> > +
> 
> Need to explain why smp_wmb() is needed there. It doesn't look like
> you're protecting the bit set itself. More likely you are making sure
> the writes in clear_huge_page() have finished but that's a guess.
> Comment.

Yes. Same as __pte_alloc. You're not the first to ask this; I'll add a
comment.
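A hedged sketch of the ordering in question (identifiers come from the quoted
hunk; the construction of the pmd entry, the page_table_lock and the pgtable
deposit are omitted here), with the kind of comment that could sit next to the
barrier:

static void publish_huge_page_sketch(struct mm_struct *mm, struct page *page,
                                     unsigned long haddr, pmd_t *pmd,
                                     pmd_t entry)
{
        clear_huge_page(page, haddr, HPAGE_NR);
        __SetPageUptodate(page);
        /*
         * Make the zeroed contents and PG_uptodate visible before the
         * pmd is populated, for the same reason __pte_alloc() issues a
         * barrier before pmd_populate(): no walker must ever see the
         * hugepage mapped while its contents are still stale.
         */
        smp_wmb();
        set_pmd_at(mm, haddr, pmd, entry);
}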

> > +	spin_lock(&mm->page_table_lock);
> > +	if (unlikely(!pmd_none(*pmd))) {
> > +		put_page(page);
> > +		pte_free(mm, pgtable);
> 
> Racing fault already filled in the PTE? If so, comment please. Again,
> matching how do_anonymous_page() does a similar job would help
> comprehension.

Yes, a racing thread already mapped in a huge pmd (or a pte if the
hugepage allocation failed). Adding a comment... ;)

> > +	if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) {
> > +		if (unlikely(anon_vma_prepare(vma)))
> > +			return VM_FAULT_OOM;
> > +		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
> > +				   (transparent_hugepage_defrag(vma) ?
> 
> GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable
> is set in /proc/sys/vm. This should be GFP_HIGHUSER only.

Why? Either we move htlb_alloc_mask into common code so that it exists
even when HUGETLBFS=n (like I had to do with the copy_huge routines in
order to share as much as possible with hugetlbfs), or this should
remain movable to avoid crippling the feature.
hugepages_treat_as_movable right now only applies to hugetlbfs. We
only have to decide whether to apply it to transparent hugepages too.
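A sketch of the alternative being discussed, assuming the
hugepages_treat_as_movable policy were shared with transparent hugepages
instead of hardcoding either mask (the helper name is invented; the gfp flags
are the real ones):

static gfp_t thp_alloc_mask_sketch(bool treat_as_movable, bool defrag)
{
        /* follow the hugetlbfs sysctl instead of always using MOVABLE */
        gfp_t gfp = (treat_as_movable ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER) |
                __GFP_COMP;

        /* current approximation of "defrag": just try harder */
        if (defrag)
                gfp |= __GFP_REPEAT;
        return gfp;
}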

> > +	if (transparent_hugepage_debug_cow() && new_page) {
> > +		put_page(new_page);
> > +		new_page = NULL;
> > +	}
> > +	if (unlikely(!new_page)) {
> 
> This entire block needs be in a demote_pmd_page() or something similar.
> It's on the hefty side for being in the main function. That said, I
> didn't spot anything wrong in there either.

Yeah, this is a cleanup I should do, but it's not as easy as it looks,
or I would have done it already when Adam asked me a few weeks ago.

> > +			}
> > +		}
> > +
> > +		spin_lock(&mm->page_table_lock);
> > +		if (unlikely(!pmd_same(*pmd, orig_pmd)))
> > +			goto out_free_pages;
> > +		else
> > +			get_page(page);
> > +		spin_unlock(&mm->page_table_lock);
> > +
> > +		might_sleep();
> 
> Is this check really necessary? We could already go alseep easier when
> allocating pages.

Ok, removed might_sleep().

> 
> > +		for (i = 0; i < HPAGE_NR; i++) {
> > +			copy_user_highpage(pages[i], page + i,
> 
> More nasty naming there. Needs to be cleared that pages is your demoted
> base pages and page is the existing compound page.

What exactly is not clear? You already asked to move this code into a
separate function. What else is needed to document that this is the
"fallback" copy to 4k pages? Renaming pages[] to 4k_pages[] doesn't
seem necessary to me; besides, copy_user_highpage works on PAGE_SIZE,
not HPAGE_SIZE.

> > +		pmd_t _pmd;
> > +		/*
> > +		 * We should set the dirty bit only for FOLL_WRITE but
> > +		 * for now the dirty bit in the pmd is meaningless.
> > +		 * And if the dirty bit will become meaningful and
> > +		 * we'll only set it with FOLL_WRITE, an atomic
> > +		 * set_bit will be required on the pmd to set the
> > +		 * young bit, instead of the current set_pmd_at.
> > +		 */
> > +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> > +		set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd);
> > +	}
> > +	page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT;
> 
> More HPAGE vs PMD here.

All of them or none; not sure why you mention it only for the MASK,
maybe it's just an accident. Every single HPAGE_SIZE would have to be
changed too, not just HPAGE_MASK, or it's pointless.

> > +static int __split_huge_page_splitting(struct page *page,
> > +				       struct vm_area_struct *vma,
> > +				       unsigned long address)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	pmd_t *pmd;
> > +	int ret = 0;
> > +
> > +	spin_lock(&mm->page_table_lock);
> > +	pmd = page_check_address_pmd(page, mm, address,
> > +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> > +	if (pmd) {
> > +		/*
> > +		 * We can't temporarily set the pmd to null in order
> > +		 * to split it, pmd_huge must remain on at all times.
> > +		 */
> 
> Why, to avoid a double fault? Or to avoid a case where the huge page is
> being split, another fault occurs and zero-filled pages get faulted in?

Well, initially I did pmdp_clear_flush and overwrote the pmd
afterwards. It was a nasty race to find and fix; I wasted some time on
it. Yes, once the pmd is zero anything can happen: it's like a page
not faulted in yet, and nobody will take the pmd_huge slow path
anymore to serialize against split_huge_page with the anon_vma->lock.
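A conceptual sketch of that point (simplified, not the actual
__split_huge_page_splitting; pmd_mksplitting is assumed here as the natural
counterpart of pmd_trans_splitting): the split marks the pmd instead of
clearing it, so concurrent faults still see a huge pmd and serialize on the
anon_vma lock instead of instantiating fresh pages under a null pmd.

static void mark_pmd_splitting_sketch(struct vm_area_struct *vma,
                                      unsigned long address, pmd_t *pmd)
{
        /* keep the pmd present and huge, only add the splitting marker */
        set_pmd_at(vma->vm_mm, address, pmd, pmd_mksplitting(*pmd));
        /* flush so gup-fast re-reads the pmd and bails out to the slow path */
        flush_tlb_range(vma, address, address + HPAGE_SIZE);
}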
 
> I'm afraid I ran out of time at this point. It'll be after the holidays
> before I get time for a proper go at it. Sorry.

Understood. The main issue I see in your comments is the pmd vs huge
naming. Please consider what I mentioned above about the more
realistic case of different hpage sizes that won't match either
pmd/pud, and the pud size being unusable until we get RAM sizes that
are orders of magnitude larger. Then decide whether to change every
single HPAGE to PMD or to stick with this. I'm personally neutral; I
_never_ care about names, I only care about what assembly gcc produces.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Swap on flash SSDs
  2009-12-18 19:39                 ` Ingo Molnar
  2009-12-18 20:13                   ` Linus Torvalds
@ 2009-12-19 18:38                   ` Jörn Engel
  1 sibling, 0 replies; 89+ messages in thread
From: Jörn Engel @ 2009-12-19 18:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Mike Travis, Christoph Lameter, Rik van Riel,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Mel Gorman,
	Andi Kleen, Benjamin Herrenschmidt, KAMEZAWA Hiroyuki,
	Chris Wright, Andrew Morton, Stephen C. Tweedie, Linus Torvalds

On Fri, 18 December 2009 20:39:11 +0100, Ingo Molnar wrote:
> 
> And even when a cell does go bad and all the spares are gone, the failure mode 
> is not catastrophic like with a hard disk, but that particular cell goes 
> read-only and you can still recover the info and use the remaining cells.

Pretty much all modern flash suffers write disturb and even read
disturb.  So if any cell (I guess you mean block?) goes read-only,
errors will start to accumulate and ultimately defeat error correction.

Yes, you only have a couple of bit flips.  A sufficiently motivated
human can salvage a lot of data from such a device.  But read-only does
not mean error-free.

Plus Linus' comment about firmware bugs, of course. ;)

Jörn

-- 
Mundie uses a textbook tactic of manipulation: start with some
reasonable talk, and lead the audience to an unreasonable conclusion.
-- Bruce Perens


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-18 23:06       ` KAMEZAWA Hiroyuki
@ 2009-12-20 18:39         ` Andrea Arcangeli
  2009-12-21  0:26           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-20 18:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton

On Sat, Dec 19, 2009 at 08:06:50AM +0900, KAMEZAWA Hiroyuki wrote:
> My intentsion was adding a patch for adding "pagesize" parameters
> to charge/uncharge function may be able to reduce size of changes.

There's no need for that, as my patch shows, and I doubt it makes a
lot of difference at runtime, but it's up to you, I'm neutral. My
suggestion is that you send me a patch and I'll integrate and use your
version ;). I'll take care of adapting huge_memory.c myself if you
want to add the size param to the outer call.

Now, if I manage to finish this khugepaged work, I could do a new
submission with a new round of changes, cleanups and stats (the latest
polishing is especially thanks to Mel's review).

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-20 18:39         ` Andrea Arcangeli
@ 2009-12-21  0:26           ` KAMEZAWA Hiroyuki
  2009-12-21  1:24             ` Daisuke Nishimura
  0 siblings, 1 reply; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-21  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, Andrew Morton, nishimura

On Sun, 20 Dec 2009 19:39:43 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Sat, Dec 19, 2009 at 08:06:50AM +0900, KAMEZAWA Hiroyuki wrote:
> > My intentsion was adding a patch for adding "pagesize" parameters
> > to charge/uncharge function may be able to reduce size of changes.
> 
> There's no need for that as my patch shows and I doubt it makes a lot
> of difference at runtime, but it's up to you, I'm neutral. I suggest
> is that you send me a patch and I integrate and use your version
> ;). I'll take care of adapting huge_memory.c myself if you want to add
> the size param to the outer call.
> 
Added CC: to Nishimura.

Andrea, please go ahead as you like. My only concern is a conflict with
Nishimura's work. He's preparing a patch for "task move", which has been
requested since the start of memcg. He's done really good work and enough
testing in these 2 months.

So, what I think now is to merge Nishimura's patches into mmotm first and
import yours on top of them, if Nishimura-san can post a ready-to-merge
version this year. Nishimura-san, what do you think?

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-21  0:26           ` KAMEZAWA Hiroyuki
@ 2009-12-21  1:24             ` Daisuke Nishimura
  2009-12-21  3:52               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 89+ messages in thread
From: Daisuke Nishimura @ 2009-12-21  1:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton, Daisuke Nishimura

On Mon, 21 Dec 2009 09:26:25 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Sun, 20 Dec 2009 19:39:43 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > On Sat, Dec 19, 2009 at 08:06:50AM +0900, KAMEZAWA Hiroyuki wrote:
> > > My intentsion was adding a patch for adding "pagesize" parameters
> > > to charge/uncharge function may be able to reduce size of changes.
> > 
> > There's no need for that as my patch shows and I doubt it makes a lot
> > of difference at runtime, but it's up to you, I'm neutral. I suggest
> > is that you send me a patch and I integrate and use your version
> > ;). I'll take care of adapting huge_memory.c myself if you want to add
> > the size param to the outer call.
> > 
> Added CC: to Nishimura.
> 
> Andrea, Please go ahead as you like. My only concern is a confliction with
> Nishimura's work.
I agree. I've already noticed Andrea's patches but haven't read through all
of them yet, sorry.

One concern: isn't there any inconsistency in handling css->refcnt when
charging/uncharging compound pages the same way as a normal page?

> He's preparing a patch for "task move", which has been
> requested since the start of memcg. He've done really good jobs and enough
> tests in these 2 months.
> 
> So, what I think now is to merge Nishimura's to mmotm first and import your
> patches on it if Nishimura-san can post ready-to-merge version in this year.
> Nishimura-san, what do you think ?
> 
I would say, "yes, I agree with you" ;)
Anyway, I'm preparing my patches for the next post, in which I've fixed the
bug I found in the previous (Dec/14) version. I'll post them today or
tomorrow at the latest, and I think they are ready to be merged.


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-21  1:24             ` Daisuke Nishimura
@ 2009-12-21  3:52               ` KAMEZAWA Hiroyuki
  2009-12-21  4:33                 ` Daisuke Nishimura
  0 siblings, 1 reply; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-21  3:52 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

On Mon, 21 Dec 2009 10:24:27 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 21 Dec 2009 09:26:25 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Added CC: to Nishimura.
> > 
> > Andrea, Please go ahead as you like. My only concern is a confliction with
> > Nishimura's work.
> I agree. I've already noticed Andrea's patches but not read through all the
> patches yet, sorry.
> 
> One concern: isn't there any inconsistency to handle css->refcnt in charging/uncharging
> compound pages the same way as a normal page ?
> 
AFAIK, no inconsistency.
My biggest concern is that the page-table walker has to handle hugepages.


> > He's preparing a patch for "task move", which has been
> > requested since the start of memcg. He've done really good jobs and enough
> > tests in these 2 months.
> > 
> > So, what I think now is to merge Nishimura's to mmotm first and import your
> > patches on it if Nishimura-san can post ready-to-merge version in this year.
> > Nishimura-san, what do you think ?
> > 
> I would say, "yes. I agree with you" ;)
> Anyway, I'm preparing my patches for next post, in which I've fixed the bug
> I found in previous(Dec/14) version. I'll post them today or tomorrow at the latest
> and I think they are ready to be merged.
> 
Ok, great.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-21  3:52               ` KAMEZAWA Hiroyuki
@ 2009-12-21  4:33                 ` Daisuke Nishimura
  2009-12-25  4:17                   ` Daisuke Nishimura
  0 siblings, 1 reply; 89+ messages in thread
From: Daisuke Nishimura @ 2009-12-21  4:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton, Daisuke Nishimura

On Mon, 21 Dec 2009 12:52:23 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 21 Dec 2009 10:24:27 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Mon, 21 Dec 2009 09:26:25 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > Added CC: to Nishimura.
> > > 
> > > Andrea, Please go ahead as you like. My only concern is a confliction with
> > > Nishimura's work.
> > I agree. I've already noticed Andrea's patches but not read through all the
> > patches yet, sorry.
> > 
> > One concern: isn't there any inconsistency to handle css->refcnt in charging/uncharging
> > compound pages the same way as a normal page ?
> > 
> AKAIK, no inconsistency.
O.K., thanks.
(It might be better for us to remove the per-page css refcnt by 2.6.34...)

> My biggest concern is that page-table-walker has to handle hugepages. 
> 
Ah, you're right.
It would be a big change..


Thanks,
Daisuke Nishimura

> 
> > > He's preparing a patch for "task move", which has been
> > > requested since the start of memcg. He've done really good jobs and enough
> > > tests in these 2 months.
> > > 
> > > So, what I think now is to merge Nishimura's to mmotm first and import your
> > > patches on it if Nishimura-san can post ready-to-merge version in this year.
> > > Nishimura-san, what do you think ?
> > > 
> > I would say, "yes. I agree with you" ;)
> > Anyway, I'm preparing my patches for next post, in which I've fixed the bug
> > I found in previous(Dec/14) version. I'll post them today or tomorrow at the latest
> > and I think they are ready to be merged.
> > 
> Ok, great.
> 
> Thanks,
> -Kame
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 14 of 28] pte alloc trans splitting
  2009-12-19 15:59     ` Andrea Arcangeli
@ 2009-12-21 19:57       ` Mel Gorman
  0 siblings, 0 replies; 89+ messages in thread
From: Mel Gorman @ 2009-12-21 19:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Sat, Dec 19, 2009 at 04:59:48PM +0100, Andrea Arcangeli wrote:
> On Fri, Dec 18, 2009 at 07:03:34PM +0000, Mel Gorman wrote:
> > On Thu, Dec 17, 2009 at 07:00:17PM -0000, Andrea Arcangeli wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > pte alloc routines must wait for split_huge_page if the pmd is not
> > > present and not null (i.e. pmd_trans_splitting).
> > 
> > More stupid questions. When a large page is about to be split, you clear the
> > present bit to cause faults and hold those accesses until the split completes?
> 
> That was previous version. New version doesn't clear the present bit
> but sets its own reserved bit in the pmd. All we have to protect is
> kernel code, not userland. We have to protect against anything that
> will change the mapcount. The mapcount is the key here, as it is only
> accounted in the head page and it has to be transferred to all tail
> pages during the split. So during the split the mapcount can't
> change. But that doesn't mean userland can't keep changing and reading
> the page contents while we transfer the mapcount.
> 

Ok, that makes sense. By having pte_alloc wait on split_huge_page, it
should be safe even if userspace calls fork(). No other gotcha springs
to mind.

> > Again, no doubt this is obvious later but a description in the leader of
> > the basic approach to splitting huge pages wouldn't kill.
> 
> Yes sure good idea, I added a comment in the most crucial point... not
> in the header.
> 

Thanks.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -628,11 +628,28 @@ static void __split_huge_page_refcount(s
>  		 */
>  		smp_wmb();
>  
> +		/*
> +		 * __split_huge_page_splitting() already set the
> +		 * splitting bit in all pmd that could map this
> +		 * hugepage, that will ensure no CPU can alter the
> +		 * mapcount on the head page. The mapcount is only
> +		 * accounted in the head page and it has to be
> +		 * transferred to all tail pages in the below code. So
> +		 * for this code to be safe, during the split the mapcount
> +		 * can't change. But that doesn't mean userland can't
> +		 * keep changing and reading the page contents while
> +		 * we transfer the mapcount, so the pmd splitting
> +		 * status is achieved setting a reserved bit in the
> +		 * pmd, not by clearing the present bit.
> +		*/
>  		BUG_ON(page_mapcount(page_tail));
>  		page_tail->_mapcount = page->_mapcount;
> +
>  		BUG_ON(page_tail->mapping);
>  		page_tail->mapping = page->mapping;
> +
>  		page_tail->index = ++head_index;
> +
>  		BUG_ON(!PageAnon(page_tail));
>  		BUG_ON(!PageUptodate(page_tail));
>  		BUG_ON(!PageDirty(page_tail));
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-19 16:41     ` Andrea Arcangeli
@ 2009-12-21 20:31       ` Mel Gorman
  2009-12-23  0:06         ` Andrea Arcangeli
  0 siblings, 1 reply; 89+ messages in thread
From: Mel Gorman @ 2009-12-21 20:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Sat, Dec 19, 2009 at 05:41:43PM +0100, Andrea Arcangeli wrote:
> > On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote:
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > new file mode 100644
> > > --- /dev/null
> > > +++ b/include/linux/huge_mm.h
> > > @@ -0,0 +1,110 @@
> > > +#ifndef _LINUX_HUGE_MM_H
> > > +#define _LINUX_HUGE_MM_H
> > > +
> > > +extern int do_huge_anonymous_page(struct mm_struct *mm,
> > > +				  struct vm_area_struct *vma,
> > > +				  unsigned long address, pmd_t *pmd,
> > > +				  unsigned int flags);
> > > +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > > +			 struct vm_area_struct *vma);
> > > +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +			   unsigned long address, pmd_t *pmd,
> > > +			   pmd_t orig_pmd);
> > > +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> > 
> On Fri, Dec 18, 2009 at 08:03:46PM +0000, Mel Gorman wrote:
> > The naming of "huge" might bite in the ass later if/when transparent
> > support is applied to multiple page sizes. Granted, it's not happening
> > any time soon.
> 
> Granted ;). But why not huge? I think you just want to add "pmd" there
> maybe, like do_huge_pmd_anonymous_page and do_huge_pmd_wp_page. The
> other two already looks fine to me. Huge means it's part of the
> hugepage support so I would keep it, otherwise you'd need to like a
> name like get_pmd_pte (that is less intuitiv than get_pmd_huge_pte).
> 

My vague worry is that multiple huge page sizes are currently supported in
hugetlbfs but transparent support is obviously tied to the page-table level
it's implemented for. In the future, the term "huge" could be ambiguous. How
about, instead of things like HUGE_MASK, using HUGE_PMD_MASK? It's not
something I feel very strongly about, as eventually I'll remember what sort
of "huge" is meant in each context.

> > > +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> > > +					  unsigned long addr,
> > > +					  pmd_t *pmd,
> > > +					  unsigned int flags);
> > > +extern int zap_pmd_trans_huge(struct mmu_gather *tlb,
> > > +			      struct vm_area_struct *vma,
> > > +			      pmd_t *pmd);
> > > +
> > > +enum transparent_hugepage_flag {
> > > +	TRANSPARENT_HUGEPAGE_FLAG,
> > > +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > > +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> > 
> > Defrag is misleading. Glancing through the rest of the patch, "try harder"
> > would be a more appropriate term because it uses __GFP_REPEAT.
> 
> No. Yes, open source has the defect that some people has nothing
> better to do so they break visible kernel APIs in /sys /proc and marks
> them obsoleted in make menuconfig and they go fixup userland often
> with little practical gain but purely aesthetical purist reasons, so
> to try to avoid that I just tried to make a visible kernel API that
> has a chance to survive 1 year of development without breaking
> userland.
> 
> That means calling it "defrag" because "defrag" eventually will
> happen. Right now the best approximation is __GFP_REPEAT, so be it,
> but the visible kernel API must be done in a way that isn't tied to
> current internal implementation or cleverness of defrag. So please
> help in fighting the constant API breakage in /sys and those OBSOLETE
> marks in menuconfig (you may disable it if your userland is uptodate,
> etc...).
> 

You've fully convinced me. Put a comment there to the effect of:

/*
 * Currently uses __GFP_REPEAT during allocation. Should be implemented
 * using page migration in the future.
 */

> In fact I ask you to review from the entirely opposite side, so
> thinking more long term. Still trying not to overdesign though.
> 
> > 
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
> > >  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> > >  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> > >  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> > > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > new file mode 100644
> > > --- /dev/null
> > > +++ b/mm/huge_memory.c
> > 
> > Similar on naming. Later someone will get congused as to why there is
> > hugetlbfs and huge_memory.
> 
> Why? Unless you want to change HUGETLBFS name too and remove HUGE from
> there too, there's absolutely no reason to remove the huge name from
> transparent hugepages. In fact to the contrary I used HUGE exactly
> because that is how hugetlbfs call them! Otherwise I would have used
> transparent largepages. Whatever hugetlbfs uses, transparent hugepage
> also has to use that naming. Otherwise it's a mess. They're indentical
> features, one is transparent allocated, the other is not and requires
> an pseudo-fs to allocate them. The result, performance, and pagetable
> layout generated is identical too.
> 
> I've no clue what confusion you're worried about here,

I was looking at it from the wrong angle. I saw the name in the context of
mm/hugetlb.c and felt it was unclear. Now that I look at it again, I should
have seen it as a "huge" version of memory.c. Sorry for the noise.

> I didn't call
> this hugetlbfs. This is transparent_hugepage, and that seems really
> strightforwad and obvious what it means (same thing, one through fs,
> the other transparent).
> 
> You can argue I should have called it transparent_hugetlb! That one we
> can argue about, but arguing about the "huge" doesn't make sense to
> me.
> 
> If you want to rename this to transparent_hugetlb and I'll do it. So
> it's even more clear the only difference between hugetlbfs and
> transparent_hugetlb.

Leave it as-is. I'm now seeing it as a huge version of memory.c and it's
clearer.

> But personally I think hugetlb is not the
> appropriate name only for one reason: later we may want to use
> hugepages on pagecache too, but those pagecache will be hugepages, but
> not mapped by any tlb if they're the result of a read/write. In fact
> this is true for hugetlbfs too when you read/write, no tlb involvement
> at all! Which is why hugetlbfs should also renamed to hugepagefs if something!
> 
> > > +static int __do_huge_anonymous_page(struct mm_struct *mm,
> > > +				    struct vm_area_struct *vma,
> > > +				    unsigned long address, pmd_t *pmd,
> > > +				    struct page *page,
> > 
> > Maybe this should be do_pmd_anonymous page and match what do_anonymous_page
> > does as much as possible. This might offset any future problems related to
> > transparently handling pages at other page table levels.
> 
> Why not do_huge_pmd_anonymous_page. This is an huge pmd after all and
> as said above removing huge from all places it's going to screw over
> some funtions that are specific for huge pmd and not for regular pmd.
> 

do_huge_pmd_anonymous_page makes sense.

> I'll make this change, it's fine with me.
> 
> > > +				    unsigned long haddr)
> > > +{
> > > +	int ret = 0;
> > > +	pgtable_t pgtable;
> > > +
> > > +	VM_BUG_ON(!PageCompound(page));
> > > +	pgtable = pte_alloc_one(mm, address);
> > > +	if (unlikely(!pgtable)) {
> > > +		put_page(page);
> > > +		return VM_FAULT_OOM;
> > > +	}
> > > +
> > > +	clear_huge_page(page, haddr, HPAGE_NR);
> > > +
> > 
> > Ideally, instead of defining things like HPAGE_NR, the existing functions for
> > multiple huge page sizes would be extended to return the "huge page size
> > corresponding to a PMD".
> 
> You mean PMD_SIZE? Again this is the whole discussion of whether HPAGE should
> be nuked altogether in favour of PMD_something.
>
> I'm unsure if favouring the PMD/PUD nomenclature is the way to go,
> considering the main complaint one can have is for archs that may have
> mixed page sizes that aren't a match for PMD/PUD at all!

As it's currently tied to the PMD, the naming should reflect it. If an
architecture does want transparent hugepage support but the target
size is not at the PMD level, it will need to make some major modifications
anyway. I'm effectively off-line in terms of access to sources and work
material, so at the moment I'm having trouble seeing how an architecture
would handle the problem.

> I'm open to
> suggestions; it's just that worrying about huge PUDs seems unrealistic, while a
> mixed page size that won't ever match pmd or pud is more
> realistic. Power can't do it as it can't fall back, but maybe ia64 or
> others can, I don't know.

IA-64 can't in its current implementation. Due to the page table format
it uses, huge pages can only be mapped at specific ranges in the virtual
address space. If the long-format version of the page table were used, it
would be able to, but I bet that's not happening any time soon. The best bet
for other architectures supporting this would be sparc and maybe sh.
It might be worth poking Paul Mundt in particular because he expressed
an interest in transparent support of some sort in the past for sh.

> Surely anything realistic won't match
> PUD; this is my main reason for disliking binding the whole patch to
> pmd sizes as if PUD sizes were relevant.
> 

Again, I'm not going to make a major issue of it. It'd be my preference
but chances are I'll stop caring once I've read the patchset three or
four more times.

> > > +	__SetPageUptodate(page);
> > > +	smp_wmb();
> > > +
> > 
> > Need to explain why smp_wmb() is needed there. It doesn't look like
> > you're protecting the bit set itself. More likely you are making sure
> > the writes in clear_huge_page() have finished but that's a guess.
> > Comment.
> 
> Yes. Same as __pte_alloc. You're not the first to ask about this; I'll add
> a comment.
> 

Thanks
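
Something along these lines would do (just a sketch, the exact wording is
up to you; I'm assuming the barrier is there to order the clear_huge_page()
writes against the pmd becoming visible):

	clear_huge_page(page, haddr, HPAGE_NR);

	__SetPageUptodate(page);
	/*
	 * The barrier makes sure the page contents written by
	 * clear_huge_page() and the Uptodate bit are visible to other
	 * CPUs before set_pmd_at() makes the hugepage reachable.
	 */
	smp_wmb();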

> > > +	spin_lock(&mm->page_table_lock);
> > > +	if (unlikely(!pmd_none(*pmd))) {
> > > +		put_page(page);
> > > +		pte_free(mm, pgtable);
> > 
> > Racing fault already filled in the PTE? If so, comment please. Again,
> > matching how do_anonymous_page() does a similar job would help
> > comprehension.
> 
> Yes, a racing thread already mapped in a large pmd (or a pte if the hugepage
> allocation failed). Adding a comment... ;)
> 

Thanks
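
For the record, a comment as short as this would cover it (sketch only):

	spin_lock(&mm->page_table_lock);
	if (unlikely(!pmd_none(*pmd))) {
		/*
		 * A racing fault already installed a mapping here
		 * (either a huge pmd or, if the hugepage allocation
		 * failed, a regular pte page), so back out: drop our
		 * page and the preallocated pgtable.
		 */
		put_page(page);
		pte_free(mm, pgtable);
	}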

> > > +	if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) {
> > > +		if (unlikely(anon_vma_prepare(vma)))
> > > +			return VM_FAULT_OOM;
> > > +		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
> > > +				   (transparent_hugepage_defrag(vma) ?
> > 
> > GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable
> > is set in /proc/sys/vm. This should be GFP_HIGHUSER only.
> 
> Why?

Because huge pages cannot move. If the MOVABLE zone has been set up to
guarantee memory hot-plug removal, they don't want huge pages to be
getting in the way. To allow unconditional use of GFP_HIGHUSER_MOVABLE,
memory hotplug would have to know it can demote all the transparent huge
pages and migrate them that way.

> Either we move htlb_alloc_mask into common code so that it exists
> even when HUGETLBFS=n (like I had to do with the copy_huge
> routines, to share as much as possible with hugetlbfs), or this should
> remain movable to avoid crippling the
> feature.

My preference would be to move the alloc_mask into common code or at
least make it available via mm/internal.h because otherwise this will
collide with memory hot-remove in the future.

> hugepages_treat_as_movable right now only applies to
> hugetlbfs. We only have to decide whether to apply it to transparent
> hugepages too.
> 

I see no problem with applying it to transparent hugepages as well.
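
If it does get applied, something as simple as this in common code would be
enough (an untested sketch; the helper name is made up and
hugepages_treat_as_movable would need to be visible outside hugetlbfs):

/* mirror the hugetlbfs policy for transparent hugepage allocations */
static inline gfp_t transparent_hugepage_gfpmask(void)
{
	return hugepages_treat_as_movable ?
		GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
}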

> > > +	if (transparent_hugepage_debug_cow() && new_page) {
> > > +		put_page(new_page);
> > > +		new_page = NULL;
> > > +	}
> > > +	if (unlikely(!new_page)) {
> > 
> > This entire block needs to be in a demote_pmd_page() or something similar.
> > It's on the hefty side for being in the main function. That said, I
> > didn't spot anything wrong in there either.
> 
> Yeah this is a cleanup I should do but it's not as easy as it looks or
> I would have done it already when Adam asked me a few weeks ago.
> 

Ok.

> > > +			}
> > > +		}
> > > +
> > > +		spin_lock(&mm->page_table_lock);
> > > +		if (unlikely(!pmd_same(*pmd, orig_pmd)))
> > > +			goto out_free_pages;
> > > +		else
> > > +			get_page(page);
> > > +		spin_unlock(&mm->page_table_lock);
> > > +
> > > +		might_sleep();
> > 
> > Is this check really necessary? We could already go to sleep more easily
> > when allocating pages.
> 
> Ok, removed might_sleep().
> 
> > 
> > > +		for (i = 0; i < HPAGE_NR; i++) {
> > > +			copy_user_highpage(pages[i], page + i,
> > 
> > More nasty naming there. It needs to be clearer that "pages" is your demoted
> > base pages and "page" is the existing compound page.
> 
> what exactly is not clear?

Because ordinarily "pages" is just the plural of page. It was not
immediately clear that this was dest_pages and src_page. I guess if it
was in a separate function, it would have been a lot more obvious.
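
Something like the following is roughly what I had in mind (purely
illustrative, the helper name is invented):

/* copy a huge source page into the already-allocated 4k destination pages */
static void copy_huge_page_to_base_pages(struct page **dst_pages,
					 struct page *src_hpage,
					 unsigned long address,
					 struct vm_area_struct *vma)
{
	int i;

	for (i = 0; i < HPAGE_NR; i++)
		copy_user_highpage(dst_pages[i], src_hpage + i,
				   address + i * PAGE_SIZE, vma);
}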

> You already asked to move this code into a
> separate function. What else is there to document? This is the "fallback"
> copy to 4k pages. Renaming pages[] to 4k_pages[] doesn't seem necessary to
> me; besides, copy_user_highpage works on PAGE_SIZE, not HPAGE_SIZE.
> 
> > > +		pmd_t _pmd;
> > > +		/*
> > > +		 * We should set the dirty bit only for FOLL_WRITE but
> > > +		 * for now the dirty bit in the pmd is meaningless.
> > > +		 * And if the dirty bit will become meaningful and
> > > +		 * we'll only set it with FOLL_WRITE, an atomic
> > > +		 * set_bit will be required on the pmd to set the
> > > +		 * young bit, instead of the current set_pmd_at.
> > > +		 */
> > > +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> > > +		set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd);
> > > +	}
> > > +	page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT;
> > 
> > More HPAGE vs PMD here.
> 
> All of them or none; I'm not sure why you mention it only for the MASK, maybe
> it's just an accident. Every single HPAGE_SIZE has to be changed too!

Yes, it would. I didn't point it out every time.

> Not just HPAGE_MASK, or it's pointless.
> 
> > > +static int __split_huge_page_splitting(struct page *page,
> > > +				       struct vm_area_struct *vma,
> > > +				       unsigned long address)
> > > +{
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	pmd_t *pmd;
> > > +	int ret = 0;
> > > +
> > > +	spin_lock(&mm->page_table_lock);
> > > +	pmd = page_check_address_pmd(page, mm, address,
> > > +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> > > +	if (pmd) {
> > > +		/*
> > > +		 * We can't temporarily set the pmd to null in order
> > > +		 * to split it, pmd_huge must remain on at all times.
> > > +		 */
> > 
> > Why, to avoid a double fault? Or to avoid a case where the huge page is
> > being split, another fault occurs and zero-filled pages get faulted in?
> 
> Well, initially I did pmdp_clear_flush and overwrote the pmd. It was a
> nasty race to find and fix; I wasted some time on it.

Ok

> Once the pmd
> is zero, anything can happen: it's like a page not faulted in yet, and
> nobody will take the slow path of pmd_huge anymore to serialize
> against split_huge_page with anon_vma->lock.
>  

Thanks for the explanation.
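
So, if I've followed correctly, the splitting side ends up looking roughly
like this (my own sketch based on the patch, so take the helper names with
a grain of salt):

	spin_lock(&mm->page_table_lock);
	pmd = page_check_address_pmd(page, mm, address,
				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
	if (pmd) {
		/*
		 * Mark the pmd as splitting without ever clearing it,
		 * so pmd_huge() stays true and racing faults keep
		 * serializing against split_huge_page via
		 * anon_vma->lock instead of instantiating new mappings.
		 */
		set_pmd_at(mm, address, pmd, pmd_mksplitting(*pmd));
		ret = 1;
	}
	spin_unlock(&mm->page_table_lock);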


> > I'm afraid I ran out of time at this point. It'll be after the holidays
> > before I get time for a proper go at it. Sorry.
> 
> Understood. The main trouble I see in your comments is the pmd vs huge
> naming. Please consider what I mentioned above about the more realistic
> case of different hpage sizes that won't match either pmd or pud, and
> pud sizes being unusable until we get higher orders of magnitude of ram.
> Then decide whether to change every single HPAGE to PMD or to stick
> with this. I'm personally neutral; I _never_ care about names, I only
> care about what assembly gcc produces.
> 

I would prefer pmd to be added to the huge names. However, this was
mostly to aid comprehension of the patchset when I was taking a quick
read. Once I get the chance to read it often enough, I'll care a lot
less.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-21 20:31       ` Mel Gorman
@ 2009-12-23  0:06         ` Andrea Arcangeli
  2009-12-23  6:09           ` Paul Mundt
  2010-01-03 18:38           ` Mel Gorman
  0 siblings, 2 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-23  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton, Paul Mundt

On Mon, Dec 21, 2009 at 08:31:50PM +0000, Mel Gorman wrote:
> My vague worry is that multiple huge page sizes are currently supported in
> hugetlbfs but transparent support is obviously tied to the page-table level
> it's implemented for. In the future, the term "huge" could be ambiguous. How
> about, instead of things like HUGE_MASK, using HUGE_PMD_MASK? It's not
> something I feel very strongly about as eventually I'll remember what sort of
> "huge" is meant in each context.

Ok, this naming seems to be a little troublesome. HUGE_PMD_MASK would
then require HUGE_PMD_SIZE. That is a little confusing to me: that is
the size of the page, not of the pmd... Maybe HPAGE_PMD_SIZE is better?
Overall this is just one #define and a search and replace; I can do that
if people like it more than HPAGE_SIZE.

> /*
>  * Currently uses  __GFP_REPEAT during allocation. Should be implemented
>  * using page migration in the future
>  */

Done! thanks.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,6 +75,11 @@ static ssize_t enabled_store(struct kobj
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
 
+/*
+ * Currently uses __GFP_REPEAT during allocation. Should be
+ * implemented using page migration and real defrag algorithms in
+ * future VM.
+ */
 static ssize_t defrag_show(struct kobject *kobj,
 			   struct kobj_attribute *attr, char *buf)
 {

> do_huge_pmd_anonymous_page makes sense.

Agreed, I already changed all methods called from memory.c to
huge_memory.c with a "huge_pmd" prefix instead of just "huge".

> IA-64 can't in its currently implementation. Due to the page table format
> they use, huge pages can only be mapped at specific ranges in the virtual
> address space. If the long-format version of the page table was used, they

Hmm ok, so it sounds like hugetlbfs limitations are a software feature
for ia64 too.

> would be able to but I bet it's not happening any time soon. The best bet
> for other architectures supporting this would be sparc and maybe sh.
> It might be worth poking Paul Mundt in particular because he expressed
> an interest in transparent support of some sort in the past for sh.

I added him to CC.

> Because huge pages cannot move. If the MOVABLE zone has been set up to
> guarantee memory hot-plug removal, they don't want huge pages to be
> getting in the way. To allow unconditional use of GFP_HIGHUSER_MOVABLE,
> memory hotplug would have to know it can demote all the transparent huge
> pages and migrate them that way.

It should already work: migrate.c calls try_to_unmap, which will split
them and then migrate them just fine. If they can't be migrated I will
remove GFP_HIGHUSER_MOVABLE, but I think they already can; migrate.c
can't notice the difference.

> My preference would be to move the alloc_mask into common code or at
> least make it available via mm/internal.h because otherwise this will
> collide with memory hot-remove in the future.

We can do that. But what I don't understand is why do_anonymous_page
uses an unconditional GFP_HIGHUSER_MOVABLE. If there's no benefit for
do_anonymous_page in turning off the gfp movable flag, I don't see why it
would be beneficial to turn it off for hugepages. If there's a good
reason for that we surely can make it conditional in common code. I
didn't look too hard for it, but what is the reason this flag exists
in hugetlbfs?

> I would prefer pmd to be added to the huge names. However, this was
> mostly to aid comprehension of the patchset when I was taking a quick

That is neutral to me... it's just that HPAGE_SIZE already existed so
I tried to avoid adding unnecessary things, but I'm not against
HPAGE_PMD_SIZE; that will make it clearer this is the size of a
hugepage mapped by a pmd (and not a gigapage mapped by a pud).

Thanks for the help! (We'll need more of your help in the defrag area
too, according to the comment added above ;)

Andrea


* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-23  0:06         ` Andrea Arcangeli
@ 2009-12-23  6:09           ` Paul Mundt
  2010-01-03 18:38           ` Mel Gorman
  1 sibling, 0 replies; 89+ messages in thread
From: Paul Mundt @ 2009-12-23  6:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton, linux-sh

On Wed, Dec 23, 2009 at 01:06:40AM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 21, 2009 at 08:31:50PM +0000, Mel Gorman wrote:
> > IA-64 can't in its currently implementation. Due to the page table format
> > they use, huge pages can only be mapped at specific ranges in the virtual
> > address space. If the long-format version of the page table was used, they
> 
> Hmm ok, so it sounds like hugetlbfs limitations are a software feature
> for ia64 too.
> 
> > would be able to but I bet it's not happening any time soon. The best bet
> > for other architectures supporting this would be sparc and maybe sh.
> > It might be worth poking Paul Mundt in particular because he expressed
> > an interest in transparent support of some sort in the past for sh.
> 
> I added him to CC.
> 
Thanks. It's probably worth going over a bit of background of the SH TLB
and the hugetlb support. For starters, it's a software loaded TLB, and
while we have 2-levels in hardware, extra levels do get abused in
software for certain configurations.

Varying page sizes are just PTE attributes and these are supported at
4kB, 8kB, 64kB, 256kB, 1MB, 4MB, and 64MB on general parts. SH-5 also has
a 512MB page size, but this tends to mainly be used for fixed-purpose
section mappings. Where the system page sizes stop and the hugetlb sizes
start is pretty arbitrary; generally these were from 64kB and up, but
there are systems using a 64kB PAGE_SIZE as well, in which case the
huge pages start at the next available size (you can see the dependencies
for these in arch/sh/mm/Kconfig).

Beyond that, there is also a section mapping buffer (PMB) that supports
sizes of 16MB, 64MB, 128MB, and 512MB. This has no miss exception
associated with it, or permission bits, so only tends to get used for
large kernel mappings (it has a wide range of differing cache attributes
at least, and all entries are pre-faulted). ioremap() backs through this
transparently at the moment, but there is no hugetlb support for it yet.
If hugetlb is going to become more transparent on the other hand, then
it's certainly worth looking at doing support for something like this at
the PMD level with special attributes and piggybacking the TLB miss. The
closest example to this on any other platform would probably be the PPC
SLB, which also seems to be a bit more capable.

As we have a software managed TLB, most of what I've toyed with in
regards to transparency has been using larger TLBs for contiguous page
ranges from the TLB miss while retaining a smaller PAGE_SIZE. We tend not
to have very many > 1 order contiguous allocations though, so 64kB and up
TLBs rarely get loaded. Some folks (it might have been Christoph) were
doing similar things on IA-64 by using special encodings for size and
section placement hinting, but I don't recall what became of this. There
were also some ARM folks who had attempted to do similar things by
scanning at set_pte() time at least for the XScale parts (due to having
to contend with hardware table walking), but that seems to have been
abandoned.


* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-17 19:00 ` [PATCH 28 of 28] memcg huge memory Andrea Arcangeli
  2009-12-18  1:33   ` KAMEZAWA Hiroyuki
@ 2009-12-24 10:00   ` Balbir Singh
  2009-12-24 11:40     ` Andrea Arcangeli
  1 sibling, 1 reply; 89+ messages in thread
From: Balbir Singh @ 2009-12-24 10:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

* Andrea Arcangeli <aarcange@redhat.com> [2009-12-17 19:00:31]:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add memcg charge/uncharge to hugepage faults in huge_memory.c.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -207,6 +207,7 @@ static int __do_huge_anonymous_page(stru
>  	VM_BUG_ON(!PageCompound(page));
>  	pgtable = pte_alloc_one(mm, address);
>  	if (unlikely(!pgtable)) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		return VM_FAULT_OOM;
>  	}
> @@ -218,6 +219,7 @@ static int __do_huge_anonymous_page(stru
> 
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> @@ -251,6 +253,10 @@ int do_huge_anonymous_page(struct mm_str
>  				   HPAGE_ORDER);
>  		if (unlikely(!page))
>  			goto out;
> +		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
> +			put_page(page);
> +			goto out;
> +		}
> 
>  		return __do_huge_anonymous_page(mm, vma,
>  						address, pmd,
> @@ -379,9 +385,16 @@ int do_huge_wp_page(struct mm_struct *mm
>  		for (i = 0; i < HPAGE_NR; i++) {
>  			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
>  						  vma, address);
> -			if (unlikely(!pages[i])) {
> -				while (--i >= 0)
> +			if (unlikely(!pages[i] ||
> +				     mem_cgroup_newpage_charge(pages[i],
> +							       mm,
> +							       GFP_KERNEL))) {
> +				if (pages[i])
>  					put_page(pages[i]);
> +				while (--i >= 0) {
> +					mem_cgroup_uncharge_page(pages[i]);
> +					put_page(pages[i]);
> +				}
>  				kfree(pages);
>  				ret |= VM_FAULT_OOM;
>  				goto out;
> @@ -439,15 +452,21 @@ int do_huge_wp_page(struct mm_struct *mm
>  		goto out;
>  	}
> 
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		put_page(new_page);
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
>  	copy_huge_page(new_page, page, haddr, vma, HPAGE_NR);
>  	__SetPageUptodate(new_page);
> 
>  	smp_wmb();
> 
>  	spin_lock(&mm->page_table_lock);
> -	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
> +		mem_cgroup_uncharge_page(new_page);
>  		put_page(new_page);
> -	else {
> +	} else {
>  		pmd_t entry;
>  		entry = mk_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -466,8 +485,10 @@ out:
>  	return ret;
> 
>  out_free_pages:
> -	for (i = 0; i < HPAGE_NR; i++)
> +	for (i = 0; i < HPAGE_NR; i++) {
> +		mem_cgroup_uncharge_page(pages[i]);
>  		put_page(pages[i]);
> +	}
>  	kfree(pages);
>  	goto out_unlock;
>  }
>

Charging huge pages might be OK, but I wonder if we should create a
separate counter since hugepages are not reclaimable.  I am yet to
look at the complete series, does this series make hugepages
reclaimable? Could you please update Documentation/cgroups/memcg* as
well.

-- 
	Balbir


* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-24 10:00   ` Balbir Singh
@ 2009-12-24 11:40     ` Andrea Arcangeli
  2009-12-24 12:07       ` Balbir Singh
  0 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2009-12-24 11:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

On Thu, Dec 24, 2009 at 03:30:30PM +0530, Balbir Singh wrote:
> Charging huge pages might be OK, but I wonder if we should create a
> separate counter since hugepages are not reclaimable.  I am yet to
> look at the complete series, does this series make hugepages
> reclaimable? Could you please update Documentation/cgroups/memcg* as
> well.

The transparent hugepages that you quoted are reclaimable (actually
swappable/pageable; reclaimable isn't the exact term for them), but the
point is that you can't see the difference from userland, so they don't
deserve a special counter. The hugetlbfs pages (not in the patch above)
are still not swappable, but they're not relevant to this
patchset. The whole point of transparent hugepages is that the user
shouldn't even notice they exist and it'll be the kernel that decides if
it's worth using them or not, and when to split them if needed. Apps,
however, should use madvise(MADV_HUGEPAGE) on large chunks of
malloc memory that will benefit from hugepages, because certain users
like embedded may want to turn off hugepages in all areas except the
ones marked by madvise. Transparent hugepages may or may not generate
some minor memory and CPU waste depending on usage, so for memory
constrained devices it's worth enabling them only where they generate
zero memory loss and zero CPU loss (even the preallocated pte that is
required to guarantee success of split_huge_page would have been
allocated anyway if hugepages were disabled).


* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-24 11:40     ` Andrea Arcangeli
@ 2009-12-24 12:07       ` Balbir Singh
  0 siblings, 0 replies; 89+ messages in thread
From: Balbir Singh @ 2009-12-24 12:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

* Andrea Arcangeli <aarcange@redhat.com> [2009-12-24 12:40:25]:

> On Thu, Dec 24, 2009 at 03:30:30PM +0530, Balbir Singh wrote:
> > Charging huge pages might be OK, but I wonder if we should create a
> > separate counter since hugepages are not reclaimable.  I am yet to
> > look at the complete series, does this series make hugepages
> > reclaimable? Could you please update Documentation/cgroups/memcg* as
> > well.
> 
> The transparent hugepages that you quoted are reclaimable (actually
> swappable/pageable; reclaimable isn't the exact term for them), but the
> point is that you can't see the difference from userland, so they don't
> deserve a special counter. The hugetlbfs pages (not in the patch above)
> are still not swappable, but they're not relevant to this
> patchset. The whole point of transparent hugepages is that the user
> shouldn't even notice they exist and it'll be the kernel that decides if
> it's worth using them or not, and when to split them if needed. Apps,
> however, should use madvise(MADV_HUGEPAGE) on large chunks of
> malloc memory that will benefit from hugepages, because certain users
> like embedded may want to turn off hugepages in all areas except the
> ones marked by madvise. Transparent hugepages may or may not generate
> some minor memory and CPU waste depending on usage, so for memory
> constrained devices it's worth enabling them only where they generate
> zero memory loss and zero CPU loss (even the preallocated pte that is
> required to guarantee success of split_huge_page would have been
> allocated anyway if hugepages were disabled).

The concern with hugepages (not transparent) is that they are locked
and might cause frequent OOMs. I think Kame raised this point as well.
Thanks for clarifying the patch though.

-- 
	Balbir


* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-21  4:33                 ` Daisuke Nishimura
@ 2009-12-25  4:17                   ` Daisuke Nishimura
  2009-12-25  4:37                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 89+ messages in thread
From: Daisuke Nishimura @ 2009-12-25  4:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton, Daisuke Nishimura

On Mon, 21 Dec 2009 13:33:15 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Mon, 21 Dec 2009 12:52:23 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 21 Dec 2009 10:24:27 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Mon, 21 Dec 2009 09:26:25 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > Added CC: to Nishimura.
> > > > 
> > > > Andrea, please go ahead as you like. My only concern is a conflict with
> > > > Nishimura's work.
> > > I agree. I've already noticed Andrea's patches but not read through all the
> > > patches yet, sorry.
> > > 
> > > One concern: isn't there any inconsistency in handling css->refcnt when charging/uncharging
> > > compound pages the same way as a normal page?
> > > 
> > AFAIK, no inconsistency.
> O.K. thanks.
> (It might be better for us to remove per page css refcnt till 2.6.34...)
> 
Hmm, if I understand these patches correctly, some inconsistency in css->refcnt
and the page_cgroup of the tail pages arises when a huge page is split.
At least, I think pc->flags and pc->mem_cgroup of the tail pages should be handled.

So, I think we need some hooks in __split_huge_page_map() or some tricks.
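
For example, a rough (completely untested) idea of the kind of hook I mean,
to be run on the tail pages while the compound page is being split:

	/* propagate the memcg ownership from the head page to each tail page */
	for (i = 1; i < HPAGE_NR; i++) {
		struct page_cgroup *head_pc = lookup_page_cgroup(page);
		struct page_cgroup *tail_pc = lookup_page_cgroup(page + i);

		tail_pc->mem_cgroup = head_pc->mem_cgroup;
		tail_pc->flags = head_pc->flags;
	}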

> > My biggest concern is that page-table-walker has to handle hugepages. 
> > 
> Ah, you're right.
> It would be a big change..
> 
In [19/28] of this version, split_huge_page_mm() is called in walk_pmd_range().
So, I think it will work w/o changing the current code.
(It might be better to change my code, which does all the work in walk->pmd_entry(),
to prevent unnecessary splitting.)


Thanks,
Daisuke Nishimura.


* Re: [PATCH 28 of 28] memcg huge memory
  2009-12-25  4:17                   ` Daisuke Nishimura
@ 2009-12-25  4:37                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 89+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-25  4:37 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Andi Kleen, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	Andrew Morton

On Fri, 25 Dec 2009 13:17:00 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 21 Dec 2009 13:33:15 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > On Mon, 21 Dec 2009 12:52:23 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 21 Dec 2009 10:24:27 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > On Mon, 21 Dec 2009 09:26:25 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > Added CC: to Nishimura.
> > > > > 
> > > > > Andrea, please go ahead as you like. My only concern is a conflict with
> > > > > Nishimura's work.
> > > > I agree. I've already noticed Andrea's patches but not read through all the
> > > > patches yet, sorry.
> > > > 
> > > > One concern: isn't there any inconsistency in handling css->refcnt when charging/uncharging
> > > > compound pages the same way as a normal page?
> > > > 
> > > AFAIK, no inconsistency.
> > O.K. thanks.
> > (It might be better for us to remove per page css refcnt till 2.6.34...)
> > 
> Hmm, if I understand these patches correctly, some inconsistency in css->refcnt
> and the page_cgroup of the tail pages arises when a huge page is split.
> At least, I think pc->flags and pc->mem_cgroup of the tail pages should be handled.
> 
> So, I think we need some hooks in __split_huge_page_map() or some tricks.
> 
Ah, yes.

> > > My biggest concern is that page-table-walker has to handle hugepages. 
> > > 
> > Ah, you're right.
> > It would be a big change..
> > 
> In [19/28] of this version, split_huge_page_mm() is called in walk_pmd_range().
> So, I think it will work w/o changing the current code.
> (It might be better to change my code, which does all the work in walk->pmd_entry(),
> to prevent unnecessary splitting.)
> 

Ok, thank you.
-Kame


* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-23  0:06         ` Andrea Arcangeli
  2009-12-23  6:09           ` Paul Mundt
@ 2010-01-03 18:38           ` Mel Gorman
  2010-01-04 15:49             ` Andrea Arcangeli
  2010-01-04 16:58             ` Christoph Lameter
  1 sibling, 2 replies; 89+ messages in thread
From: Mel Gorman @ 2010-01-03 18:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton, Paul Mundt

On Wed, Dec 23, 2009 at 01:06:40AM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 21, 2009 at 08:31:50PM +0000, Mel Gorman wrote:
> > My vague worry is that multiple huge page sizes are currently supported in
> > hugetlbfs but transparent support is obviously tied to the page-table level
> > it's implemented for. In the future, the term "huge" could be ambiguous. How
> > about, instead of things like HUGE_MASK, using HUGE_PMD_MASK? It's not
> > something I feel very strongly about as eventually I'll remember what sort of
> > "huge" is meant in each context.
> 
> Ok, this naming seems to be a little troublesome. HUGE_PMD_MASK would
> then require HUGE_PMD_SIZE. That is a little confusing to me: that is
> the size of the page, not of the pmd... Maybe HPAGE_PMD_SIZE is better?

HPAGE_PMD_SIZE is better

> Overall this is just one #define and a search and replace; I can do that
> if people like it more than HPAGE_SIZE.
> 
> > /*
> >  * Currently uses  __GFP_REPEAT during allocation. Should be implemented
> >  * using page migration in the future
> >  */
> 
> Done! thanks.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -75,6 +75,11 @@ static ssize_t enabled_store(struct kobj
>  static struct kobj_attribute enabled_attr =
>  	__ATTR(enabled, 0644, enabled_show, enabled_store);
>  
> +/*
> + * Currently uses __GFP_REPEAT during allocation. Should be
> + * implemented using page migration and real defrag algorithms in
> + * future VM.
> + */
>  static ssize_t defrag_show(struct kobject *kobj,
>  			   struct kobj_attribute *attr, char *buf)
>  {
> 
> > do_huge_pmd_anonymous_page makes sense.
> 
> Agreed, I already changed all methods called from memory.c to
> huge_memory.c with a "huge_pmd" prefix instead of just "huge".
> 
> > IA-64 can't in its current implementation. Due to the page table format
> > it uses, huge pages can only be mapped at specific ranges in the virtual
> > address space. If the long-format version of the page table were used, it
> 
> Hmm ok, so it sounds like hugetlbfs limitations are a software feature
> for ia64 too.
> 

It's not hugetlbfs that is the problem, it's the page table format
itself. There is a more flexible long-form pagetable format
available on the hardware, but Linux doesn't use it.

In theory, you could implement transparent support on IA-64 without
disabling the short-form pagetable format by disabling the hardware
pagetable walker altogether and handling TLB misses in software but it
would likely be an overall loss.

> > would be able to but I bet it's not happening any time soon. The best bet
> > for other architectures supporting this would be sparc and maybe sh.
> > It might be worth poking Paul Mundt in particular because he expressed
> > an interest in transparent support of some sort in the past for sh.
> 
> I added him to CC.
> 
> > Because huge pages cannot move. If the MOVABLE zone has been set up to
> > guarantee memory hot-plug removal, they don't want huge pages to be
> > getting in the way. To allow unconditional use of GFP_HIGHUSER_MOVABLE,
> > memory hotplug would have to know it can demote all the transparent huge
> > pages and migrate them that way.
> 
> It should already work: migrate.c calls try_to_unmap, which will split
> them and then migrate them just fine. If they can't be migrated I will
> remove GFP_HIGHUSER_MOVABLE, but I think they already can; migrate.c
> can't notice the difference.
> 

Ok, if it is the case that the huge pages get demoted and migrated, then
the use of GFP_HIGHUSER_MOVABLE is not a problem.

> > My preference would be to move the alloc_mask into common code or at
> > least make it available via mm/internal.h because otherwise this will
> > collide with memory hot-remove in the future.
> 
> We can do that. But what I don't understand is why do_anonymous_page
> uses an unconditional GFP_HIGHUSER_MOVABLE.

Because it can be migrated.

> If there's no benefit for
> do_anonymous_page in turning off the gfp movable flag, I don't see why it
> would be beneficial to turn it off for hugepages.

There is no benefit in turning off the gfp movable flag. The presence of
the flag allows the use of ZONE_MOVABLE, i.e. there is more physical
memory that can potentially be used.

> If there's a good
> reason for that we surely can make it conditional in common code. I
> didn't look too hard for it, but what is the reason this flag exists
> in hugetlbfs?
> 

hugetlbfs does not use the flag by default because its pages cannot be migrated
(it could be implemented of course, but it hasn't been to date). The flag is
conditionally used because ZONE_MOVABLE can be used to almost guarantee that
X number of hugepages can always be allocated regardless of the fragmentation
state of the system. It's an "almost" guarantee because we do not have memory
defragmentation to move mlocked pages.

> > I would prefer pmd to be added to the huge names. However, this was
> > mostly to aid comprehension of the patchset when I was taking a quick
> 
> That is neutral to me... it's just that HPAGE_SIZE already existed so
> I tried to avoid adding unnecessary things, but I'm not against
> HPAGE_PMD_SIZE; that will make it clearer this is the size of a
> hugepage mapped by a pmd (and not a gigapage mapped by a pud).
> 

Agreed.

> Thanks for the help! (We'll need more of your help in the defrag area
> too, according to the comment added above ;)
> 

I prototyped memory defragmentation ages ago. It worked for the most part
but has bit-rotted significantly. I really should dig it out from
whatever hole I left it in.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 25 of 28] transparent hugepage core
  2009-12-17 19:00 ` [PATCH 25 of 28] transparent hugepage core Andrea Arcangeli
  2009-12-18 20:03   ` Mel Gorman
@ 2010-01-04  6:16   ` Daisuke Nishimura
  2010-01-04 16:04     ` Andrea Arcangeli
  1 sibling, 1 reply; 89+ messages in thread
From: Daisuke Nishimura @ 2010-01-04  6:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton, Daisuke Nishimura

Hi.

> +static int __do_huge_anonymous_page(struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
> +				    unsigned long address, pmd_t *pmd,
> +				    struct page *page,
> +				    unsigned long haddr)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, address);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_NR);
> +
> +	__SetPageUptodate(page);
> +	smp_wmb();
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		put_page(page);
> +		pte_free(mm, pgtable);
> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +	
> +	return ret;
> +}
> +
IIUC, page_add_new_anon_rmap() (and add_page_to_lru_list(), which will be called
in that call path) will update the zone state for NR_ANON_PAGES and NR_ACTIVE_ANON.
Shouldn't we also modify the zone state code for transparent hugepage support?


Thanks,
Daisuke Nishimura.


* Re: [PATCH 25 of 28] transparent hugepage core
  2010-01-03 18:38           ` Mel Gorman
@ 2010-01-04 15:49             ` Andrea Arcangeli
  2010-01-04 16:58             ` Christoph Lameter
  1 sibling, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2010-01-04 15:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Andi Kleen, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton, Paul Mundt

On Sun, Jan 03, 2010 at 06:38:03PM +0000, Mel Gorman wrote:
> HPAGE_PMD_SIZE is better

Ok I converted the whole patchset after adding:

#define HPAGE_PMD_SHIFT HPAGE_SHIFT
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE

to huge_mm.h.

> Ok, if it is the case that the huge pages get demoted and migrated, then
> the use of GFP_HIGHUSER_MOVABLE is not a problem.

Yes, they're identical to regular pages; this is the whole point of
transparency. So I'll keep only MOVABLE.

> There is no benefit in turning off the gfp movable flag. The presence of

Agreed.

> I prototyped memory defragmentation ages ago. It worked for the most part
> but has bit-rotted significantly. I really should dig it out from
> whatever hole I left it in.

You really should. Luckily, despite the code having moved around heavily,
the internal design is about identical, so you will have to rewrite it,
but the algorithms won't need to change substantially, except to handle
all the new features and more tedious accounting than before.

Marcelo also had a patch in the defrag area. Also, khugepaged is currently
defrag unaware, which means it will only wait for new hugepages to be added
to the freelist; it won't create them itself. That should change, but until
there's a real defrag algorithm I don't want to waste CPU in a non-targeted
way.


* Re: [PATCH 25 of 28] transparent hugepage core
  2010-01-04  6:16   ` Daisuke Nishimura
@ 2010-01-04 16:04     ` Andrea Arcangeli
  0 siblings, 0 replies; 89+ messages in thread
From: Andrea Arcangeli @ 2010-01-04 16:04 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Andi Kleen,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	Andrew Morton

Hi,

On Mon, Jan 04, 2010 at 03:16:49PM +0900, Daisuke Nishimura wrote:
> IIUC, page_add_new_anon_rmap() (and add_page_to_lru_list(), which will be called
> in that call path) will update the zone state for NR_ANON_PAGES and NR_ACTIVE_ANON.
> Shouldn't we also modify the zone state code for transparent hugepage support?

Correct. I made more changes in the last weeks besides the work on
khugepaged. This is the relevant one that you couldn't see and that
already takes care of the above. Maybe I should send a new update for
this and other bits even though the last bit of khugepaged isn't working
yet; otherwise I'll wait a little more and get the whole thing working. Let
me know. This is combined with other changes to the split logic, which
now transfers a single huge anon page into 512 regular anon pages.
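
(The hunks below use HPAGE_PMD_NR, which isn't visible in the diff context;
in my tree it's simply derived from the pmd shift, roughly like this:)

#define HPAGE_PMD_ORDER	(HPAGE_PMD_SHIFT - PAGE_SHIFT)
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)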

----
Subject: transparent hugepage vmstat

From: Andrea Arcangeli <aarcange@redhat.com>

Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		"HardwareCorrupted: %5lu kB\n"
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		"AnonHugePages:  %8lu kB\n"
+#endif
 		,
 		K(i.totalram),
 		K(i.freeram),
@@ -151,6 +154,10 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	NR_ANON_TRANSPARENT_HUGEPAGES,
+#endif
 	NR_VM_ZONE_STAT_ITEMS };
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,10 @@ static void __split_huge_page_refcount(s
 		put_page(page_tail);
 	}
 
+	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+			      HPAGE_PMD_NR);
+
 	ClearPageCompound(page);
 	compound_unlock(page);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,8 +692,13 @@ void page_add_anon_rmap(struct page *pag
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	VM_BUG_ON(PageTail(page));
-	if (first)
-		__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (first) {
+		if (!PageCompound(page))
+			__inc_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__inc_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 	if (unlikely(PageKsm(page)))
 		return;
 
@@ -722,7 +727,10 @@ void page_add_new_anon_rmap(struct page 
 	VM_BUG_ON(PageTail(page));
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (!PageCompound(page))
+	    __inc_zone_page_state(page, NR_ANON_PAGES);
+	else
+	    __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__page_set_anon_rmap(page, vma, address);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -770,7 +778,11 @@ void page_remove_rmap(struct page *page)
 	}
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
-		__dec_zone_page_state(page, NR_ANON_PAGES);
+		if (!PageCompound(page))
+			__dec_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__dec_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -655,6 +655,9 @@ static const char * const vmstat_text[] 
 	"numa_local",
 	"numa_other",
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"nr_anon_transparent_hugepages",
+#endif
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	"pgpgin",


* Re: [PATCH 25 of 28] transparent hugepage core
  2010-01-03 18:38           ` Mel Gorman
  2010-01-04 15:49             ` Andrea Arcangeli
@ 2010-01-04 16:58             ` Christoph Lameter
  1 sibling, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2010-01-04 16:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton,
	Paul Mundt

On Sun, 3 Jan 2010, Mel Gorman wrote:

> I prototyped memory defragmentation ages ago. It worked for the most part
> but has bit-rotted significantly. I really should dig it out from
> whatever hole I left it in.

Yes please.


Thread overview: 89+ messages
2009-12-17 19:00 [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 01 of 28] compound_lock Andrea Arcangeli
2009-12-17 19:46   ` Christoph Lameter
2009-12-18 14:27     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 02 of 28] alter compound get_page/put_page Andrea Arcangeli
2009-12-17 19:50   ` Christoph Lameter
2009-12-18 14:30     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 03 of 28] clear compound mapping Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 04 of 28] add native_set_pmd_at Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 05 of 28] add pmd paravirt ops Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 06 of 28] no paravirt version of pmd ops Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 07 of 28] export maybe_mkwrite Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 08 of 28] comment reminder in destroy_compound_page Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 09 of 28] config_transparent_hugepage Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 10 of 28] add pmd mangling functions to x86 Andrea Arcangeli
2009-12-18 18:56   ` Mel Gorman
2009-12-19 15:27     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 11 of 28] add pmd mangling generic functions Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 12 of 28] special pmd_trans_* functions Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 13 of 28] bail out gup_fast on freezed pmd Andrea Arcangeli
2009-12-18 18:59   ` Mel Gorman
2009-12-19 15:48     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 14 of 28] pte alloc trans splitting Andrea Arcangeli
2009-12-18 19:03   ` Mel Gorman
2009-12-19 15:59     ` Andrea Arcangeli
2009-12-21 19:57       ` Mel Gorman
2009-12-17 19:00 ` [PATCH 15 of 28] add pmd mmu_notifier helpers Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 16 of 28] clear page compound Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 17 of 28] add pmd_huge_pte to mm_struct Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 18 of 28] ensure mapcount is taken on head pages Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 19 of 28] split_huge_page_mm/vma Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 20 of 28] split_huge_page paging Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 21 of 28] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 22 of 28] clear_huge_page fix Andrea Arcangeli
2009-12-18 19:16   ` Mel Gorman
2009-12-17 19:00 ` [PATCH 23 of 28] clear_copy_huge_page Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 24 of 28] kvm mmu transparent hugepage support Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 25 of 28] transparent hugepage core Andrea Arcangeli
2009-12-18 20:03   ` Mel Gorman
2009-12-19 16:41     ` Andrea Arcangeli
2009-12-21 20:31       ` Mel Gorman
2009-12-23  0:06         ` Andrea Arcangeli
2009-12-23  6:09           ` Paul Mundt
2010-01-03 18:38           ` Mel Gorman
2010-01-04 15:49             ` Andrea Arcangeli
2010-01-04 16:58             ` Christoph Lameter
2010-01-04  6:16   ` Daisuke Nishimura
2010-01-04 16:04     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 26 of 28] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 27 of 28] memcg compound Andrea Arcangeli
2009-12-18  1:27   ` KAMEZAWA Hiroyuki
2009-12-18 16:02     ` Andrea Arcangeli
2009-12-17 19:00 ` [PATCH 28 of 28] memcg huge memory Andrea Arcangeli
2009-12-18  1:33   ` KAMEZAWA Hiroyuki
2009-12-18 16:04     ` Andrea Arcangeli
2009-12-18 23:06       ` KAMEZAWA Hiroyuki
2009-12-20 18:39         ` Andrea Arcangeli
2009-12-21  0:26           ` KAMEZAWA Hiroyuki
2009-12-21  1:24             ` Daisuke Nishimura
2009-12-21  3:52               ` KAMEZAWA Hiroyuki
2009-12-21  4:33                 ` Daisuke Nishimura
2009-12-25  4:17                   ` Daisuke Nishimura
2009-12-25  4:37                     ` KAMEZAWA Hiroyuki
2009-12-24 10:00   ` Balbir Singh
2009-12-24 11:40     ` Andrea Arcangeli
2009-12-24 12:07       ` Balbir Singh
2009-12-17 19:54 ` [PATCH 00 of 28] Transparent Hugepage support #2 Christoph Lameter
2009-12-17 19:58   ` Rik van Riel
2009-12-17 20:09     ` Christoph Lameter
2009-12-18  5:12       ` Ingo Molnar
2009-12-18  6:18         ` KOSAKI Motohiro
2009-12-18 18:28         ` Christoph Lameter
2009-12-18 18:41           ` Dave Hansen
2009-12-18 19:17             ` Mike Travis
2009-12-18 19:28               ` Swap on flash SSDs Dave Hansen
2009-12-18 19:38                 ` Andi Kleen
2009-12-18 19:39                 ` Ingo Molnar
2009-12-18 20:13                   ` Linus Torvalds
2009-12-18 20:31                     ` Ingo Molnar
2009-12-19 18:38                   ` Jörn Engel
2009-12-18 14:05       ` [PATCH 00 of 28] Transparent Hugepage support #2 Andrea Arcangeli
2009-12-18 18:33         ` Christoph Lameter
2009-12-19 15:09           ` Andrea Arcangeli
2009-12-17 20:47     ` Mike Travis
2009-12-18  3:28       ` Rik van Riel
2009-12-18 14:12       ` Andrea Arcangeli
2009-12-18 12:52     ` Avi Kivity
2009-12-18 18:47 ` Dave Hansen
2009-12-19 15:20   ` Andrea Arcangeli
