* [PATCH 00 of 41] Transparent Hugepage Support #17
@ 2010-04-02  0:41 Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
                   ` (41 more replies)
  0 siblings, 42 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

Hello,

With a testcase that heavily stresses forking and split_huge_page, I found a
slight problem, probably made visible by the anon_vma_chain: during the
anon_vma walk of __split_huge_page_splitting, page_check_address_pmd ran into a
pmd that had the splitting bit set. The splitting bit was set by a previously
forked process calling split_huge_page on its private page belonging to the
child anon_vma. The parent still has visibility on the vma of the child, so the
rmap walk of the parent covers the child too, but the split of the child page
can now happen in parallel. This triggered a VM_BUG_ON false positive, and
moving the check on the page above the VM_BUG_ON was enough to fix it (it would
not have been noticeable with CONFIG_DEBUG_VM=n). Everything runs flawlessly
again with the debug turned on.

@@ -1109,9 +1109,11 @@ new file mode 100644
 +	pmd = pmd_offset(pud, address);
 +	if (pmd_none(*pmd))
 +		goto out;
++	if (pmd_page(*pmd) != page)
++		goto out;
 +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
 +		  pmd_trans_splitting(*pmd));
-+	if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
++	if (pmd_trans_huge(*pmd)) {
 +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
 +			  !pmd_trans_splitting(*pmd));
 +		ret = pmd;

Then there was one more issue while testing ksm and khugepaged co-existing and
merging and collapsing pages on the same vma simultaneously (which works fine
now in #17). One check for PageTransCompound was missing in ksm and another had
to be converted from PageTransHuge to PageTransCompound (see the sketch of the
distinction below).
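
A minimal sketch of that distinction, assuming the definitions these
page-flag helpers get later in this series (illustrative only):

/*
 * PageTransHuge() must only be called on a head page, so it never
 * matches a tail page; PageTransCompound() matches head and tail
 * pages alike, which is what the ksm paths need.
 */
static inline int PageTransHuge(struct page *page)
{
	VM_BUG_ON(PageTail(page));
	return PageHead(page);
}

static inline int PageTransCompound(struct page *page)
{
	return PageCompound(page);
}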

This also has the fixed version of the remove-PG_buddy patch, which moves the
memory_hotplug bootmem typing code to use page->lru.next with a proper enum, to
free up mapcount -2 for the PG_buddy semantics.

Not included by email, but available in the directory below, is the latest
version of the ksm-swapcache fix (waiting for a comment from Hugh before
delivering it separately).

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-17/
	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.34-rc2-mm1/transparent_hugepage-17.gz

Thanks,
Andrea


* [PATCH 01 of 41] define MADV_HUGEPAGE
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Define MADV_HUGEPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
---

diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -53,6 +53,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -77,6 +77,8 @@
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 #define MADV_HWPOISON    100		/* poison a page for testing */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -59,6 +59,8 @@
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	67		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -83,6 +83,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,7 +45,7 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
-#define MADV_HUGEPAGE	15		/* Worth backing with hugepages */
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
 
 /* compatibility flags */
 #define MAP_FILE	0
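
As a userland usage sketch (assuming the new constant is visible through
<sys/mman.h>, or defined by hand against older headers), the flag is passed
like any other madvise hint:

#include <sys/mman.h>
#include <stdlib.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE	14	/* generic value from the hunks above */
#endif

int main(void)
{
	size_t len = 8UL << 20;
	void *p;

	/* 2MB-align the region so it can be backed by huge pmds */
	if (posix_memalign(&p, 2UL << 20, len))
		return 1;
	/* hint that this range is worth backing with hugepages */
	if (madvise(p, len, MADV_HUGEPAGE))
		return 1;	/* fails with EINVAL on kernels without THP */
	return 0;
}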


* [PATCH 02 of 41] compound_lock
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli
                   ` (39 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -13,6 +13,7 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/range.h>
+#include <linux/bit_spinlock.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -297,6 +298,20 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_lock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_unlock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_unlock(PG_compound_lock, &page->flags);
+#endif
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,9 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	PG_compound_lock,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -399,6 +402,12 @@ static inline void __ClearPageTail(struc
 #define __PG_MLOCKED		0
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
+#else
+#define __PG_COMPOUND_LOCK		0
+#endif
+
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -408,7 +417,8 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_COMPOUND_LOCK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
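
A sketch of the intended locking pattern, modeled on how the next patch
uses the lock from put_page (illustrative, not a new interface):

/*
 * Serialize a tail-page refcount transfer against
 * __split_huge_page_refcount(), which holds compound_lock() on the
 * head page while it splits the compound page.
 */
struct page *page_head = page->first_page;

if (get_page_unless_zero(page_head)) {	/* pin the head first */
	compound_lock(page_head);
	if (PageTail(page)) {
		/* still compound: the split is blocked on the
		 * compound_lock, refcounts can be adjusted safely */
	}
	compound_unlock(page_head);
	put_page(page_head);
}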


* [PATCH 03 of 41] alter compound get_page/put_page
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Alter compound get_page/put_page to keep references on the subpages too, in
order to allow __split_huge_page_refcount to split a hugepage even while
subpages have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,6 +16,16 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t 
 			put_page(page);
 			return 0;
 		}
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,6 +105,16 @@ static inline void get_head_page_multipl
 	atomic_add(nr, &page->_count);
 }
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -326,9 +326,17 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount can't run under
+		 * get_page().
+		 */
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,17 +55,82 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_cold_page(page, 0);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock(page_head);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock(page_head);
+				VM_BUG_ON(PageHead(page_head));
+			out_put_head:
+				if (put_page_testzero(page_head))
+					__put_single_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			compound_unlock(page_head);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+		} else {
+			/* page_head is a dangling pointer */
+			VM_BUG_ON(PageTail(page));
+			goto out_put_single;
+		}
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -74,7 +139,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
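
The resulting invariant, sketched (the head/tail names are illustrative):

/*
 * Pinning a tail page of a transparent hugepage now elevates both
 * the tail count and the head count, so __split_huge_page_refcount()
 * can hand the pin over to the former tail page when the compound
 * page is split:
 */
struct page *tail = head + 1;	/* any tail page of the THP "head" */

get_page(tail);		/* tail->_count++ and head->_count++ */
/* ... a gup user can keep this pin across a split ... */
put_page(tail);		/* put_compound_page() drops both again */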
 


* [PATCH 04 of 41] update futex compound knowledge
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli
                   ` (37 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

The futex code is smarter than most other gup_fast O_DIRECT users and knows
about the compound internals. However, doing a put_page(head_page) will now no
longer release the pin on the tail page taken by gup-fast, leading to all sorts
of refcounting bugchecks. Getting a stable head_page is a little tricky.

page_head = page is there because if this is not a tail page, it's also the
page_head. Only if this is a tail page is compound_head called; otherwise it's
guaranteed unnecessary. And if it is a tail page, compound_head has to run
atomically inside the irq-disabled section of __get_user_pages_fast before
returning; otherwise ->first_page won't be a stable pointer.

Disabling irqs before __get_user_pages_fast and re-enabling them after running
compound_head is needed because, if __get_user_pages_fast returns == 1, it
means the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait
for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn, when we
run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then, after we get to a stable head page, we are always
safe to call compound_lock, and after taking the compound lock on the head page
we can finally re-check whether the page returned by gup-fast is still a tail
page, in which case we're set and we didn't need to split the hugepage in order
to take a futex on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -218,7 +218,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page;
+	struct page *page, *page_head;
 	int err;
 
 	/*
@@ -250,10 +250,53 @@ again:
 	if (err < 0)
 		return err;
 
-	page = compound_head(page);
-	lock_page(page);
-	if (!page->mapping) {
-		unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page_head = page;
+	if (unlikely(PageTail(page))) {
+		put_page(page);
+		/* serialize against __split_huge_page_splitting() */
+		local_irq_disable();
+		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head)
+				get_page(page_head);
+			local_irq_enable();
+		} else {
+			local_irq_enable();
+			goto again;
+		}
+	}
+#else
+	page_head = compound_head(page);
+	if (page != page_head)
+		get_page(page_head);
+#endif
+
+	lock_page(page_head);
+	if (unlikely(page_head != page)) {
+		compound_lock(page_head);
+		if (unlikely(!PageTail(page))) {
+			compound_unlock(page_head);
+			unlock_page(page_head);
+			put_page(page_head);
+			put_page(page);
+			goto again;
+		}
+	}
+	if (!page_head->mapping) {
+		unlock_page(page_head);
+		if (page_head != page)
+			put_page(page_head);
 		put_page(page);
 		goto again;
 	}
@@ -265,19 +308,25 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page_head)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page->mapping->host;
-		key->shared.pgoff = page->index;
+		key->shared.inode = page_head->mapping->host;
+		key->shared.pgoff = page_head->index;
 	}
 
 	get_futex_key_refs(key);
 
-	unlock_page(page);
+	unlock_page(page_head);
+	if (page != page_head) {
+		VM_BUG_ON(!PageTail(page));
+		/* releasing compound_lock after page_lock won't matter */
+		compound_unlock(page_head);
+		put_page(page_head);
+	}
 	put_page(page);
 	return 0;
 }


* [PATCH 05 of 41] fix bad_page to show the real reason the page is bad
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli
                   ` (36 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

page_count shows the count of the head page, but the actual check is done on
the tail page, so show what is really being checked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5291,7 +5291,7 @@ void dump_page(struct page *page)
 {
 	printk(KERN_ALERT
 	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
-		page, page_count(page), page_mapcount(page),
+		page, atomic_read(&page->_count), page_mapcount(page),
 		page->mapping, page->index);
 	dump_page_flags(page->flags);
 }
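
For context, page_count() as currently defined resolves to the head page,
which is why the raw read is what the bugcheck actually sees (a sketch of
the existing include/linux/mm.h definition):

static inline int page_count(struct page *page)
{
	/* always reports the count of the head page */
	return atomic_read(&compound_head(page)->_count);
}

/*
 * free_pages_check() instead tests the count of the page being freed
 * itself, so dump_page() now prints atomic_read(&page->_count) for
 * that same page.
 */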


* [PATCH 06 of 41] clear compound mapping
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Clear the compound mapping for anonymous compound pages, as already happens for
regular anonymous pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -629,6 +629,8 @@ static void __free_pages_ok(struct page 
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
+	if (PageAnon(page))
+		page->mapping = NULL;
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
 	if (bad)


* [PATCH 07 of 41] add native_set_pmd_at
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Used by both the paravirt and non-paravirt set_pmd_at.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -528,6 +528,12 @@ static inline void native_set_pte_at(str
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp , pmd_t pmd)
+{
+	native_set_pmd(pmdp, pmd);
+}
+
 #ifndef CONFIG_PARAVIRT
 /*
  * Rules for using pte_update - it must be called after any PTE update which


* [PATCH 08 of 41] add pmd paravirt ops
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add the pmd_update/pmd_update_defer/set_pmd_at paravirt ops. Not all of them
might be strictly necessary (vmware needs pmd_update, Xen needs set_pmd_at,
nobody needs pmd_update_defer), but this keeps full symmetry with the pte
paravirt ops, which looks cleaner and simpler from a common code POV.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -440,6 +440,11 @@ static inline void pte_update(struct mm_
 {
 	PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
 }
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
 
 static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
@@ -447,6 +452,12 @@ static inline void pte_update_defer(stru
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
 static inline pte_t __pte(pteval_t val)
 {
 	pteval_t ret;
@@ -548,6 +559,18 @@ static inline void set_pte_at(struct mm_
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp, pmd_t pmd)
+{
+	if (sizeof(pmdval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+	else
+		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+#endif
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -266,10 +266,16 @@ struct pv_mmu_ops {
 	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+	void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+	void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp);
+	void (*pmd_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pmd_t *pmdp);
 
 	pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
+	.set_pmd_at = native_set_pmd_at,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+	.pmd_update = paravirt_nop,
+	.pmd_update_defer = paravirt_nop,
 
 	.ptep_modify_prot_start = __ptep_modify_prot_start,
 	.ptep_modify_prot_commit = __ptep_modify_prot_commit,


* [PATCH 09 of 41] no paravirt version of pmd ops
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -33,6 +33,7 @@ extern struct list_head pgd_list;
 #else  /* !CONFIG_PARAVIRT */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 #define set_pte_at(mm, addr, ptep, pte)	native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd)	native_set_pmd_at(mm, addr, pmdp, pmd)
 
 #define set_pte_atomic(ptep, pte)					\
 	native_set_pte_atomic(ptep, pte)
@@ -57,6 +58,8 @@ extern struct list_head pgd_list;
 
 #define pte_update(mm, addr, ptep)              do { } while (0)
 #define pte_update_defer(mm, addr, ptep)        do { } while (0)
+#define pmd_update(mm, addr, ptep)              do { } while (0)
+#define pmd_update_defer(mm, addr, ptep)        do { } while (0)
 
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)


* [PATCH 10 of 41] export maybe_mkwrite
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, for when it falls back to copying hugepages into
regular fragmented pages because hugepage allocation fails during COW.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -390,6 +390,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2031,19 +2031,6 @@ static inline int pte_unmap_same(struct 
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*
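
Typical fault-path usage, for reference (a sketch of the existing pattern
in mm/memory.c that huge_memory.c will mirror with pmds):

pte_t entry = mk_pte(page, vma->vm_page_prot);

/* make the pte writable only if the vma actually allows writing */
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte_at(mm, address, page_table, entry);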


* [PATCH 11 of 41] comment reminder in destroy_compound_page
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent
on its internal behavior.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -334,6 +334,7 @@ void prep_compound_page(struct page *pag
 	}
 }
 
+/* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;


* [PATCH 12 of 41] config_transparent_hugepage
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add config option.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -287,3 +287,17 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config TRANSPARENT_HUGEPAGE
+	bool "Transparent Hugepage support" if EMBEDDED
+	depends on X86_64
+	default y
+	help
+	  Transparent Hugepages allows the kernel to use huge pages and
+	  huge tlb transparently to the applications whenever possible.
+	  This feature can improve computing performance to certain
+	  applications by speeding up page faults during memory
+	  allocation, by reducing the number of tlb misses and by speeding
+	  up the pagetable walking.
+
+	  If memory constrained on embedded, you may want to say N.


* [PATCH 13 of 41] special pmd_trans_* functions
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 14 of 41] add pmd mangling generic functions Andrea Arcangeli
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, to allow gcc
to eliminate the transparent hugepage function calls at compile time without
additional #ifdefs (only the declarations of those functions have to be visible
to gcc; they won't be required at link time, so huge_memory.o doesn't have to
be built at all).

_PAGE_BIT_UNUSED1 is never used on pmds, only on ptes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
 #define	kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
 
 #define __HAVE_ARCH_PTE_SAME
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -344,6 +344,11 @@ extern void untrack_pfn_vma(struct vm_ar
 				unsigned long size);
 #endif
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_trans_huge(pmd) 0
+#define pmd_trans_splitting(pmd) 0
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_GENERIC_PGTABLE_H */
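
The effect on callers, sketched with a hypothetical caller (the helper name
is made up for illustration; the point is that gcc sees a constant 0 when
the option is off):

/*
 * With CONFIG_TRANSPARENT_HUGEPAGE=n, pmd_trans_huge() is the
 * compile-time constant 0, so the whole branch is dead code: the
 * body is eliminated and the huge-page helper it calls never has
 * to exist at link time.
 */
if (pmd_trans_huge(*pmd))
	return do_huge_pmd_fault_sketch(mm, vma, address, pmd);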


* [PATCH 14 of 41] add pmd mangling generic functions
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 15 of 41] add pmd mangling functions to x86 Andrea Arcangeli
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Some are needed to build but not actually used on archs not supporting
transparent hugepages. Others like pmdp_clear_flush are used by x86 too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -25,6 +25,26 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({								\
+		int __changed = !pmd_same(*(__pmdp), __entry);		\
+		VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);		\
+		if (__changed) {					\
+			set_pmd_at((__vma)->vm_mm, __address, __pmdp,	\
+				   __entry);				\
+			flush_tlb_range(__vma, __address,		\
+					(__address) + HPAGE_PMD_SIZE);	\
+		}							\
+		__changed;						\
+	})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define ptep_test_and_clear_young(__vma, __address, __ptep)		\
 ({									\
@@ -39,6 +59,25 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	int r = 1;							\
+	if (!pmd_young(__pmd))						\
+		r = 0;							\
+	else								\
+		set_pmd_at((__vma)->vm_mm, (__address),			\
+			   (__pmdp), pmd_mkold(__pmd));			\
+	r;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young(__vma, __address, __ptep)		\
 ({									\
@@ -50,6 +89,24 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__young = pmdp_test_and_clear_young(__vma, __address, __pmdp);	\
+	if (__young)							\
+		flush_tlb_range(__vma, __address,			\
+				(__address) + HPAGE_PMD_SIZE);		\
+	__young;							\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear(__mm, __address, __ptep)			\
 ({									\
@@ -59,6 +116,20 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_get_and_clear(__mm, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	pmd_clear((__mm), (__address), (__pmdp));			\
+	__pmd;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_get_and_clear(__mm, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 #define ptep_get_and_clear_full(__mm, __address, __ptep, __full)	\
 ({									\
@@ -90,6 +161,22 @@ do {									\
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp);	\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+	__pmd;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 struct mm_struct;
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
@@ -99,10 +186,45 @@ static inline void ptep_set_wrprotect(st
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_wrprotect(mm, address, pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_splitting_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = pmd_mksplitting(*(__pmdp));			\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd);		\
+	/* tlb flush only to serialize against gup-fast */		\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_splitting_flush(__vma, __address, __pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)	(pte_val(A) == pte_val(B))
 #endif
 
+#ifndef __HAVE_ARCH_PMD_SAME
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_same(A,B)	(pmd_val(A) == pmd_val(B))
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmd_same(A,B)	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
 #define page_test_dirty(page)		(0)
 #endif
@@ -347,6 +469,9 @@ extern void untrack_pfn_vma(struct vm_ar
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd) 0
 #define pmd_trans_splitting(pmd) 0
+#ifndef __HAVE_ARCH_PMD_WRITE
+#define pmd_write(pmd)	({ BUG(); 0; })
+#endif /* __HAVE_ARCH_PMD_WRITE */
 #endif
 
 #endif /* !__ASSEMBLY__ */
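
The override pattern these guards enable, sketched with the x86 case from
the next patch: an architecture defines the __HAVE_ARCH_* symbol and its
own prototype before asm-generic/pgtable.h is pulled in:

/* arch header (x86 does exactly this in the next patch): */
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
extern int pmdp_set_access_flags(struct vm_area_struct *vma,
				 unsigned long address, pmd_t *pmdp,
				 pmd_t entry, int dirty);

/* asm-generic/pgtable.h then skips its macro fallback entirely */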


* [PATCH 15 of 41] add pmd mangling functions to x86
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 14 of 41] add pmd mangling generic functions Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 16 of 41] bail out gup_fast on splitting pmd Andrea Arcangeli
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add the needed pmd mangling functions, in symmetry with their pte counterparts.
pmdp_freeze_flush is the only exception, present only on the pmd side: it's
needed to serialize the VM against split_huge_page. It simply atomically clears
the present bit, in the same way pmdp_clear_flush_young atomically clears the
accessed bit (and both need to flush the tlb for the change to take effect,
which must happen synchronously for pmdp_freeze_flush).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -300,15 +300,15 @@ pmd_t *populate_extra_pmd(unsigned long 
 pte_t *populate_extra_pte(unsigned long vaddr);
 #endif	/* __ASSEMBLY__ */
 
+#ifndef __ASSEMBLY__
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_X86_32
 # include "pgtable_32.h"
 #else
 # include "pgtable_64.h"
 #endif
 
-#ifndef __ASSEMBLY__
-#include <linux/mm_types.h>
-
 static inline int pte_none(pte_t pte)
 {
 	return !pte.pte;
@@ -351,7 +351,7 @@ static inline unsigned long pmd_page_vad
  * Currently stuck as a macro due to indirect forward reference to
  * linux/mmzone.h's __section_mem_map_addr() definition:
  */
-#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
 
 /*
  * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -72,6 +72,19 @@ static inline pte_t native_ptep_get_and_
 #endif
 }
 
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+#ifdef CONFIG_SMP
+	return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+	/* native_local_pmdp_get_and_clear,
+	   but duplicated because of cyclic dependency */
+	pmd_t ret = *xp;
+	native_pmd_clear(NULL, 0, xp);
+	return ret;
+#endif
+}
+
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	*pmdp = pmd;
@@ -181,6 +194,98 @@ static inline int pmd_trans_huge(pmd_t p
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+	pmd_update(mm, addr, pmdp);
+}
+
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -309,6 +309,25 @@ int ptep_set_access_flags(struct vm_area
 	return changed;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+	int changed = !pmd_same(*pmdp, entry);
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	if (changed && dirty) {
+		*pmdp = entry;
+		pmd_update_defer(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+
+	return changed;
+}
+#endif
+
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *ptep)
 {
@@ -324,6 +343,23 @@ int ptep_test_and_clear_young(struct vm_
 	return ret;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp)
+{
+	int ret = 0;
+
+	if (pmd_young(*pmdp))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					 (unsigned long *) &pmdp->pmd);
+
+	if (ret)
+		pmd_update(vma->vm_mm, addr, pmdp);
+
+	return ret;
+}
+#endif
+
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
@@ -336,6 +372,36 @@ int ptep_clear_flush_young(struct vm_are
 	return young;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+
+	return young;
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp)
+{
+	int set;
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
+				(unsigned long *)&pmdp->pmd);
+	if (set) {
+		pmd_update(vma->vm_mm, address, pmdp);
+		/* need tlb flush only to serialize against gup-fast */
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
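
pmdp_freeze_flush itself is not in this hunk; as a rough sketch only (not the
series' actual code), on x86 it could be built like pmdp_splitting_flush
above, clearing the present bit instead of setting the splitting bit:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Sketch, assuming the same structure as pmdp_splitting_flush: atomically
 * clear _PAGE_PRESENT, then synchronously flush the tlb so no CPU keeps a
 * stale huge tlb entry across the freeze. */
void pmdp_freeze_flush(struct vm_area_struct *vma,
		       unsigned long address, pmd_t *pmdp)
{
	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
	if (test_and_clear_bit(_PAGE_BIT_PRESENT,
			       (unsigned long *)&pmdp->pmd)) {
		pmd_update(vma->vm_mm, address, pmdp);
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	}
}
#endif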


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 16 of 41] bail out gup_fast on splitting pmd
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 15 of 41] add pmd mangling functions to x86 Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 17 of 41] pte alloc trans splitting Andrea Arcangeli
                   ` (25 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Force gup_fast to take the slow path and block if the pmd is splitting, not
only if it's none.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -160,7 +160,18 @@ static int gup_pmd_range(pud_t pud, unsi
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		/*
+		 * The pmd_trans_splitting() check below explains why
+		 * pmdp_splitting_flush has to flush the tlb, to stop
+		 * this gup-fast code from running while we set the
+		 * splitting bit in the pmd. Returning zero will take
+		 * the slow path that will call wait_split_huge_page()
+		 * if the pmd is still in splitting state. gup-fast
+		 * can't because it has irq disabled and
+		 * wait_split_huge_page() would never return as the
+		 * tlb flush IPI wouldn't run.
+		 */
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
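
For context, pmd_trans_splitting() comes from the pmd patches earlier in the
series; on x86 it reduces to a flag test along these lines (sketch, assuming
a _PAGE_SPLITTING software flag paired with the _PAGE_BIT_SPLITTING bit used
by pmdp_splitting_flush):

static inline int pmd_trans_splitting(pmd_t pmd)
{
	/* set by pmdp_splitting_flush, gone once the split completes */
	return pmd_val(pmd) & _PAGE_SPLITTING;
}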


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 17 of 41] pte alloc trans splitting
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 16 of 41] bail out gup_fast on splitting pmd Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 18 of 41] add pmd mmu_notifier helpers Andrea Arcangeli
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

pte alloc routines must wait for split_huge_page if the pmd is not
present and not none (i.e. pmd_trans_splitting). The additional
branches are optimized away at compile time by pmd_trans_splitting if
the config option is off. However we must pass the vma down in order
to know which anon_vma lock to wait for.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1067,7 +1067,8 @@ static inline int __pmd_alloc(struct mm_
 int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address);
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 /*
@@ -1136,12 +1137,14 @@ static inline void pgtable_page_dtor(str
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc_map(mm, pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
-		NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address)				\
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma,	\
+							pmd, address))?	\
+	 NULL: pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL,	\
+							pmd, address))?	\
 		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -396,9 +396,11 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
+	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -418,14 +420,18 @@ int __pte_alloc(struct mm_struct *mm, pm
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	wait_split_huge_page = 0;
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	}
+	} else if (unlikely(pmd_trans_splitting(*pmd)))
+		wait_split_huge_page = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
+	if (wait_split_huge_page)
+		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -438,10 +444,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	}
+	} else
+		VM_BUG_ON(pmd_trans_splitting(*pmd));
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3119,7 +3126,7 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
 
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -48,7 +48,8 @@ static pmd_t *get_old_pmd(struct mm_stru
 	return pmd;
 }
 
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -63,7 +64,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	if (!pmd)
 		return NULL;
 
-	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+	if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
 		return NULL;
 
 	return pmd;
@@ -148,7 +149,7 @@ unsigned long move_page_tables(struct vm
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
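
wait_split_huge_page() is defined in the huge_memory patch later in the
series; conceptually it only needs to take and drop the anon_vma lock that
split_huge_page holds for the whole split, roughly (sketch, not the actual
definition):

/* Sketch: split_huge_page holds anon_vma->lock for the duration of the
 * split, so a lock+unlock pair is enough to wait out a split in progress. */
#define wait_split_huge_page(__anon_vma, __pmd)			\
	do {							\
		pmd_t *____pmd = (__pmd);			\
		spin_lock(&(__anon_vma)->lock);			\
		spin_unlock(&(__anon_vma)->lock);		\
		BUG_ON(pmd_trans_splitting(*____pmd) ||		\
		       pmd_trans_huge(*____pmd));		\
	} while (0)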


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 18 of 41] add pmd mmu_notifier helpers
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 17 of 41] pte alloc trans splitting Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 19 of 41] clear page compound Andrea Arcangeli
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add mmu notifier helpers to handle pmd huge operations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
 	__pte;								\
 })
 
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+	__pmd;								\
+})
+
+#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+})
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
 	__young;							\
 })
 
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
 #define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
 ({									\
 	struct mm_struct *___mm = __mm;					\
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
 }
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
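
As a usage illustration (hypothetical caller, not part of this patch): a
huge-page COW can tear down the old mapping with the _notify variant so that
secondary MMUs such as KVM invalidate the whole 2M range around the
clear+flush:

/* Hypothetical helper: clear a huge pmd while keeping mmu notifier
 * users coherent over [haddr, haddr + HPAGE_PMD_SIZE). */
static pmd_t clear_huge_pmd_notify(struct vm_area_struct *vma,
				   unsigned long haddr, pmd_t *pmd)
{
	return pmdp_clear_flush_notify(vma, haddr, pmd);
}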


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 19 of 41] clear page compound
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 18 of 41] add pmd mmu_notifier helpers Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 20 of 41] add pmd_huge_pte to mm_struct Andrea Arcangeli
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page must transform a compound page to a regular page and needs
ClearPageCompound.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -349,7 +349,7 @@ static inline void set_page_writeback(st
  * tests can be used in performance sensitive paths. PageCompound is
  * generally not used in hot code paths.
  */
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
 
 static inline int PageCompound(struct page *page)
@@ -357,6 +357,13 @@ static inline int PageCompound(struct pa
 	return page->flags & ((1L << PG_head) | (1L << PG_tail));
 
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
+}
+#endif
 #else
 /*
  * Reduce page flag use as much as possible by overlapping
@@ -394,6 +401,14 @@ static inline void __ClearPageTail(struc
 	page->flags &= ~PG_head_tail_mask;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
+	clear_bit(PG_compound, &page->flags);
+}
+#endif
+
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_MMU
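
To show where this lands (hypothetical and heavily condensed compared to what
the real __split_huge_page_refcount has to do): the tails are torn down first
and the head is demoted to a regular page only as the last step:

/* Condensed sketch, not the series' actual split code. */
static void demote_compound_page(struct page *head, int nr_pages)
{
	int i;

	for (i = 1; i < nr_pages; i++)
		__ClearPageTail(head + i); /* plus flag/refcount fixups */
	ClearPageCompound(head);
}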


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 20 of 41] add pmd_huge_pte to mm_struct
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 19 of 41] clear page compound Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 21 of 41] split_huge_page_mm/vma Andrea Arcangeli
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

This increases the size of the mm struct a bit, but it is needed to preallocate
one pte for each hugepage so that split_huge_page will not require a fail path.
Guaranteed success is a fundamental property of split_huge_page: it avoids
decreasing swapping reliability and it avoids adding -ENOMEM fail paths that
would otherwise force the hugepage-unaware VM code to learn rolling back in the
middle of its pte mangling operations (if anything, we need it to learn to
handle pmd_trans_huge natively rather than to be capable of rollback). When
split_huge_page runs, a pte is needed for the split to succeed, to map the
newly split regular pages with regular ptes. This way all existing VM code
remains backwards compatible by just adding a split_huge_page* one-liner. The
memory waste of those preallocated ptes is negligible and so it is worth it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,6 +310,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -522,6 +522,9 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON(mm->pmd_huge_pte);
+#endif
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -662,6 +665,10 @@ struct mm_struct *dup_mm(struct task_str
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	mm->pmd_huge_pte = NULL;
+#endif
+
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
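
To make the intended flow concrete (sketch with invented names; the real
series also has to chain more than one preallocated pte page per mm): the pte
page is deposited at hugepage fault time and only withdrawn, never allocated,
at split time, both under mm->page_table_lock:

/* Hypothetical single-slot sketch of the deposit/withdraw pattern. */
static void deposit_huge_pte(struct mm_struct *mm, pgtable_t pgtable)
{
	VM_BUG_ON(mm->pmd_huge_pte);
	mm->pmd_huge_pte = pgtable;	/* stashed for a future split */
}

static pgtable_t withdraw_huge_pte(struct mm_struct *mm)
{
	pgtable_t pgtable = mm->pmd_huge_pte;

	mm->pmd_huge_pte = NULL;	/* split consumes it, no allocation */
	return pgtable;
}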
 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 21 of 41] split_huge_page_mm/vma
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 20 of 41] add pmd_huge_pte to mm_struct Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 22 of 41] split_huge_page paging Andrea Arcangeli
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page_pmd compat code. Each one of these call sites would need to be
expanded into hundreds of lines of complex code without a fully reliable
split_huge_page_pmd design.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -443,6 +443,7 @@ static inline int check_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -154,6 +154,7 @@ static void mincore_pmd_range(struct vm_
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			mincore_unmapped_range(vma, addr, next, vec);
 		else
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -89,6 +89,7 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,6 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(walk->mm, pmd);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
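
For context, the helper being sprinkled into the walkers above only has to
act when the pmd is actually huge; it can be thought of roughly as (sketch,
the real definition lives in the huge_memory patch):

/* Sketch: a no-op on regular pmds, and with CONFIG_TRANSPARENT_HUGEPAGE=n
 * pmd_trans_huge() is constant 0 so this compiles away entirely. */
#define split_huge_page_pmd(__mm, __pmd)			\
	do {							\
		pmd_t *____pmd = (__pmd);			\
		if (unlikely(pmd_trans_huge(*____pmd)))		\
			__split_huge_page_pmd(__mm, ____pmd);	\
	} while (0)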


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 22 of 41] split_huge_page paging
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 21 of 41] split_huge_page_mm/vma Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 23 of 41] clear_copy_huge_page Andrea Arcangeli
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Paging logic that splits the page before it is unmapped and added to swap, to
ensure backwards compatibility with the legacy swap code. Eventually swap
should natively pageout the hugepages to increase performance and decrease
seeking and fragmentation of swap space. swapoff can just skip over huge pmds
as they cannot be part of swap yet. In add_to_swap be careful to split the
page only if we got a valid swap entry, so we don't split hugepages when swap
is full.

In theory we could split pages before isolating them during the lru scan, but
for khugepaged to be safe, I'm relying on either mmap_sem write mode, or
PG_lock taken, so split_huge_page has to run either with mmap_sem read/write
mode or PG_lock taken. Calling it from isolate_lru_page would make locking more
complicated, in addition to that split_huge_page would deadlock if called by
__isolate_lru_page because it has to take the lru lock to add the tail pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -378,6 +378,8 @@ static void collect_procs_anon(struct pa
 	struct task_struct *tsk;
 	struct anon_vma *av;
 
+	if (unlikely(split_huge_page(page)))
+		return;
 	read_lock(&tasklist_lock);
 	av = page_lock_anon_vma(page);
 	if (av == NULL)	/* Not actually mapped anymore */
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1284,6 +1284,7 @@ int try_to_unmap(struct page *page, enum
 	int ret;
 
 	BUG_ON(!PageLocked(page));
+	BUG_ON(PageTransHuge(page));
 
 	if (unlikely(PageKsm(page)))
 		ret = try_to_unmap_ksm(page, flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -156,6 +156,12 @@ int add_to_swap(struct page *page)
 	if (!entry.val)
 		return 0;
 
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page))) {
+			swapcache_free(entry, NULL);
+			return 0;
+		}
+
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -937,6 +937,8 @@ static inline int unuse_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (unlikely(pmd_trans_huge(*pmd)))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
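
The resulting pageout ordering, condensed (hypothetical fragment, not the
actual vmscan code): the split happens inside add_to_swap() while the page is
still locked and mapped, so try_to_unmap() only ever sees regular pages:

/* Hypothetical, heavily condensed shrink_page_list()-style flow. */
static int pageout_one(struct page *page)	/* page is locked */
{
	if (PageAnon(page) && !PageSwapCache(page))
		if (!add_to_swap(page))	/* splits a THP internally */
			return 0;	/* keep the page, swap is full */
	/* by now the page cannot be a transparent hugepage anymore */
	VM_BUG_ON(PageTransHuge(page));
	if (page_mapped(page) &&
	    try_to_unmap(page, TTU_UNMAP) != SWAP_SUCCESS)
		return 0;
	return 1;	/* proceed with writeback/reclaim */
}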


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 23 of 41] clear_copy_huge_page
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 22 of 41] split_huge_page paging Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 24 of 41] kvm mmu transparent hugepage support Andrea Arcangeli
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1507,5 +1507,14 @@ extern int soft_offline_page(struct page
 
 extern void dump_page(struct page *page);
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_huge_page(struct page *page,
+			    unsigned long addr,
+			    unsigned int pages_per_huge_page);
+extern void copy_huge_page(struct page *dst, struct page *src,
+			   unsigned long addr, struct vm_area_struct *vma,
+			   unsigned int pages_per_huge_page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -385,70 +385,6 @@ static int vma_has_reserves(struct vm_ar
 	return 0;
 }
 
-static void clear_gigantic_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-	struct page *p = page;
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) {
-		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
-	}
-}
-static void clear_huge_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-
-	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, sz);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++) {
-		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
-	}
-}
-
-static void copy_gigantic_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); ) {
-		cond_resched();
-		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-static void copy_huge_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-
-	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		cond_resched();
-		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
-	}
-}
-
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2333,7 +2269,8 @@ retry_avoidcopy:
 		return -PTR_ERR(new_page);
 	}
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_huge_page(new_page, old_page, address, vma,
+		       pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
 	/*
@@ -2429,7 +2366,7 @@ retry:
 			ret = -PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, address, huge_page_size(h));
+		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3495,3 +3495,73 @@ void might_fault(void)
 }
 EXPORT_SYMBOL(might_fault);
 #endif
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+static void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *p = page;
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page;
+	     i++, p = mem_map_next(p, page, i)) {
+		cond_resched();
+		clear_user_highpage(p, addr + i * PAGE_SIZE);
+	}
+}
+void clear_huge_page(struct page *page,
+		     unsigned long addr, unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		clear_gigantic_page(page, addr, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+	}
+}
+
+static void copy_gigantic_page(struct page *dst, struct page *src,
+			       unsigned long addr,
+			       struct vm_area_struct *vma,
+			       unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; ) {
+		cond_resched();
+		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+void copy_huge_page(struct page *dst, struct page *src,
+		    unsigned long addr, struct vm_area_struct *vma,
+		    unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src, addr, vma, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE,
+				   vma);
+	}
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 24 of 41] kvm mmu transparent hugepage support
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 23 of 41] clear_copy_huge_page Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 25 of 41] _GFP_NO_KSWAPD Andrea Arcangeli
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Marcelo Tosatti <mtosatti@redhat.com>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -470,6 +470,15 @@ static int host_mapping_level(struct kvm
 
 	page_size = kvm_host_page_size(kvm, gfn);
 
+	/* check for transparent hugepages */
+	if (page_size == PAGE_SIZE) {
+		struct page *page = gfn_to_page(kvm, gfn);
+
+		if (!is_error_page(page) && PageTransCompound(page))
+			page_size = KVM_HPAGE_SIZE(2);
+		kvm_release_page_clean(page);
+	}
+
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
 		if (page_size >= KVM_HPAGE_SIZE(i))


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 25 of 41] _GFP_NO_KSWAPD
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 24 of 41] kvm mmu transparent hugepage support Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 26 of 41] don't alloc harder for gfp nomemalloc even if nowait Andrea Arcangeli
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Transparent hugepage allocations must be allowed not to invoke kswapd or any
other kind of indirect reclaim (especially when the defrag sysfs control is
disabled). It's unacceptable to swap out anonymous pages (potentially
anonymous transparent hugepages) in order to create new transparent hugepages.
This is true for the MADV_HUGEPAGE areas too (it makes no sense to swap out a
kvm virtual machine and have it suffer an unbearable slowdown just so another
one, with guest physical memory marked MADV_HUGEPAGE, can run 30% faster while
running memory intensive workloads). If a transparent hugepage allocation
fails, the slowdown is minor and there is total fallback, so kswapd should
never be asked to swapout memory to allow the high order allocation to
succeed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -60,13 +60,15 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#define __GFP_NO_KSWAPD	((__force gfp_t)0x400000u)
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1867,7 +1867,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx);
+	if (!(gfp_mask & __GFP_NO_KSWAPD))
+		wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
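
A hypothetical allocation site using the new bit (the series assembles the
real THP gfp mask later, in the transparent hugepage core patch; the helper
name and exact flag combination here are illustrative):

/* Hypothetical: a failed THP allocation just falls back to 4k pages,
 * so kswapd must not be woken on behalf of this attempt. */
static struct page *alloc_hugepage_sketch(void)
{
	return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
			   __GFP_NOMEMALLOC | __GFP_NORETRY |
			   __GFP_NOWARN | __GFP_NO_KSWAPD,
			   HPAGE_PMD_SHIFT - PAGE_SHIFT);
}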


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 26 of 41] don't alloc harder for gfp nomemalloc even if nowait
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 25 of 41] _GFP_NO_KSWAPD Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 27 of 41] transparent hugepage core Andrea Arcangeli
                   ` (15 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

It's not worth throwing away the precious reserved free memory pool for
allocations that can fail gracefully (either through mempool or because
they're transhuge allocations that later fall back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1811,7 +1811,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	/*
+	 * Not worth trying to allocate harder for __GFP_NOMEMALLOC
+	 * even if it can't schedule.
+	 */
+	if (!wait && !(gfp_mask & __GFP_NOMEMALLOC)) {
 		alloc_flags |= ALLOC_HARDER;
 		/*
 		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 27 of 41] transparent hugepage core
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 26 of 41] don't alloc harder for gfp nomemalloc even if nowait Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 28 of 41] verify pmd_trans_huge isn't leaking Andrea Arcangeli
                   ` (14 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated on hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
   kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
   not null)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible with a virtualization
   hypervisor usage like RHEV-H that runs an unknown number of virtual machines
   with an unknown size of each virtual machine with an unknown amount of
   pagecache that could be potentially useful in the host for guest not using
   O_DIRECT (aka cache=off).

Hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...). Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).

The most important design choice is: always fallback to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...

Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page could return -ENOMEM (instead of the current void)
we'd need to rollback the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental and it'll just be a "harmless" addition later if this
initial part is agreed upon. It also should be noted that locking-wise
replacing regular pages with hugepages is going to be very easy if
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) if we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similar to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never" which
mean respectively that hugepages are always used, or only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".
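
From userland, opting a region in under the "madvise" policies looks like
this (illustrative snippet using the MADV_HUGEPAGE hint this series defines):

#include <sys/mman.h>

/* Illustrative: back an anonymous region with transparent hugepages
 * when the global policy is "madvise". The hint is advisory only. */
static void *map_hugepage_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		madvise(p, len, MADV_HUGEPAGE);
	return p;
}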

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is no current existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount so I had to
invent a new one (O_DIRECT loses knowledge on the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now, maybe if we change the gup interface
substantially we can avoid this locking, I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, that didn't have a range check but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down and
waste memory during COWs, because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch the copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it: they add huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, but they still don't substantially
improve the cache thrashing during startup if the prefault happens in >4k
chunks. One reason is that those 4k pte entries copied are still mapped on a
perfectly cache-colored hugepage, so the thrashing is the worst one can
generate in those copies (cows of 4k pages aren't so well colored so they
thrash less, but again this results in software running faster after the
page fault). Those prefault patches allowed things like a pte where post-cow
pages were local 4k regular anon pages and the not-yet-cowed pte entries were
pointing in the middle of some hugepage mapped read-only. If it doesn't pay
off substantially with today's hardware it will pay off even less in the
future with larger l2 caches, and the prefault logic would bloat the VM a
lot. On embedded systems, transparent_hugepage can be disabled during boot
with sysfs or with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages to madvise regions); that will
ensure not a single hugepage is allocated at boot time. It is simple enough
to just disable transparent hugepages globally and let transparent hugepages
be allocated selectively by applications in the MADV_HUGEPAGE regions (both
at page fault time, and if enabled, via collapse_huge_page too through the
kernel daemon).

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
regions, so they will not fit in this framework, which requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot be expected to be found unfragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so requiring reservation. hugetlbfs is the "reservation way"; the
point of transparent hugepages is not to have any reservation at all and to
maximize the use of cache and hugepages at all times automatically.

Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	if (!p)
		return 1;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============
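
(For reference the program above is a plain C file, built with e.g.
"gcc -O2 -o largepages3 largepages3.c"; the libhugetlbfs runs additionally
assume a hugetlbfs mount on /mnt/huge as shown in the command lines.)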

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
* * *
adapt to mm_counter in -mm

From: Andrea Arcangeli <aarcange@redhat.com>

The interface changed slightly.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -87,6 +87,9 @@ struct vm_area_struct;
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
+#define GFP_TRANSHUGE	(__GFP_HARDWALL | __GFP_HIGHMEM |		\
+			 __GFP_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | \
+			 __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,126 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      unsigned long address, pmd_t *pmd,
+				      unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#define transparent_hugepage_enabled(__vma)				\
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma)				\
+	((transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow()				\
+	(transparent_hugepage_flags &					\
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern int split_huge_page(struct page *page);
+extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
+#define split_huge_page_pmd(__mm, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		if (unlikely(pmd_trans_huge(*____pmd)))			\
+			__split_huge_page_pmd(__mm, ____pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		spin_unlock_wait(&(__anon_vma)->lock);			\
+		/*							\
+		 * spin_unlock_wait() is just a loop in C and so the	\
+		 * CPU can reorder anything around it.			\
+		 */							\
+		smp_mb();						\
+		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
+		       pmd_trans_huge(*____pmd));			\
+	} while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
+#define HPAGE_PMD_MASK ({ BUG(); 0; })
+#define HPAGE_PMD_SIZE ({ BUG(); 0; })
+
+#define transparent_hugepage_enabled(__vma) 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_pmd(__mm, __pmd)	\
+	do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -107,6 +107,9 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
+#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -235,6 +238,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
 }
 
 static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+		       struct list_head *head)
+{
+	list_add(&page->lru, head);
+	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->lru[l].list);
-	__inc_zone_state(zone, NR_LRU_BASE + l);
-	mem_cgroup_add_lru_list(page, l);
+	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
 }
 
 static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,6 +205,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,867 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag enabled,
+				enum transparent_hugepage_flag req_madv)
+{
+	if (test_bit(enabled, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(req_madv, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag enabled,
+				 enum transparent_hugepage_flag req_madv)
+{
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		set_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+static ssize_t single_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag flag)
+{
+	if (test_bit(flag, &transparent_hugepage_flags))
+		return sprintf(buf, "[yes] no\n");
+	else
+		return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag flag)
+{
+	if (!memcmp("yes", buf,
+		    min(sizeof("yes")-1, count))) {
+		set_bit(flag, &transparent_hugepage_flags);
+	} else if (!memcmp("no", buf,
+			   min(sizeof("no")-1, count))) {
+		clear_bit(flag, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Currently defrag simply enables __GFP_WAIT for the allocation. A blind
+ * __GFP_REPEAT would be too aggressive: it's never worth swapping tons of
+ * memory just to allocate one more hugepage.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+			   struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+	__ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t debug_cow_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+	&enabled_attr.attr,
+	&defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&debug_cow_attr.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+	.attrs = hugepage_attr,
+	.name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init hugepage_init(void)
+{
+#ifdef CONFIG_SYSFS
+	int err;
+
+	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	if (err)
+		printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+	return 0;
+}
+module_init(hugepage_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+	if (!str)
+		return 0;
+	transparent_hugepage_flags = simple_strtoul(str, &str, 0);
+	if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags) &&
+	    test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+		     &transparent_hugepage_flags)) {
+		printk(KERN_WARNING
+		       "transparent_hugepage=%lu invalid parameter, disabling",
+		       transparent_hugepage_flags);
+		transparent_hugepage_flags = 0;
+	}
+	return 1;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+static int __init setup_no_transparent_hugepage(char *str)
+{
+	transparent_hugepage_flags = 0;
+	return 1;
+}
+__setup("no_transparent_hugepage", setup_no_transparent_hugepage);
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long haddr, pmd_t *pmd,
+					struct page *page)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, haddr);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	__SetPageUptodate(page);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		/*
+		 * The spinlocking to take the lru_lock inside
+		 * page_add_new_anon_rmap() acts as a full memory
+		 * barrier to be sure clear_huge_page writes become
+		 * visible after the set_pmd_at() write.
+		 */
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+	return alloc_pages(GFP_TRANSHUGE | (defrag ? __GFP_WAIT : 0),
+			   HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	}
+out:
+	pte = pte_alloc_map(mm, vma, pmd, address);
+	if (!pte)
+		return VM_FAULT_OOM;
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd)))
+		goto out_unlock;
+	if (unlikely(pmd_trans_splitting(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_page(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL;
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address,
+					pmd_t *pmd, pmd_t orig_pmd,
+					struct page *page,
+					unsigned long haddr)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int ret = 0, i;
+	struct page **pages;
+
+	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+			GFP_KERNEL);
+	if (unlikely(!pages)) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+					  vma, address);
+		if (unlikely(!pages[i])) {
+			while (--i >= 0)
+				put_page(pages[i]);
+			kfree(pages);
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	else
+		get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		copy_user_highpage(pages[i], page + i,
+				   haddr + PAGE_SHIFT*i, vma);
+		__SetPageUptodate(pages[i]);
+		cond_resched();
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	else
+		put_page(page);
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = mk_pte(pages[i], vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_add_new_anon_rmap(pages[i], vma, haddr);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	kfree(pages);
+
+	mm->nr_ptes++;
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	page_remove_rmap(page);
+	spin_unlock(&mm->page_table_lock);
+
+	ret |= VM_FAULT_WRITE;
+	put_page(page);
+
+out:
+	return ret;
+
+out_free_pages:
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0;
+	struct page *page, *new_page;
+	unsigned long haddr;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_page(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_PMD_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (transparent_hugepage_enabled(vma) &&
+	    !transparent_hugepage_debug_cow())
+		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+	else
+		new_page = NULL;
+
+	if (unlikely(!new_page)) {
+		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+						   pmd, orig_pmd, page, haddr);
+		goto out;
+	}
+
+	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+	__SetPageUptodate(new_page);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * If the dirty bit ever becomes meaningful and we
+		 * only set it with FOLL_WRITE, an atomic set_bit
+		 * will be required on the pmd to set the young bit,
+		 * instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pmd_page(*pmd);
+			pmd_clear(pmd);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			spin_unlock(&tlb->mm->page_table_lock);
+			VM_BUG_ON(!PageHead(page));
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+			      struct mm_struct *mm,
+			      unsigned long address,
+			      enum page_check_address_pmd_flag flag)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	if (address & ~HPAGE_PMD_MASK)
+		goto out;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_page(*pmd) != page)
+		goto out;
+	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+		  pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd)) {
+		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+			  !pmd_trans_splitting(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+				       struct vm_area_struct *vma,
+				       unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to split it, the pmd must remain marked huge at all
+		 * times or the VM won't take the pmd_trans_huge paths
+		 * and it won't wait on the anon_vma->lock to
+		 * serialize against split_huge_page*.
+		 */
+		pmdp_splitting_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+	struct zone *zone = page_zone(page);
+
+	/* prevent PageLRU from going away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	compound_lock(page);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		/*
+		 * __split_huge_page_splitting() already set the
+		 * splitting bit in all pmd that could map this
+		 * hugepage, that will ensure no CPU can alter the
+		 * mapcount on the head page. The mapcount is only
+		 * accounted in the head page and it has to be
+		 * transferred to all tail pages in the code below. So
+		 * for this code to be safe, the mapcount can't change
+		 * during the split. But that doesn't mean userland
+		 * can't keep changing and reading the page contents
+		 * while we transfer the mapcount, so the pmd splitting
+		 * status is achieved by setting a reserved bit in the
+		 * pmd, not by clearing the present bit.
+		 */
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+
+		page_tail->index = ++head_index;
+
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		lru_add_page_tail(zone, page, page_tail);
+
+		put_page(page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+	spin_unlock_irq(&zone->lru_lock);
+
+	BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			BUG_ON(PageCompound(page+i));
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		/*
+		 * Up to this point the pmd is present and huge and
+		 * userland has full access to the hugepage
+		 * during the split (which happens in place). If we
+		 * overwrite the pmd with the not-huge version
+		 * pointing to the pte here (which of course we could
+		 * if all CPUs were bug free), userland could trigger
+		 * a small page size TLB miss on the small sized TLB
+		 * while the hugepage TLB entry is still established
+		 * in the huge TLB. Some CPUs don't like that. See
+		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
+		 * Erratum 383 on page 93. Intel should be safe but it
+		 * also warns that it's only safe if the permission
+		 * and cache attributes of the two entries loaded in
+		 * the two TLBs are identical (which should be the case
+		 * here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual
+		 * address to be loaded simultaneously. So instead of
+		 * doing "pmd_populate(); flush_tlb_range();" we first
+		 * mark the current pmd notpresent (atomically because
+		 * here the pmd_trans_huge and pmd_trans_splitting
+		 * must remain set at all times on the pmd until the
+		 * split is complete for this pmd), then we flush the
+		 * SMP TLB and finally we write the non-huge version
+		 * of the pmd entry with pmd_populate.
+		 */
+		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmd_populate(mm, pmd, pgtable);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+/* must be called with anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct anon_vma_chain *avc;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mapcount = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+	BUG_ON(mapcount != mapcount2);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
+
+void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		return;
+	}
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!page_count(page));
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	/*
+	 * The vma->anon_vma->lock is the wrong lock if the page is shared,
+	 * the anon_vma->lock pointed by page->mapping is the right one.
+	 */
+	split_huge_page(page);
+
+	put_page(page);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -728,9 +728,9 @@ out_set_pte:
 	return 0;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -804,6 +804,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1006,6 +1016,15 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			if (next-addr != HPAGE_PMD_SIZE)
+				split_huge_page_pmd(vma->vm_mm, pmd);
+			else if (zap_huge_pmd(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1273,11 +1292,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -3045,9 +3080,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3126,6 +3161,22 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
+	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		if (!vma->vm_ops)
+			return do_huge_pmd_anonymous_page(mm, vma, address,
+							  pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_splitting(orig_pmd))
+				return do_huge_pmd_wp_page(mm, vma, address,
+							   pmd, orig_pmd);
+			return 0;
+		}
+	}
 	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -318,7 +319,7 @@ void page_unlock_anon_vma(struct anon_vm
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -432,35 +433,17 @@ int page_referenced_one(struct page *pag
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -469,9 +452,39 @@ int page_referenced_one(struct page *pag
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageTransHuge(page))) {
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+		if (pmd && !pmd_trans_splitting(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -461,6 +461,43 @@ void __pagevec_release(struct pagevec *p
 
 EXPORT_SYMBOL(__pagevec_release);
 
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+		       struct page *page, struct page *page_tail)
+{
+	int active;
+	enum lru_list lru;
+	const int file = 0;
+	struct list_head *head;
+
+	VM_BUG_ON(!PageHead(page));
+	VM_BUG_ON(PageCompound(page_tail));
+	VM_BUG_ON(PageLRU(page_tail));
+	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+	SetPageLRU(page_tail);
+
+	if (page_evictable(page_tail, NULL)) {
+		if (PageActive(page)) {
+			SetPageActive(page_tail);
+			active = 1;
+			lru = LRU_ACTIVE_ANON;
+		} else {
+			active = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
+		update_page_reclaim_stat(zone, page_tail, file, active);
+		if (likely(PageLRU(page)))
+			head = page->lru.prev;
+		else
+			head = &zone->lru[lru].list;
+		__add_page_to_lru_list(zone, page_tail, lru, head);
+	} else {
+		SetPageUnevictable(page_tail);
+		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+	}
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 28 of 41] verify pmd_trans_huge isn't leaking
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 27 of 41] transparent hugepage core Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 29 of 41] madvise(MADV_HUGEPAGE) Andrea Arcangeli
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

pmd_trans_huge must not leak into certain vmas, like the mmio special pfn
mappings or file-backed mappings.
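
Condensed, the invariant asserted by the hunks below at every pte-level walk
is (a sketch for illustration; debug builds only):

	pmd = pmd_offset(pud, address);
	/* any huge pmd must have been split before walking at pte level */
	VM_BUG_ON(pmd_trans_huge(*pmd));
	pte = pte_offset_map(pmd, address);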

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1421,6 +1421,7 @@ int __get_user_pages(struct task_struct 
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
@@ -1622,8 +1623,10 @@ pte_t *get_locked_pte(struct mm_struct *
 	pud_t * pud = pud_alloc(mm, pgd, addr);
 	if (pud) {
 		pmd_t * pmd = pmd_alloc(mm, pud, addr);
-		if (pmd)
+		if (pmd) {
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			return pte_alloc_map_lock(mm, pmd, addr, ptl);
+		}
 	}
 	return NULL;
 }
@@ -1842,6 +1845,7 @@ static inline int remap_pmd_range(struct
 	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		return -ENOMEM;
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
 		if (remap_pte_range(mm, pmd, addr, next,
@@ -3317,6 +3321,7 @@ static int follow_pte(struct mm_struct *
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 29 of 41] madvise(MADV_HUGEPAGE)
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 28 of 41] verify pmd_trans_huge isn't leaking Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 30 of 41] pmd_trans_huge migrate bugcheck Andrea Arcangeli
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add madvise MADV_HUGEPAGE to mark regions for which hugepage backing is
important. Return -EINVAL if the vma is not anonymous, or if the feature
isn't built into the kernel. Never silently return success.
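
A minimal userland sketch of the resulting semantics (illustration only, not
part of the patch; it assumes a kernel with the feature built in and the
MADV_HUGEPAGE definition from patch 01 of this series):

============
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value, see patch 01 of this series */
#endif

int main(void)
{
	size_t len = 4UL*1024*1024;
	int fd = open("/tmp/madv_test", O_RDWR|O_CREAT, 0600);
	char *anon, *file;

	/* error handling omitted for brevity */
	anon = mmap(NULL, len, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	/* anonymous vma: expected to succeed (prints 0) */
	printf("anon: %d\n", madvise(anon, len, MADV_HUGEPAGE));

	ftruncate(fd, len);
	file = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	/* shared file-backed vma: expected to fail with EINVAL */
	printf("file: %d\n", madvise(file, len, MADV_HUGEPAGE));
	return 0;
}
============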

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -99,6 +99,7 @@ extern void __split_huge_page_pmd(struct
 #endif
 
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+extern int hugepage_madvise(unsigned long *vm_flags);
 static inline int PageTransHuge(struct page *page)
 {
 	VM_BUG_ON(PageTail(page));
@@ -121,6 +122,11 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
+static inline int hugepage_madvise(unsigned long *vm_flags)
+{
+	BUG();
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -842,6 +842,22 @@ out:
 	return ret;
 }
 
+int hugepage_madvise(unsigned long *vm_flags)
+{
+	/*
+	 * Be somewhat over-protective like KSM for now!
+	 */
+	if (*vm_flags & (VM_HUGEPAGE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP   | VM_IO      | VM_DONTEXPAND |
+			 VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
+			 VM_MIXEDMAP | VM_SAO))
+		return -EINVAL;
+
+	*vm_flags |= VM_HUGEPAGE;
+
+	return 0;
+}
+
 void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 {
 	struct page *page;
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
 		if (error)
 			goto out;
 		break;
+	case MADV_HUGEPAGE:
+		error = hugepage_madvise(&new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+#endif
 		return 1;
 
 	default:


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 30 of 41] pmd_trans_huge migrate bugcheck
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 29 of 41] madvise(MADV_HUGEPAGE) Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 31 of 41] memcg compound Andrea Arcangeli
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in areas covered by migration
ptes, because we split the hugepage before migration ptes are instantiated.
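
(Concretely: do_move_page_to_node_array now splits any compound page before
collecting it for migration, so by the time remove_migration_pte walks the
page tables a huge pmd can only indicate a bug, hence the VM_BUG_ON added
there.)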

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -105,6 +105,10 @@ static inline int PageTransHuge(struct p
 	VM_BUG_ON(PageTail(page));
 	return PageHead(page);
 }
+static inline int PageTransCompound(struct page *page)
+{
+	return PageCompound(page);
+}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUG(); 0; })
@@ -122,6 +126,7 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
+#define PageTransCompound(page) 0
 static inline int hugepage_madvise(unsigned long *vm_flags)
 {
 	BUG();
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -94,6 +94,7 @@ static int remove_migration_pte(struct p
 		goto out;
 
 	pmd = pmd_offset(pud, addr);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (!pmd_present(*pmd))
 		goto out;
 
@@ -810,6 +811,10 @@ static int do_move_page_to_node_array(st
 		if (PageReserved(page) || PageKsm(page))
 			goto put_and_set;
 
+		if (unlikely(PageTransCompound(page)))
+			if (unlikely(split_huge_page(page)))
+				goto put_and_set;
+
 		pp->page = page;
 		err = page_to_nid(page);
 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 31 of 41] memcg compound
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (29 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 30 of 41] pmd_trans_huge migrate bugcheck Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:41 ` [PATCH 32 of 41] memcg huge memory Andrea Arcangeli
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Teach memcg to charge/uncharge compound pages.
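
(Worked example of the sizing below: a pmd-sized hugepage on x86-64 has
compound_order 9, so "page_size <<= compound_order(page)" yields
4096 << 9 = 2M, charged and uncharged as a single unit instead of per 4k
page.)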

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -4,6 +4,10 @@ NOTE: The Memory Resource Controller has
 to as the memory controller in this document. Do not confuse memory controller
 used here with the memory controller that is used in hardware.
 
+NOTE: When this documentation refers to PAGE_SIZE, it actually means
+the real page size of the page being accounted, which is bigger than
+PAGE_SIZE for compound pages.
+
 Salient features
 
 a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1577,12 +1577,14 @@ static int __cpuinit memcg_stock_cpu_cal
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
+				   gfp_t gfp_mask,
+				   struct mem_cgroup **memcg, bool oom,
+				   int page_size)
 {
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct res_counter *fail_res;
-	int csize = CHARGE_SIZE;
+	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
 
 	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
@@ -1617,8 +1619,9 @@ static int __mem_cgroup_try_charge(struc
 		int ret = 0;
 		unsigned long flags = 0;
 
-		if (consume_stock(mem))
-			goto done;
+		if (page_size == PAGE_SIZE)
+			if (consume_stock(mem))
+				goto done;
 
 		ret = res_counter_charge(&mem->res, csize, &fail_res);
 		if (likely(!ret)) {
@@ -1638,8 +1641,8 @@ static int __mem_cgroup_try_charge(struc
 									res);
 
 		/* reduce request size and retry */
-		if (csize > PAGE_SIZE) {
-			csize = PAGE_SIZE;
+		if (csize > page_size) {
+			csize = page_size;
 			continue;
 		}
 		if (!(gfp_mask & __GFP_WAIT))
@@ -1715,8 +1718,10 @@ static int __mem_cgroup_try_charge(struc
 			goto bypass;
 		}
 	}
-	if (csize > PAGE_SIZE)
-		refill_stock(mem, csize - PAGE_SIZE);
+	if (csize > page_size)
+		refill_stock(mem, csize - page_size);
+	if (page_size != PAGE_SIZE)
+		__css_get(&mem->css, page_size);
 done:
 	return 0;
 nomem:
@@ -1746,9 +1751,10 @@ static void __mem_cgroup_cancel_charge(s
 	/* we don't need css_put for root */
 }
 
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
+				     int page_size)
 {
-	__mem_cgroup_cancel_charge(mem, 1);
+	__mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT);
 }
 
 /*
@@ -1804,8 +1810,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  */
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				     struct page_cgroup *pc,
-				     enum charge_type ctype)
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
@@ -1814,7 +1821,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem);
+		mem_cgroup_cancel_charge(mem, page_size);
 		return;
 	}
 
@@ -1891,7 +1898,7 @@ static void __mem_cgroup_move_account(st
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
-		mem_cgroup_cancel_charge(from);
+		mem_cgroup_cancel_charge(from, PAGE_SIZE);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
@@ -1952,13 +1959,14 @@ static int mem_cgroup_move_parent(struct
 		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
+				      PAGE_SIZE);
 	if (ret || !parent)
 		goto put_back;
 
 	ret = mem_cgroup_move_account(pc, child, parent, true);
 	if (ret)
-		mem_cgroup_cancel_charge(parent);
+		mem_cgroup_cancel_charge(parent, PAGE_SIZE);
 put_back:
 	putback_lru_page(page);
 put:
@@ -1980,6 +1988,10 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 	int ret;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -1988,11 +2000,11 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
 	return 0;
 }
 
@@ -2001,8 +2013,6 @@ int mem_cgroup_newpage_charge(struct pag
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -2015,7 +2025,7 @@ int mem_cgroup_newpage_charge(struct pag
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+					MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
 static void
@@ -2108,14 +2118,14 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
 	/* drop extra refcnt from tryget */
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
 }
 
 static void
@@ -2131,7 +2141,7 @@ __mem_cgroup_commit_charge_swapin(struct
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, ctype);
+	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -2180,11 +2190,12 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem);
+	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
 }
 
 static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+	      int page_size)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
@@ -2219,14 +2230,14 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += PAGE_SIZE;
+	batch->bytes += page_size;
 	if (uncharge_memsw)
-		batch->memsw_bytes += PAGE_SIZE;
+		batch->memsw_bytes += page_size;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, page_size);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, page_size);
 	if (unlikely(batch->memcg != mem))
 		memcg_oom_recover(mem);
 	return;
@@ -2241,6 +2252,10 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2280,7 +2295,7 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype);
+		__do_uncharge(mem, ctype, page_size);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
 	mem_cgroup_charge_statistics(mem, pc, false);
@@ -2506,7 +2521,8 @@ int mem_cgroup_prepare_migration(struct 
 	unlock_page_cgroup(pc);
 
 	if (mem) {
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
+					      PAGE_SIZE);
 		css_put(&mem->css);
 	}
 	*ptr = mem;
@@ -2549,7 +2565,7 @@ void mem_cgroup_end_migration(struct mem
 	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
 	 * So, double-counting is effectively avoided.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
 
 	/*
 	 * Both of oldpage and newpage are still under lock_page().
@@ -4144,7 +4160,8 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
+					      PAGE_SIZE);
 		if (ret || !mem)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return -ENOMEM;
@@ -4259,6 +4276,7 @@ static int mem_cgroup_count_precharge_pt
 	pte_t *pte;
 	spinlock_t *ptl;
 
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
@@ -4407,6 +4425,7 @@ static int mem_cgroup_move_charge_pte_ra
 	spinlock_t *ptl;
 
 retry:
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 32 of 41] memcg huge memory
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (30 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 31 of 41] memcg compound Andrea Arcangeli
@ 2010-04-02  0:41 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 33 of 41] transparent hugepage vmstat Andrea Arcangeli
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:41 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add memcg charge/uncharge to hugepage faults in huge_memory.c.
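
The pattern below is uniform across the fault paths: charge the memcg right
after allocating the hugepage, and pair every later failure path with an
uncharge before the page reference is dropped. A condensed sketch of that
pattern (illustrative only, not the exact fault code from the diff):

	page = alloc_hugepage(transparent_hugepage_defrag(vma));
	if (unlikely(!page))
		goto out;
	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
		put_page(page);
		goto out;
	}
	/* ... any failure after this point must revert the charge ... */
	mem_cgroup_uncharge_page(page);
	put_page(page);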

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -225,6 +225,7 @@ static int __do_huge_pmd_anonymous_page(
 	VM_BUG_ON(!PageCompound(page));
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		return VM_FAULT_OOM;
 	}
@@ -235,6 +236,7 @@ static int __do_huge_pmd_anonymous_page(
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(&mm->page_table_lock);
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -278,6 +280,10 @@ int do_huge_pmd_anonymous_page(struct mm
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
+		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+			put_page(page);
+			goto out;
+		}
 
 		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
 	}
@@ -377,9 +383,15 @@ static int do_huge_pmd_wp_page_fallback(
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 					  vma, address);
-		if (unlikely(!pages[i])) {
-			while (--i >= 0)
+		if (unlikely(!pages[i] ||
+			     mem_cgroup_newpage_charge(pages[i], mm,
+						       GFP_KERNEL))) {
+			if (pages[i])
 				put_page(pages[i]);
+			while (--i >= 0) {
+				mem_cgroup_uncharge_page(pages[i]);
+				put_page(pages[i]);
+			}
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
@@ -438,8 +450,10 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mem_cgroup_uncharge_page(pages[i]);
 		put_page(pages[i]);
+	}
 	kfree(pages);
 	goto out;
 }
@@ -482,13 +496,19 @@ int do_huge_pmd_wp_page(struct mm_struct
 		goto out;
 	}
 
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		put_page(new_page);
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
 	copy_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-	else {
+	} else {
 		pmd_t entry;
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 33 of 41] transparent hugepage vmstat
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (31 preceding siblings ...)
  2010-04-02  0:41 ` [PATCH 32 of 41] memcg huge memory Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 34 of 41] khugepaged Andrea Arcangeli
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add hugepage stat information to /proc/vmstat and /proc/meminfo.
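
Note that the new zone counter counts whole hugepages, not 4k subpages, so
meminfo multiplies by HPAGE_PMD_NR before converting to kB (K() is the usual
pages-to-kB macro in meminfo.c). A minimal sketch of the arithmetic:

	/* NR_ANON_TRANSPARENT_HUGEPAGES counts compound heads */
	unsigned long anon_huge_kb =
		K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
		  HPAGE_PMD_NR);

On split_huge_page the head is removed from the hugepage counter and
HPAGE_PMD_NR pages are added to NR_ANON_PAGES, so the two stats stay
consistent.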

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		"HardwareCorrupted: %5lu kB\n"
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		"AnonHugePages:  %8lu kB\n"
+#endif
 		,
 		K(i.totalram),
 		K(i.freeram),
@@ -151,6 +154,10 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,7 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -726,6 +726,9 @@ static void __split_huge_page_refcount(s
 		put_page(page_tail);
 	}
 
+	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+
 	ClearPageCompound(page);
 	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -795,8 +795,13 @@ void page_add_anon_rmap(struct page *pag
 	struct vm_area_struct *vma, unsigned long address)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
-	if (first)
-		__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (first) {
+		if (!PageTransHuge(page))
+			__inc_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__inc_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 	if (unlikely(PageKsm(page)))
 		return;
 
@@ -824,7 +829,10 @@ void page_add_new_anon_rmap(struct page 
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (!PageTransHuge(page))
+		__inc_zone_page_state(page, NR_ANON_PAGES);
+	else
+		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__page_set_anon_rmap(page, vma, address);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -871,7 +879,11 @@ void page_remove_rmap(struct page *page)
 	}
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
-		__dec_zone_page_state(page, NR_ANON_PAGES);
+		if (!PageTransHuge(page))
+			__dec_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__dec_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -657,6 +657,9 @@ static const char * const vmstat_text[] 
 	"numa_local",
 	"numa_other",
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"nr_anon_transparent_hugepages",
+#endif
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	"pgpgin",

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 34 of 41] khugepaged
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (32 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 33 of 41] transparent hugepage vmstat Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 35 of 41] skip transhuge pages in ksm for now Andrea Arcangeli
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Add khugepaged to relocate fragmented pages into hugepages if new hugepages
become available. (This is independent of the defrag logic, which will have to
make new hugepages available.)

The fundamental reason why khugepaged is unavoidable is that some
memory can be fragmented and not everything can be relocated. So when
a virtual machine quits and releases gigabytes of hugepages, we want
to use those freely available hugepages to create huge-pmds in the
other virtual machines that may be running on fragmented memory, to
maximize CPU efficiency at all times. The scan is slow but takes
nearly zero cpu time, except when it copies data (in which case we
definitely want to pay for that cpu time), so it seems a good
tradeoff.

In addition to reusing the hugepages released by other processes freeing
memory, we have the strong suspicion that the performance impact of
potentially defragmenting hugepages during or before each page fault could
lead to more performance inconsistency than allocating small pages at first
and having them collapsed into large pages later... if they prove themselves
to be long lived mappings (the khugepaged scan is slow, so short lived
mappings have a low probability of running into khugepaged compared to long
lived mappings).
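
For orientation, the daemon's control flow in the diff reduces to: allocate a
hugepage (honoring the khugepaged defrag flag), scan up to pages_to_scan ptes
across the registered mm list collapsing where possible, then sleep. A
condensed sketch (the has-work/wait distinction and error handling omitted):

	while (khugepaged_enabled()) {
		hpage = khugepaged_alloc_hugepage();
		if (unlikely(!hpage))
			break;
		khugepaged_do_scan(&hpage);	/* may consume hpage */
		if (hpage)
			put_page(hpage);	/* unused this round */
		schedule_timeout_interruptible(
			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
	}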

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -23,8 +23,11 @@ extern int zap_huge_pmd(struct mmu_gathe
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
new file mode 100644
--- /dev/null
+++ b/include/linux/khugepaged.h
@@ -0,0 +1,66 @@
+#ifndef _LINUX_KHUGEPAGED_H
+#define _LINUX_KHUGEPAGED_H
+
+#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_exit(struct mm_struct *mm);
+extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);
+
+#define khugepaged_enabled()					       \
+	(transparent_hugepage_flags &				       \
+	 ((1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG) |		       \
+	  (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG)))
+#define khugepaged_always()				\
+	(transparent_hugepage_flags &			\
+	 (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG))
+#define khugepaged_req_madv()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG))
+#define khugepaged_defrag()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
+
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+		return __khugepaged_enter(mm);
+	return 0;
+}
+
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+		__khugepaged_exit(mm);
+}
+
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
+		if (khugepaged_always() ||
+		    (khugepaged_req_madv() &&
+		     vma->vm_flags & VM_HUGEPAGE))
+			if (__khugepaged_enter(vma->vm_mm))
+				return -ENOMEM;
+	return 0;
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	return 0;
+}
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+}
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	return 0;
+}
+static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -435,6 +435,7 @@ extern int get_dumpable(struct mm_struct
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_VM_HUGEPAGE		17	/* set when VM_HUGEPAGE is set on vma */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,6 +65,7 @@
 #include <linux/perf_event.h>
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
+#include <linux/khugepaged.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -327,6 +328,9 @@ static int dup_mmap(struct mm_struct *mm
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
+	retval = khugepaged_fork(mm, oldmm);
+	if (retval)
+		goto out;
 
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
@@ -539,6 +543,7 @@ void mmput(struct mm_struct *mm)
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		ksm_exit(mm);
+		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,14 +12,124 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+/*
+ * By default transparent hugepage support is enabled for all mappings
+ * and khugepaged scans all mappings. Defrag is only invoked by
+ * khugepaged hugepage allocations and by page faults inside
+ * MADV_HUGEPAGE regions to avoid the risk of slowing down short lived
+ * allocations.
+ */
 unsigned long transparent_hugepage_flags __read_mostly =
-	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+
+/* default scan 8*512 pte (or vmas) every 10 seconds */
+static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
+static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_full_scans;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+/*
+ * By default collapse a hugepage if at least one pte is mapped,
+ * mirroring what would have happened had the vma been large enough
+ * at page fault time.
+ */
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+
+static int khugepaged(void *none);
+static int mm_slots_hash_init(void);
+static int khugepaged_slab_init(void);
+static void khugepaged_slab_free(void);
+
+#define MM_SLOTS_HASH_HEADS 1024
+static struct hlist_head *mm_slots_hash __read_mostly;
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+} khugepaged_scan = {
+	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+static int start_khugepaged(void)
+{
+	int err = 0;
+	if (khugepaged_enabled()) {
+		int wakeup;
+		if (unlikely(!mm_slot_cache || !mm_slots_hash)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_thread)
+			khugepaged_thread = kthread_run(khugepaged, NULL,
+							"khugepaged");
+		if (unlikely(IS_ERR(khugepaged_thread))) {
+			clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				  &transparent_hugepage_flags);
+			clear_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
+				  &transparent_hugepage_flags);
+			printk(KERN_ERR
+			       "khugepaged: kthread_run(khugepaged) failed\n");
+			err = PTR_ERR(khugepaged_thread);
+			khugepaged_thread = NULL;
+		}
+		wakeup = !list_empty(&khugepaged_scan.mm_head);
+		mutex_unlock(&khugepaged_mutex);
+		if (wakeup)
+			wake_up_interruptible(&khugepaged_wait);
+	} else
+		/* wakeup to exit */
+		wake_up_interruptible(&khugepaged_wait);
+out:
+	return err;
+}
 
 #ifdef CONFIG_SYSFS
+
+static void wakeup_khugepaged(void)
+{
+	mutex_lock(&khugepaged_mutex);
+	if (khugepaged_thread)
+		wake_up_process(khugepaged_thread);
+	mutex_unlock(&khugepaged_mutex);
+}
+
 static ssize_t double_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag enabled,
@@ -153,20 +263,240 @@ static struct attribute *hugepage_attr[]
 
 static struct attribute_group hugepage_attr_group = {
 	.attrs = hugepage_attr,
-	.name = "transparent_hugepage",
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_scan_sleep_millisecs = msecs;
+	wakeup_khugepaged();
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+	       scan_sleep_millisecs_store);
+
+static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
+}
+
+static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_alloc_sleep_millisecs = msecs;
+	wakeup_khugepaged();
+
+	return count;
+}
+static struct kobj_attribute alloc_sleep_millisecs_attr =
+	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
+	       alloc_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long pages;
+
+	err = strict_strtoul(buf, 10, &pages);
+	if (err || !pages || pages > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_pages_to_scan = pages;
+
+	return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
+	       pages_to_scan_store);
+
+static ssize_t pages_collapsed_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
+}
+static struct kobj_attribute pages_collapsed_attr =
+	__ATTR_RO(pages_collapsed);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_full_scans);
+}
+static struct kobj_attribute full_scans_attr =
+	__ATTR_RO(full_scans);
+
+static ssize_t khugepaged_enabled_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+}
+static ssize_t khugepaged_enabled_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = double_flag_store(kobj, attr, buf, count,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+				TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG);
+	if (ret > 0) {
+		int err = start_khugepaged();
+		if (err)
+			ret = err;
+	}
+	return ret;
+}
+static struct kobj_attribute khugepaged_enabled_attr =
+	__ATTR(enabled, 0644, khugepaged_enabled_show,
+	       khugepaged_enabled_store);
+
+static ssize_t khugepaged_defrag_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static ssize_t khugepaged_defrag_store(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static struct kobj_attribute khugepaged_defrag_attr =
+	__ATTR(defrag, 0644, khugepaged_defrag_show,
+	       khugepaged_defrag_store);
+
+/*
+ * max_ptes_none controls whether khugepaged may collapse a hugepage
+ * over unmapped ptes, potentially increasing the memory footprint of
+ * the vmas. With max_ptes_none set to 0, khugepaged never reduces
+ * the free memory in the system as it runs. Increasing max_ptes_none
+ * instead lets collapses consume free memory, since previously
+ * unmapped ptes get instantiated during the khugepaged scan.
+ */
+static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
+}
+static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
+					      struct kobj_attribute *attr,
+					      const char *buf, size_t count)
+{
+	int err;
+	unsigned long max_ptes_none;
+
+	err = strict_strtoul(buf, 10, &max_ptes_none);
+	if (err || max_ptes_none > HPAGE_PMD_NR-1)
+		return -EINVAL;
+
+	khugepaged_max_ptes_none = max_ptes_none;
+
+	return count;
+}
+static struct kobj_attribute khugepaged_max_ptes_none_attr =
+	__ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
+	       khugepaged_max_ptes_none_store);
+
+static struct attribute *khugepaged_attr[] = {
+	&khugepaged_enabled_attr.attr,
+	&khugepaged_defrag_attr.attr,
+	&khugepaged_max_ptes_none_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_collapsed_attr.attr,
+	&full_scans_attr.attr,
+	&scan_sleep_millisecs_attr.attr,
+	&alloc_sleep_millisecs_attr.attr,
+	NULL,
+};
+
+static struct attribute_group khugepaged_attr_group = {
+	.attrs = khugepaged_attr,
+	.name = "khugepaged",
 };
 #endif /* CONFIG_SYSFS */
 
 static int __init hugepage_init(void)
 {
+	int err;
 #ifdef CONFIG_SYSFS
-	int err;
+	static struct kobject *hugepage_kobj;
 
-	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	err = -ENOMEM;
+	hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
+	if (unlikely(!hugepage_kobj)) {
+		printk(KERN_ERR "hugepage: failed kobject create\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed to register hugepage group\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed to register khugepaged group\n");
+		goto out;
+	}
+#endif
+
+	err = khugepaged_slab_init();
 	if (err)
-		printk(KERN_ERR "hugepage: register sysfs failed\n");
-#endif
-	return 0;
+		goto out;
+
+	err = mm_slots_hash_init();
+	if (err) {
+		khugepaged_slab_free();
+		goto out;
+	}
+
+	start_khugepaged();
+
+out:
+	return err;
 }
 module_init(hugepage_init)
 
@@ -183,6 +513,15 @@ static int __init setup_transparent_huge
 		       transparent_hugepage_flags);
 		transparent_hugepage_flags = 0;
 	}
+	if (test_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_FLAG,
+		     &transparent_hugepage_flags) &&
+	    test_bit(TRANSPARENT_HUGEPAGE_KHUGEPAGED_REQ_MADV_FLAG,
+		     &transparent_hugepage_flags)) {
+		printk(KERN_WARNING
+		       "transparent_hugepage=%lu invalid parameter, disabling",
+		       transparent_hugepage_flags);
+		transparent_hugepage_flags = 0;
+	}
 	return 1;
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
@@ -277,6 +616,8 @@ int do_huge_pmd_anonymous_page(struct mm
 	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
+		if (unlikely(khugepaged_enter(vma)))
+			return VM_FAULT_OOM;
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
@@ -881,6 +1222,755 @@ int hugepage_madvise(unsigned long *vm_f
 	return 0;
 }
 
+static int __init khugepaged_slab_init(void)
+{
+	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __init khugepaged_slab_free(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+	mm_slot_cache = NULL;
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)	/* initialization failed */
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static int __init mm_slots_hash_init(void)
+{
+	mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head),
+				GFP_KERNEL);
+	if (!mm_slots_hash)
+		return -ENOMEM;
+	return 0;
+}
+
+#if 0
+static void __init mm_slots_hash_free(void)
+{
+	kfree(mm_slots_hash);
+	mm_slots_hash = NULL;
+}
+#endif
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	hlist_for_each_entry(mm_slot, node, bucket, hash) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	mm_slot->mm = mm;
+	hlist_add_head(&mm_slot->hash, bucket);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int wakeup;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	/* __khugepaged_exit() must not run from under us */
+	VM_BUG_ON(khugepaged_test_exit(mm));
+	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
+		free_mm_slot(mm_slot);
+		return 0;
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	/*
+	 * Insert just behind the scanning cursor, to let the area settle
+	 * down a little.
+	 */
+	wakeup = list_empty(&khugepaged_scan.mm_head);
+	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+	spin_unlock(&khugepaged_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+	if (wakeup)
+		wake_up_interruptible(&khugepaged_wait);
+
+	return 0;
+}
+
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	unsigned long hstart, hend;
+	if (!vma->anon_vma)
+		/*
+		 * Not yet faulted in so we will register later in the
+		 * page fault if needed.
+		 */
+		return 0;
+	if (vma->vm_file || vma->vm_ops)
+		/* khugepaged not yet working on file or special mappings */
+		return 0;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (hstart < hend)
+		return khugepaged_enter(vma);
+	return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int free = 0;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = 1;
+	}
+
+	if (free) {
+		spin_unlock(&khugepaged_mm_lock);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		spin_unlock(&khugepaged_mm_lock);
+		/*
+		 * This is required to serialize against
+		 * khugepaged_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we
+		 * return all pagetables will be destroyed) until
+		 * khugepaged has finished working on the pagetables
+		 * under the mmap_sem.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	} else
+		spin_unlock(&khugepaged_mm_lock);
+}
+
+static void release_pte_page(struct page *page)
+{
+	/* 0 stands for page_is_file_cache(page) == false */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	unlock_page(page);
+	putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+	while (--_pte >= pte) {
+		pte_t pteval = *_pte;
+		if (!pte_none(pteval))
+			release_pte_page(pte_page(pteval));
+	}
+}
+
+static void release_all_pte_pages(pte_t *pte)
+{
+	release_pte_pages(pte, pte + HPAGE_PMD_NR);
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+					unsigned long address,
+					pte_t *pte)
+{
+	struct page *page;
+	pte_t *_pte;
+	int referenced = 0, isolated = 0, none = 0;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else {
+				release_pte_pages(pte, _pte);
+				goto out;
+			}
+		}
+		if (!pte_present(pteval) || !pte_write(pteval)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		page = vm_normal_page(vma, address, pteval);
+		if (unlikely(!page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		VM_BUG_ON(!PageSwapBacked(page));
+
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: PG_lock
+		 * is needed to serialize against split_huge_page
+		 * when invoked from the VM.
+		 */
+		if (!trylock_page(page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * Isolate the page to avoid collapsing a hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/* 0 stands for page_is_file_cache(page) == false */
+		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		VM_BUG_ON(!PageLocked(page));
+		VM_BUG_ON(PageLRU(page));
+
+		/* If there is no mapped pte young don't collapse the page */
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (unlikely(!referenced))
+		release_all_pte_pages(pte);
+	else
+		isolated = 1;
+out:
+	return isolated;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+				      struct vm_area_struct *vma,
+				      unsigned long address,
+				      spinlock_t *ptl)
+{
+	pte_t *_pte;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		pte_t pteval = *_pte;
+		struct page *src_page;
+
+		if (pte_none(pteval)) {
+			clear_user_highpage(page, address);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		} else {
+			src_page = pte_page(pteval);
+			copy_user_highpage(page, src_page, address, vma);
+			VM_BUG_ON(page_mapcount(src_page) != 1);
+			VM_BUG_ON(page_count(src_page) != 2);
+			release_pte_page(src_page);
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pte_clear(vma->vm_mm, address, _pte);
+			page_remove_rmap(src_page);
+			spin_unlock(ptl);
+			free_page_and_swap_cache(src_page);
+		}
+
+		address += PAGE_SIZE;
+		page++;
+	}
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	struct vm_area_struct *vma;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, _pmd;
+	pte_t *pte;
+	pgtable_t pgtable;
+	struct page *new_page;
+	spinlock_t *ptl;
+	int isolated;
+	unsigned long hstart, hend;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(!*hpage);
+
+	/*
+	 * Prevent all access to pagetables with the exception of
+	 * gup_fast later handled by the ptep_clear_flush and the VM
+	 * handled by the anon_vma lock + PG_lock.
+	 */
+	down_write(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	vma = find_vma(mm, address);
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
+		goto out;
+
+	if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+		goto out;
+
+	/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+	if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
+		goto out;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	/* pmd can't go away or become huge under us */
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	new_page = *hpage;
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
+		goto out;
+
+	/*
+	 * Stop anon_vma rmap pagetable access. vma->anon_vma->lock is
+	 * enough for now (we don't need to check each anon_vma
+	 * pointed by each page->mapping) because collapse_huge_page
+	 * only works on not-shared anon pages (that are guaranteed to
+	 * belong to vma->anon_vma).
+	 */
+	spin_lock(&vma->anon_vma->lock);
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+
+	spin_lock(&mm->page_table_lock); /* probably unnecessary */
+	/*
+	 * After this gup_fast can't run anymore. This also removes
+	 * any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address
+	 * to avoid the risk of CPU bugs in that area.
+	 */
+	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	spin_lock(ptl);
+	isolated = __collapse_huge_page_isolate(vma, address, pte);
+	spin_unlock(ptl);
+	pte_unmap(pte);
+
+	if (unlikely(!isolated)) {
+		spin_lock(&mm->page_table_lock);
+		BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, address, pmd, _pmd);
+		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&vma->anon_vma->lock);
+		mem_cgroup_uncharge_page(new_page);
+		goto out;
+	}
+
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+	spin_unlock(&vma->anon_vma->lock);
+
+	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	__SetPageUptodate(new_page);
+	pgtable = pmd_pgtable(_pmd);
+	VM_BUG_ON(page_count(pgtable) != 1);
+	VM_BUG_ON(page_mapcount(pgtable) != 0);
+
+	_pmd = mk_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+	_pmd = pmd_mkhuge(_pmd);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to avoid the copy_huge_page writes to become
+	 * visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	BUG_ON(!pmd_none(*pmd));
+	page_add_new_anon_rmap(new_page, vma, address);
+	set_pmd_at(mm, address, pmd, _pmd);
+	update_mmu_cache(vma, address, pmd);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes--;
+	spin_unlock(&mm->page_table_lock);
+
+	*hpage = NULL;
+	khugepaged_pages_collapsed++;
+out:
+	up_write(&mm->mmap_sem);
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	int ret = 0, referenced = 0, none = 0;
+	struct page *page;
+	unsigned long _address;
+	spinlock_t *ptl;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else
+				goto out_unmap;
+		}
+		if (!pte_present(pteval) || !pte_write(pteval))
+			goto out_unmap;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			goto out_unmap;
+		VM_BUG_ON(PageCompound(page));
+		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+			goto out_unmap;
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1)
+			goto out_unmap;
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (referenced)
+		ret = 1;
+out_unmap:
+	pte_unmap_unlock(pte, ptl);
+	if (ret) {
+		up_read(&mm->mmap_sem);
+		collapse_huge_page(mm, address, hpage);
+	}
+out:
+	return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+
+		/*
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		 */
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+					    struct page **hpage)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	VM_BUG_ON(!pages);
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_scan.mm_slot)
+		mm_slot = khugepaged_scan.mm_slot;
+	else {
+		mm_slot = list_entry(khugepaged_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		khugepaged_scan.address = 0;
+		khugepaged_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&khugepaged_mm_lock);
+
+	mm = mm_slot->mm;
+	down_read(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, khugepaged_scan.address);
+
+	progress++;
+	for (; vma; vma = vma->vm_next) {
+		unsigned long hstart, hend;
+
+		cond_resched();
+		if (unlikely(khugepaged_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!(vma->vm_flags & VM_HUGEPAGE) &&
+		    !khugepaged_always()) {
+			progress++;
+			continue;
+		}
+
+		/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+		if (!vma->anon_vma || vma->vm_ops || vma->vm_file) {
+			khugepaged_scan.address = vma->vm_end;
+			progress++;
+			continue;
+		}
+		VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+		hend = vma->vm_end & HPAGE_PMD_MASK;
+		if (hstart >= hend) {
+			progress++;
+			continue;
+		}
+		if (khugepaged_scan.address < hstart)
+			khugepaged_scan.address = hstart;
+		if (khugepaged_scan.address > hend) {
+			khugepaged_scan.address = hend + HPAGE_PMD_SIZE;
+			progress++;
+			continue;
+		}
+		BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+		while (khugepaged_scan.address < hend) {
+			int ret;
+			cond_resched();
+			if (unlikely(khugepaged_test_exit(mm)))
+				goto breakouterloop;
+
+			VM_BUG_ON(khugepaged_scan.address < hstart ||
+				  khugepaged_scan.address + HPAGE_PMD_SIZE >
+				  hend);
+			ret = khugepaged_scan_pmd(mm, vma,
+						  khugepaged_scan.address,
+						  hpage);
+			/* move to next address */
+			khugepaged_scan.address += HPAGE_PMD_SIZE;
+			progress += HPAGE_PMD_NR;
+			if (ret)
+				/* we released mmap_sem so break loop */
+				goto breakouterloop_mmap_sem;
+			if (progress >= pages)
+				goto breakouterloop;
+		}
+	}
+breakouterloop:
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+	spin_lock(&khugepaged_mm_lock);
+	BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (khugepaged_test_exit(mm) || !vma) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * khugepaged runs here, khugepaged_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+			khugepaged_scan.mm_slot = list_entry(
+				mm_slot->mm_node.next,
+				struct mm_slot, mm_node);
+			khugepaged_scan.address = 0;
+		} else {
+			khugepaged_scan.mm_slot = NULL;
+			khugepaged_full_scans++;
+		}
+
+		collect_mm_slot(mm_slot);
+	}
+
+	return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) &&
+		khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) ||
+		!khugepaged_enabled();
+}
+
+static void khugepaged_do_scan(struct page **hpage)
+{
+	unsigned int progress = 0, pass_through_head = 0;
+	unsigned int pages = khugepaged_pages_to_scan;
+
+	barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+	while (progress < pages) {
+		cond_resched();
+
+		if (!*hpage) {
+			*hpage = alloc_hugepage(khugepaged_defrag());
+			if (unlikely(!*hpage))
+				break;
+		}
+
+		spin_lock(&khugepaged_mm_lock);
+		if (!khugepaged_scan.mm_slot)
+			pass_through_head++;
+		if (khugepaged_has_work() &&
+		    pass_through_head < 2)
+			progress += khugepaged_scan_mm_slot(pages - progress,
+							    hpage);
+		else
+			progress = pages;
+		spin_unlock(&khugepaged_mm_lock);
+	}
+}
+
+static struct page *khugepaged_alloc_hugepage(void)
+{
+	struct page *hpage;
+
+	do {
+		hpage = alloc_hugepage(khugepaged_defrag());
+		if (!hpage)
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_alloc_sleep_millisecs));
+	} while (unlikely(!hpage) &&
+		 likely(khugepaged_enabled()));
+	return hpage;
+}
+
+static void khugepaged_loop(void)
+{
+	struct page *hpage;
+
+	while (likely(khugepaged_enabled())) {
+		hpage = khugepaged_alloc_hugepage();
+		if (unlikely(!hpage))
+			break;
+
+		khugepaged_do_scan(&hpage);
+		if (hpage)
+			put_page(hpage);
+		if (khugepaged_has_work()) {
+			if (!khugepaged_scan_sleep_millisecs)
+				continue;
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_scan_sleep_millisecs));
+		} else if (khugepaged_enabled())
+			wait_event_interruptible(khugepaged_wait,
+						 khugepaged_wait_event());
+	}
+}
+
+static int khugepaged(void *none)
+{
+	struct mm_slot *mm_slot;
+
+	set_user_nice(current, 19);
+
+	for (;;) {
+		BUG_ON(khugepaged_thread != current);
+		khugepaged_loop();
+		BUG_ON(khugepaged_thread != current);
+
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_enabled())
+			break;
+		mutex_unlock(&khugepaged_mutex);
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = khugepaged_scan.mm_slot;
+	khugepaged_scan.mm_slot = NULL;
+	if (mm_slot)
+		collect_mm_slot(mm_slot);
+	spin_unlock(&khugepaged_mm_lock);
+
+	khugepaged_thread = NULL;
+	mutex_unlock(&khugepaged_mutex);
+
+	return 0;
+}
+
 void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 {
 	struct page *page;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -28,6 +28,7 @@
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_event.h>
+#include <linux/khugepaged.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -800,6 +801,7 @@ struct vm_area_struct *vma_merge(struct 
 				end, prev->vm_pgoff, NULL);
 		if (err)
 			return NULL;
+		khugepaged_enter_vma_merge(prev);
 		return prev;
 	}
 
@@ -818,6 +820,7 @@ struct vm_area_struct *vma_merge(struct 
 				next->vm_pgoff - pglen, NULL);
 		if (err)
 			return NULL;
+		khugepaged_enter_vma_merge(area);
 		return area;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 35 of 41] skip transhuge pages in ksm for now
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (33 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 34 of 41] khugepaged Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 36 of 41] remove PG_buddy Andrea Arcangeli
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

Skip transhuge pages in ksm for now.
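
When the scan cursor hits a transhuge page, ksm now skips the whole hugepage
instead of walking its subpages: the cursor is aligned down to the hugepage
boundary and stepped past it, so e.g. with 2M hugepages a cursor at 0x201000
resumes at 0x400000. The rounding as done in the diff:

	/* align down to the hugepage, then step past it */
	ksm_scan.address &= HPAGE_PMD_MASK;
	ksm_scan.address += HPAGE_PMD_SIZE;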

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -449,7 +449,7 @@ static struct page *get_mergeable_page(s
 	page = follow_page(vma, addr, FOLL_GET);
 	if (!page)
 		goto out;
-	if (PageAnon(page)) {
+	if (PageAnon(page) && !PageTransCompound(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
@@ -1294,7 +1294,19 @@ next_mm:
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (*page && PageAnon(*page)) {
+			if (!*page) {
+				ksm_scan.address += PAGE_SIZE;
+				cond_resched();
+				continue;
+			}
+			if (PageTransCompound(*page)) {
+				put_page(*page);
+				ksm_scan.address &= HPAGE_PMD_MASK;
+				ksm_scan.address += HPAGE_PMD_SIZE;
+				cond_resched();
+				continue;
+			}
+			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
 				rmap_item = get_next_rmap_item(slot,
@@ -1308,8 +1320,7 @@ next_mm:
 				up_read(&mm->mmap_sem);
 				return rmap_item;
 			}
-			if (*page)
-				put_page(*page);
+			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 36 of 41] remove PG_buddy
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (34 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 35 of 41] skip transhuge pages in ksm for now Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 37 of 41] add x86 32bit support Andrea Arcangeli
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Andrea Arcangeli <aarcange@redhat.com>

PG_buddy can be converted to _mapcount == -2, so PG_compound_lock can be
added to page->flags without overflowing it (the sparse section bits keep
increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also requires
moving the memory hotplug bootmem typing code from _mapcount to lru.next to
avoid any risk of clashes. We can't use lru.next for the PG_buddy removal
itself, but memory hotplug can use lru.next even more easily than the
mapcount.
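
The encoding works because _mapcount is -1 for a page with no mappings and
only goes up from there once the page is mapped, leaving -2 free as a
sentinel. A minimal sketch of the two encodings after this patch:

	/* buddy-ness becomes a _mapcount sentinel instead of a flag */
	int is_buddy = atomic_read(&page->_mapcount) == -2;

	/* the bootmem type rides in page->lru.next, range-checked */
	unsigned long type = (unsigned long) page->lru.next;
	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);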

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/proc/page.c b/fs/proc/page.c
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -116,15 +116,17 @@ u64 stable_page_flags(struct page *page)
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
 
+	/*
+	 * Caveats on high order pages: page->_count will only be set
+	 * -1 on the head page; SLUB/SLQB do the same for PG_slab;
+	 * SLOB won't set PG_slab at all on compound pages.
+	 */
+	if (PageBuddy(page))
+		u |= 1 << KPF_BUDDY;
+
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
-	/*
-	 * Caveats on high order pages:
-	 * PG_buddy will only be set on the head page; SLUB/SLQB do the same
-	 * for PG_slab; SLOB won't set PG_slab at all on compound pages.
-	 */
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
-	u |= kpf_copy_bit(k, KPF_BUDDY,		PG_buddy);
 
 	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
 	u |= kpf_copy_bit(k, KPF_DIRTY,		PG_dirty);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -13,12 +13,16 @@ struct mem_section;
 #ifdef CONFIG_MEMORY_HOTPLUG
 
 /*
- * Types for free bootmem.
- * The normal smallest mapcount is -1. Here is smaller value than it.
+ * Types for free bootmem stored in page->lru.next. These have to sit
+ * in an arbitrary, otherwise-unused range of the unsigned long space
+ * so that stray values can be caught for debugging.
  */
-#define SECTION_INFO		(-1 - 1)
-#define MIX_SECTION_INFO	(-1 - 2)
-#define NODE_INFO		(-1 - 3)
+enum {
+	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
+	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
+	MIX_SECTION_INFO,
+	NODE_INFO,
+	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+};
 
 /*
  * pgdat resizing functions
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -358,6 +358,27 @@ static inline void init_page_count(struc
 	atomic_set(&page->_count, 1);
 }
 
+/*
+ * PageBuddy() indicate that the page is free and in the buddy system
+ * (see mm/page_alloc.c).
+ */
+static inline int PageBuddy(struct page *page)
+{
+	return atomic_read(&page->_mapcount) == -2;
+}
+
+static inline void __SetPageBuddy(struct page *page)
+{
+	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
+	atomic_set(&page->_mapcount, -2);
+}
+
+static inline void __ClearPageBuddy(struct page *page)
+{
+	VM_BUG_ON(!PageBuddy(page));
+	atomic_set(&page->_mapcount, -1);
+}
+
 void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -48,9 +48,6 @@
  * struct page (these bits with information) are always mapped into kernel
  * address space...
  *
- * PG_buddy is set to indicate that the page is free and in the buddy system
- * (see mm/page_alloc.c).
- *
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
@@ -96,7 +93,6 @@ enum pageflags {
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
-	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
 #ifdef CONFIG_MMU
@@ -235,7 +231,6 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
  * risky: they bypass page accounting.
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
-__PAGEFLAG(Buddy, buddy)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
@@ -430,7 +425,7 @@ static inline void ClearPageCompound(str
 #define PAGE_FLAGS_CHECK_AT_FREE \
 	(1 << PG_lru	 | 1 << PG_locked    | \
 	 1 << PG_private | 1 << PG_private_2 | \
-	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
+	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
 	 __PG_COMPOUND_LOCK)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -65,9 +65,10 @@ static void release_memory_resource(stru
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info,  struct page *page, int type)
+static void get_page_bootmem(unsigned long info,  struct page *page,
+			     unsigned long type)
 {
-	atomic_set(&page->_mapcount, type);
+	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
 	set_page_private(page, info);
 	atomic_inc(&page->_count);
@@ -77,15 +78,16 @@ static void get_page_bootmem(unsigned lo
  * so use __ref to tell modpost not to generate a warning */
 void __ref put_page_bootmem(struct page *page)
 {
-	int type;
+	unsigned long type;
 
-	type = atomic_read(&page->_mapcount);
-	BUG_ON(type >= -1);
+	type = (unsigned long) page->lru.next;
+	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
 
 	if (atomic_dec_return(&page->_count) == 1) {
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
-		reset_page_mapcount(page);
+		INIT_LIST_HEAD(&page->lru);
 		__free_pages_bootmem(page, 0);
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,8 +426,8 @@ __find_combined_index(unsigned long page
  * (c) a page and its buddy have the same order &&
  * (d) a page and its buddy are in the same zone.
  *
- * For recording whether a page is in the buddy system, we use PG_buddy.
- * Setting, clearing, and testing PG_buddy is serialized by zone->lock.
+ * For recording whether a page is in the buddy system, we set ->_mapcount -2.
+ * Setting, clearing, and testing _mapcount -2 is serialized by zone->lock.
  *
  * For recording page's order, we use page_private(page).
  */
@@ -460,7 +460,7 @@ static inline int page_is_buddy(struct p
  * as necessary, plus some accounting needed to play nicely with other
  * parts of the VM system.
  * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with PG_buddy. Page's
+ * free pages of length of (1 << order) and marked with _mapcount -2. Page's
  * order is recorded in page_private(page) field.
  * So when we are allocating or freeing one, we can derive the state of the
  * other.  That is, if we allocate a small block, and both were   
@@ -5251,7 +5251,6 @@ static struct trace_print_flags pageflag
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
-	{1UL << PG_buddy,		"buddy"		},
 	{1UL << PG_swapbacked,		"swapbacked"	},
 	{1UL << PG_unevictable,		"unevictable"	},
 #ifdef CONFIG_MMU
diff --git a/mm/sparse.c b/mm/sparse.c
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -670,10 +670,10 @@ static void __kfree_section_memmap(struc
 static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 {
 	unsigned long maps_section_nr, removing_section_nr, i;
-	int magic;
+	unsigned long magic;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = atomic_read(&page->_mapcount);
+		magic = (unsigned long) page->lru.next;
 
 		BUG_ON(magic == NODE_INFO);
 


* [PATCH 37 of 41] add x86 32bit support
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (35 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 36 of 41] remove PG_buddy Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 38 of 41] mincore transparent hugepage support Andrea Arcangeli
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Johannes Weiner <hannes@cmpxchg.org>

Add support for transparent hugepages to x86 32bit.

VM_HUGEPAGE shares its bit with VM_MAPPED_COPY: mm/nommu.c will never
support transparent hugepages, so the two flags can never coexist.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -46,6 +46,15 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+	return __pmd(xchg((pmdval_t *)xp, 0));
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken,
  * split up the 29 bits of offset into this range:
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -104,6 +104,29 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+union split_pmd {
+	struct {
+		u32 pmd_low;
+		u32 pmd_high;
+	};
+	pmd_t pmd;
+};
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	union split_pmd res, *orig = (union split_pmd *)pmdp;
+
+	/* xchg acts as a barrier before setting of the high bits */
+	res.pmd_low = xchg(&orig->pmd_low, 0);
+	res.pmd_high = orig->pmd_high;
+	orig->pmd_high = 0;
+
+	return res.pmd;
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits 0, 6 and 7 are taken in the low part of the pte,
  * put the 32 bits of offset into the high part.
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -95,6 +95,11 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
@@ -143,6 +148,18 @@ static inline int pmd_large(pmd_t pte)
 		(_PAGE_PSE | _PAGE_PRESENT);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
 {
 	pteval_t v = native_pte_val(pte);
@@ -217,6 +234,55 @@ static inline pte_t pte_mkspecial(pte_t 
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 /*
  * Mask out unsupported bits in a present pgprot.  Non-present pgprots
  * can use those bits for other purposes, so leave them be.
@@ -525,6 +591,14 @@ static inline pte_t native_local_ptep_ge
 	return res;
 }
 
+static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	pmd_t res = *pmdp;
+
+	native_pmd_clear(pmdp);
+	return res;
+}
+
 static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
 				     pte_t *ptep , pte_t pte)
 {
@@ -612,6 +686,49 @@ static inline void ptep_set_wrprotect(st
 	pte_update(mm, addr, ptep);
 }
 
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	pmd_update(mm, addr, pmdp);
+}
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -182,115 +182,6 @@ extern void cleanup_highmap(void);
 
 #define __HAVE_ARCH_PTE_SAME
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
-static inline int pmd_trans_huge(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_PSE;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
-#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
-
-#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
-extern int pmdp_set_access_flags(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp,
-				 pmd_t entry, int dirty);
-
-#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
-				     unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
-extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
-				  unsigned long address, pmd_t *pmdp);
-
-
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMD_WRITE
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
-#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
-static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmdp)
-{
-	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
-	pmd_update(mm, addr, pmdp);
-	return pmd;
-}
-
-#define __HAVE_ARCH_PMDP_SET_WRPROTECT
-static inline void pmdp_set_wrprotect(struct mm_struct *mm,
-				      unsigned long addr, pmd_t *pmdp)
-{
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
-	pmd_update(mm, addr, pmdp);
-}
-
-static inline int pmd_young(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_ACCESSED;
-}
-
-static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v | set);
-}
-
-static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v & ~clear);
-}
-
-static inline pmd_t pmd_mkold(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_wrprotect(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mkdirty(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_DIRTY);
-}
-
-static inline pmd_t pmd_mkhuge(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_PSE);
-}
-
-static inline pmd_t pmd_mkyoung(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
-}
-
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -351,7 +351,7 @@ int pmdp_test_and_clear_young(struct vm_
 
 	if (pmd_young(*pmdp))
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
-					 (unsigned long *) &pmdp->pmd);
+					 (unsigned long *)pmdp);
 
 	if (ret)
 		pmd_update(vma->vm_mm, addr, pmdp);
@@ -393,7 +393,7 @@ void pmdp_splitting_flush(struct vm_area
 	int set;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)&pmdp->pmd);
+				(unsigned long *)pmdp);
 	if (set) {
 		pmd_update(vma->vm_mm, address, pmdp);
 		/* need tlb flush only to serialize against gup-fast */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,7 +98,11 @@ extern unsigned int kobjsize(const void 
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
+#else
+#define VM_HUGEPAGE	0x01000000	/* MADV_HUGEPAGE marked this vma */
+#endif
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
 
@@ -107,9 +111,6 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
-#if BITS_PER_LONG > 32
-#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
-#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -290,7 +290,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage support" if EMBEDDED
-	depends on X86_64
+	depends on X86
 	default y
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and


* [PATCH 38 of 41] mincore transparent hugepage support
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (36 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 37 of 41] add x86 32bit support Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 39 of 41] add pmd_modify Andrea Arcangeli
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Johannes Weiner <hannes@cmpxchg.org>

Handle transparent huge page pmd entries natively instead of splitting
them into subpages.
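
The userspace-visible effect in a nutshell (a hedged sketch, not part
of the patch; "addr" and its THP backing are assumed):

	unsigned char vec[512];		/* one byte per 4k subpage of 2M */

	/* addr assumed 2M-aligned and backed by a single huge pmd */
	if (mincore(addr, 2UL << 20, vec) == 0) {
		/* one locked pmd check plus a memset() filled all 512
		   bytes, instead of 512 individual pte lookups */
	}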

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pm
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd);
+extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long addr, unsigned long end,
+			unsigned char *vec);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -936,6 +936,31 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 	return ret;
 }
 
+int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, unsigned long end,
+		unsigned char *vec)
+{
+	int ret = 0;
+
+	spin_lock(&vma->vm_mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		ret = !pmd_trans_splitting(*pmd);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		if (unlikely(!ret))
+			wait_split_huge_page(vma->anon_vma, pmd);
+		else {
+			/*
+			 * All logical pages in the range are present
+			 * if backed by a huge page.
+			 */
+			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+		}
+	} else
+		spin_unlock(&vma->vm_mm->page_table_lock);
+
+	return ret;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -154,7 +154,13 @@ static void mincore_pmd_range(struct vm_
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
+		if (pmd_trans_huge(*pmd)) {
+			if (mincore_huge_pmd(vma, pmd, addr, next, vec)) {
+				vec += (next - addr) >> PAGE_SHIFT;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd))
 			mincore_unmapped_range(vma, addr, next, vec);
 		else


* [PATCH 39 of 41] add pmd_modify
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (37 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 38 of 41] mincore transparent hugepage support Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 40 of 41] mprotect: pass vma down to page table walkers Andrea Arcangeli
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Johannes Weiner <hannes@cmpxchg.org>

Add pmd_modify() for use with mprotect() on huge pmds.
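
The intended call sequence (a sketch of what the mprotect patch later
in this series does with it; mm, addr, pmd and newprot assumed):

	pmd_t entry;

	entry = pmdp_get_and_clear(mm, addr, pmd);
	/* _HPAGE_CHG_MASK preserves the pfn, accessed/dirty and PSE bits */
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);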

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -323,6 +323,16 @@ static inline pte_t pte_modify(pte_t pte
 	return __pte(val);
 }
 
+static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+{
+	pmdval_t val = pmd_val(pmd);
+
+	val &= _HPAGE_CHG_MASK;
+	val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+
+	return __pmd(val);
+}
+
 /* mprotect needs to preserve PAT bits when updating vm_page_prot */
 #define pgprot_modify pgprot_modify
 static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -72,6 +72,7 @@
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 #define _PAGE_CACHE_MASK	(_PAGE_PCD | _PAGE_PWT)
 #define _PAGE_CACHE_WB		(0)


* [PATCH 40 of 41] mprotect: pass vma down to page table walkers
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (38 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 39 of 41] add pmd_modify Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-02  0:42 ` [PATCH 41 of 41] mprotect: transparent huge page support Andrea Arcangeli
  2010-04-05 19:09 ` [PATCH 00 of 41] Transparent Hugepage Support #17 Andrew Morton
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Johannes Weiner <hannes@cmpxchg.org>

Waiting for huge pmds to finish splitting requires the vma's anon_vma,
so pass along the vma instead of the mm; we can always get the latter
from vma->vm_mm when we need it.
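
A sketch of the consumer this enables (the huge pmd path added by the
next patch in this series):

	if (unlikely(pmd_trans_splitting(*pmd))) {
		spin_unlock(&vma->vm_mm->page_table_lock);
		/* takes the anon_vma lock: needs the vma, not just the mm */
		wait_split_huge_page(vma->anon_vma, pmd);
	}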

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -36,10 +36,11 @@ static inline pgprot_t pgprot_modify(pgp
 }
 #endif
 
-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static void change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 
@@ -79,7 +80,7 @@ static void change_pte_range(struct mm_s
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
+static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -89,14 +90,14 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(mm, pmd);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
+		change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -108,7 +109,7 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot, dirty_accountable);
+		change_pmd_range(vma, pud, addr, next, newprot, dirty_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
@@ -128,7 +129,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
+		change_pud_range(vma, pgd, addr, next, newprot, dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }


* [PATCH 41 of 41] mprotect: transparent huge page support
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (39 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 40 of 41] mprotect: pass vma down to page table walkers Andrea Arcangeli
@ 2010-04-02  0:42 ` Andrea Arcangeli
  2010-04-05 19:09 ` [PATCH 00 of 41] Transparent Hugepage Support #17 Andrew Morton
  41 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-02  0:42 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

From: Johannes Weiner <hannes@cmpxchg.org>

Natively handle huge pmds when changing page tables on behalf of
mprotect().

I left out update_mmu_cache() because we do not need it on x86 anyway,
but more importantly because the interface works on ptes, not pmds.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,8 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned char *vec);
+extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -961,6 +961,33 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }
 
+int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, pgprot_t newprot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			pmd_t entry;
+
+			entry = pmdp_get_and_clear(mm, addr, pmd);
+			entry = pmd_modify(entry, newprot);
+			set_pmd_at(mm, addr, pmd, entry);
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&vma->vm_mm->page_table_lock);
+
+	return ret;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,13 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
+		if (pmd_trans_huge(*pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE)
+				split_huge_page_pmd(vma->vm_mm, pmd);
+			else if (change_huge_pmd(vma, pmd, addr, newprot))
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable);


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
                   ` (40 preceding siblings ...)
  2010-04-02  0:42 ` [PATCH 41 of 41] mprotect: transparent huge page support Andrea Arcangeli
@ 2010-04-05 19:09 ` Andrew Morton
  2010-04-05 19:36   ` Ingo Molnar
  41 siblings, 1 reply; 205+ messages in thread
From: Andrew Morton @ 2010-04-05 19:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura


Problem.  It appears that these patches have only been sent to
linux-mm.  Linus doesn't read linux-mm and has never seen them.  I do
think we should get things squared away with him regarding the overall
intent and implementation approach before trying to go further.

I forwarded "[PATCH 27 of 41] transparent hugepage core" and his
summary was "So I don't hate the patch, but it sure as hell doesn't
make me happy either.  And if the only advantage is about TLB miss
costs, I really don't see the point personally.".  So if there's more
benefit to the patches than this, that will need some expounding upon.

So I'd suggest that you a) address some minor Linus comments which I'll
forward separately, b) rework [patch 0/n] to provide a complete
description of the benefits and the downsides (if that isn't there
already) and c) resend everything, cc'ing Linus and linux-kernel and
we'll get it thrashed out.


Sorry.  Normally I use my own judgement on MM patches, but in this case
if I was asked "why did you send all this stuff", I don't believe I
personally have strong enough arguments to justify the changes - you're
in a better position than I to make that case.  Plus this is a *large*
patchset, and it plays in an area where Linus is known to have, err,
opinions.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 19:09 ` [PATCH 00 of 41] Transparent Hugepage Support #17 Andrew Morton
@ 2010-04-05 19:36   ` Ingo Molnar
  2010-04-05 20:26     ` Pekka Enberg
  0 siblings, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-05 19:36 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Pekka Enberg


* Andrew Morton <akpm@linux-foundation.org> wrote:

> Problem.  It appears that these patches have only been sent to linux-mm.  
> Linus doesn't read linux-mm and has never seen them.  I do think we should 
> get things squared away with him regarding the overall intent and 
> implementation approach before trying to go further.
> 
> I forwarded "[PATCH 27 of 41] transparent hugepage core" and his summary was 
> "So I don't hate the patch, but it sure as hell doesn't make me happy 
> either.  And if the only advantage is about TLB miss costs, I really don't 
> see the point personally.".  So if there's more benefit to the patches than 
> this, that will need some expounding upon.
> 
> So I'd suggest that you a) address some minor Linus comments which I'll 
> forward separately, b) rework [patch 0/n] to provide a complete description 
> of the benefits and the downsides (if that isn't there already) and c) 
> resend everything, cc'ing Linus and linux-kernel and we'll get it thrashed 
> out.
> 
> Sorry.  Normally I use my own judgement on MM patches, but in this case if I 
> was asked "why did you send all this stuff", I don't believe I personally 
> have strong enough arguments to justify the changes - you're in a better 
> position than I to make that case.  Plus this is a *large* patchset, and it 
> plays in an area where Linus is known to have, err, opinions.

Not sure whether it got mentioned, but one area where huge pages are rather 
useful is apps/middleware that do some sort of GC with tons of RAM.

There the 512x reduction in remapping and TLB flush costs (not just TLB miss 
costs) obviously makes for a big difference not just in straight 
performance/latency but also in cache footprint. AFAIK most GC concepts today 
(that cover many gigabytes of memory) are limited by remap and TLB flush 
performance.
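
To put the 512x in concrete terms (back-of-envelope only, assuming x86 
with 4k base pages, 2M huge pages and a hypothetical 8G heap):

	#define BASE_PAGE	(4UL << 10)
	#define HUGE_PAGE	(2UL << 20)
	#define GC_HEAP		(8UL << 30)

	unsigned long ptes_per_pmd = HUGE_PAGE / BASE_PAGE;	/* 512 */
	unsigned long remaps_4k = GC_HEAP / BASE_PAGE;	/* 2097152 per pass */
	unsigned long remaps_2m = GC_HEAP / HUGE_PAGE;	/*    4096 per pass */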

So if we accept that shuffling lots of virtual memory is worth doing then the 
next natural step would be to make it transparent.

Just my 2c,

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 19:36   ` Ingo Molnar
@ 2010-04-05 20:26     ` Pekka Enberg
  2010-04-05 20:32       ` Linus Torvalds
  0 siblings, 1 reply; 205+ messages in thread
From: Pekka Enberg @ 2010-04-05 20:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Ingo,

On Mon, Apr 5, 2010 at 10:36 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> Problem.  It appears that these patches have only been sent to linux-mm.
>> Linus doesn't read linux-mm and has never seen them.  I do think we should
>> get things squared away with him regarding the overall intent and
>> implementation approach before trying to go further.
>>
>> I forwarded "[PATCH 27 of 41] transparent hugepage core" and his summary was
>> "So I don't hate the patch, but it sure as hell doesn't make me happy
>> either.  And if the only advantage is about TLB miss costs, I really don't
>> see the point personally.".  So if there's more benefit to the patches than
>> this, that will need some expounding upon.
>>
>> So I'd suggest that you a) address some minor Linus comments which I'll
>> forward separately, b) rework [patch 0/n] to provide a complete description
>> of the benefits and the downsides (if that isn't there already) and c)
>> resend everything, cc'ing Linus and linux-kernel and we'll get it thrashed
>> out.
>>
>> Sorry.  Normally I use my own judgement on MM patches, but in this case if I
>> was asked "why did you send all this stuff", I don't believe I personally
>> have strong enough arguments to justify the changes - you're in a better
>> position than I to make that case.  Plus this is a *large* patchset, and it
>> plays in an area where Linus is known to have, err, opinions.
>
> Not sure whether it got mentioned but one area where huge pages are rather
> useful are apps/middleware that does some sort of GC with tons of RAM.

Dunno what your measure of "tons of RAM" is but yeah, IIRC when you go
above 2 GB or so, huge pages are usually a big win.

> There the 512x reduction in remapping and TLB flush costs (not just TLB miss
> costs) obviously makes for a big difference not just in straight
> performance/latency but also in cache footprint. AFAIK most GC concepts today
> (that cover many gigabytes of memory) are limited by remap and TLB flush
> performance.

Which remap are you referring to?

AFAIK, most modern GCs split memory in young and old generation
"zones" and _copy_ surviving objects from the former to the latter if
their lifetime exceeds some threshold. The JVM keeps scanning the
smaller young generation very aggressively which causes TLB pressure
and scans the larger old generation less often.

                       Pekka


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:26     ` Pekka Enberg
@ 2010-04-05 20:32       ` Linus Torvalds
  2010-04-05 20:46         ` Pekka Enberg
  2010-04-05 21:01         ` Chris Mason
  0 siblings, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-05 20:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Mon, 5 Apr 2010, Pekka Enberg wrote:
> 
> AFAIK, most modern GCs split memory in young and old generation
> "zones" and _copy_ surviving objects from the former to the latter if
> their lifetime exceeds some threshold. The JVM keeps scanning the
> smaller young generation very aggressively which causes TLB pressure
> and scans the larger old generation less often.

.. my only input to this is: numbers talk, bullsh*t walks. 

I'm not interested in micro-benchmarks, either. I can show infinite TLB 
walk improvement in a microbenchmark.

In order for me to be interested in any complex hugetlb crap, I want real 
numbers from real applications. Not "it takes this many cycles to walk a 
page table", or "it could matter under these circumstances".

I also want those real numbers _not_ directly after a clean reboot, but 
after running other real loads on the machine that have actually used up 
all the memory and filled it with things like dentry data etc. The "right 
after boot" case is totally pointless, since a huge part of hugetlb 
entries is the ability to allocate those physically contiguous and 
well-aligned regions.

Until then, it's just extra complexity for no actual gain.

Oh, and while I'm at it, I want a pony too.

			Linus

PS. I also think the current odd anonvma thing is _way_ more important. 
That was a feature that actually improved AIM throughput by 300%. Now, 
admittedly that's not a real load either, but at least it's not a total 
microbenchmark.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:32       ` Linus Torvalds
@ 2010-04-05 20:46         ` Pekka Enberg
  2010-04-05 20:58           ` Linus Torvalds
  2010-04-05 21:01         ` Chris Mason
  1 sibling, 1 reply; 205+ messages in thread
From: Pekka Enberg @ 2010-04-05 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Linus,

On Mon, Apr 5, 2010 at 11:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> AFAIK, most modern GCs split memory in young and old generation
>> "zones" and _copy_ surviving objects from the former to the latter if
>> their lifetime exceeds some threshold. The JVM keeps scanning the
>> smaller young generation very aggressively which causes TLB pressure
>> and scans the larger old generation less often.
>
> .. my only input to this is: numbers talk, bullsh*t walks.
>
> I'm not interested in micro-benchmarks, either. I can show infinite TLB
> walk improvement in a microbenchmark.
>
> In order for me to be interested in any complex hugetlb crap, I want real
> numbers from real applications. Not "it takes this many cycles to walk a
> page table", or "it could matter under these circumstances".
>
> I also want those real numbers _not_ directly after a clean reboot, but
> after running other real loads on the machine that have actually used up
> all the memory and filled it with things like dentry data etc. The "right
> after boot" case is totally pointless, since a huge part of hugetlb
> entries is the ability to allocate those physically contiguous and
> well-aligned regions.
>
> Until then, it's just extra complexity for no actual gain.
>
> Oh, and while I'm at it, I want a pony too.

Unfortunately I wasn't able to find a pony on Google but here are some
huge page numbers if you're interested:

  http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html

I'm actually a bit surprised you find the issue controversial, Linus. I
am not a real JVM hacker (although I could probably play one on TV)
but the "hugepages are a big win" argument seems pretty logical for
any GC heavy activity. Wouldn't be the first time I was wrong, though.

                        Pekka


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:46         ` Pekka Enberg
@ 2010-04-05 20:58           ` Linus Torvalds
  2010-04-05 21:54             ` Ingo Molnar
  2010-04-05 23:21             ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-05 20:58 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Mon, 5 Apr 2010, Pekka Enberg wrote:
> 
> Unfortunately I wasn't able to find a pony on Google but here are some
> huge page numbers if you're interested:

You missed the point.

Those numbers weren't done with the patches in question. They weren't done 
with the magic new code that can handle fragmentation and swapping. They 
are simply not relevant to any of the complex code under discussion.

The thing you posted is already doable (and done) using the existing hacky 
(but at least unsurprising) preallocation crud. We know that works. That's 
never been the issue.

What I'm asking for is this thing called "Does it actually work in 
REALITY". That's my point about "not just after a clean boot".

Just to really hit the issue home, here's my current machine:

	[root@i5 ~]# free
	             total       used       free     shared    buffers     cached
	Mem:       8073864    1808488    6265376          0      75480    1018412
	-/+ buffers/cache:     714596    7359268
	Swap:     10207228      12848   10194380

Look, I have absolutely _sh*tloads_ of memory, and I'm not using it. 
Really. I've got 8GB in that machine, it's just not been doing much more 
than a few "git pull"s and "make allyesconfig" runs to check the current 
kernel and so it's got over 6GB free. 

So I'm bound to have _tons_ of 2M pages, no?

No. Lookie here:

	[344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
	[344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
	[344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB

just to help you parse that: this is a _lightly_ loaded machine. It's been 
up for about four days. And look at it.

In case you can't read it, the relevant part is this part:

	DMA: .. 1*2048kB 3*4096kB
	DMA32: .. 1*2048kB 0*4096kB
	Normal: .. 3*2048kB 0*4096kB

there is just a _small handful_ of 2MB pages. Seriously. On a machine with 
8 GB of RAM, and three quarters of it free, and there is just a couple of 
contiguous 2MB regions. Note, that's _MB_, not GB.
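
(To decode those lines: each "N*SIZEkB" entry is N free blocks of that 
size, and a 2M transparent hugepage needs an order-9 block, i.e. the 
2048kB column; as a sketch of the arithmetic:

	#define PAGE_SHIFT	12	/* 4k base pages */
	#define HPAGE_PMD_SHIFT	21	/* 2M huge pages */
	#define HPAGE_PMD_ORDER	(HPAGE_PMD_SHIFT - PAGE_SHIFT)	/* 9 */

so "3*2048kB 0*4096kB" in Normal means three usable candidates in the 
whole zone.)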

And don't tell me that these things are easy to fix. Don't tell me that 
the current VM is quite clean and can be harmlessly extended to deal with 
this all. Just don't. Not when we currently have a totally unexplained 
regression in the VM from the last scalability thing we did.

		Linus


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:32       ` Linus Torvalds
  2010-04-05 20:46         ` Pekka Enberg
@ 2010-04-05 21:01         ` Chris Mason
  2010-04-05 21:18           ` Avi Kivity
  1 sibling, 1 reply; 205+ messages in thread
From: Chris Mason @ 2010-04-05 21:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, Andrea Arcangeli,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Apr 2010, Pekka Enberg wrote:
> > 
> > AFAIK, most modern GCs split memory in young and old generation
> > "zones" and _copy_ surviving objects from the former to the latter if
> > their lifetime exceeds some threshold. The JVM keeps scanning the
> > smaller young generation very aggressively which causes TLB pressure
> > and scans the larger old generation less often.
> 
> .. my only input to this is: numbers talk, bullsh*t walks. 
> 
> I'm not interested in micro-benchmarks, either. I can show infinite TLB 
> walk improvement in a microbenchmark.

Ok, I'll bite.  I should be able to get some database workloads with
hugepages, transparent hugepages, and without any hugepages at all.

-chris


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 21:01         ` Chris Mason
@ 2010-04-05 21:18           ` Avi Kivity
  2010-04-05 21:33             ` Linus Torvalds
  2010-04-06  8:30             ` Mel Gorman
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-05 21:18 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On 04/06/2010 12:01 AM, Chris Mason wrote:
> On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote:
>    
>>
>> On Mon, 5 Apr 2010, Pekka Enberg wrote:
>>      
>>> AFAIK, most modern GCs split memory in young and old generation
>>> "zones" and _copy_ surviving objects from the former to the latter if
>>> their lifetime exceeds some threshold. The JVM keeps scanning the
>>> smaller young generation very aggressively which causes TLB pressure
>>> and scans the larger old generation less often.
>>>        
>> .. my only input to this is: numbers talk, bullsh*t walks.
>>
>> I'm not interested in micro-benchmarks, either. I can show infinite TLB
>> walk improvement in a microbenchmark.
>>      
> Ok, I'll bite.  I should be able to get some database workloads with
> hugepages, transparent hugepages, and without any hugepages at all.
>    

Please run them in conjunction with Mel Gorman's memory compaction, 
otherwise fragmentation may prevent huge pages from being instantiated.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 21:18           ` Avi Kivity
@ 2010-04-05 21:33             ` Linus Torvalds
  2010-04-05 22:33               ` Chris Mason
  2010-04-06  8:30             ` Mel Gorman
  1 sibling, 1 reply; 205+ messages in thread
From: Linus Torvalds @ 2010-04-05 21:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Pekka Enberg, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura



On Tue, 6 Apr 2010, Avi Kivity wrote:
> 
> Please run them in conjunction with Mel Gorman's memory compaction, otherwise
> fragmentation may prevent huge pages from being instantiated.

.. and then please run them in conjunction with somebody doing "make -j16" 
on the kernel at the same time, or just generally doing real work for a 
few days before hand.

The point is, there are benchmarks, and then there is real life. If we 
_know_ some feature only works for benchmarks, it should be discounted as 
such. It's like a compiler that is tuned for specint - at some point the 
numbers lose a lot of their meaning.

			Linus


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:58           ` Linus Torvalds
@ 2010-04-05 21:54             ` Ingo Molnar
  2010-04-05 23:21             ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-05 21:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Andrew Morton, Andrea Arcangeli, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> there is just a _small handful_ of 2MB pages. Seriously. On a machine with 8 
> GB of RAM, and three quarters of it free, and there is just a couple of 
> contiguous 2MB regions. Note, that's _MB_, not GB.
>
> And don't tell me that these things are easy to fix. Don't tell me that the 
> current VM is quite clean and can be harmlessly extended to deal with this 
> all. Just don't. Not when we currently have a totally unexplained regression 
> in the VM from the last scalability thing we did.

I think those are very real worries.

The only point i wanted to make is that the numbers are real as well and go 
beyond what i saw characterised in the first email.

(It might still not be enough to tip the scale in the direction of 'we really 
want to do this' though.)

Thanks,

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 21:33             ` Linus Torvalds
@ 2010-04-05 22:33               ` Chris Mason
  0 siblings, 0 replies; 205+ messages in thread
From: Chris Mason @ 2010-04-05 22:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Pekka Enberg, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Mon, Apr 05, 2010 at 02:33:29PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 6 Apr 2010, Avi Kivity wrote:
> > 
> > Please run them in conjunction with Mel Gorman's memory compaction, otherwise
> > fragmentation may prevent huge pages from being instantiated.
> 
> .. and then please run them in conjunction with somebody doing "make -j16" 
> on the kernel at the same time, or just generally doing real work for a 
> few days before hand.
> 
> The point is, there are benchmarks, and then there is real life. If we 
> _know_ some feature only works for benchmarks, it should be discounted as 
> such. It's like a compiler that is tuned for specint - at some point the 
> numbers lose a lot of their meaning.

Sure, I'll do my best to be brutal.  Avi, Andrea, please fire off a
git tree or patch bomb to me for benchmarking.  Please include all the
patches you think it needs to go fast, including any config hints etc...

If you'd like numbers with and without a given set of patches, just let
me know.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 20:58           ` Linus Torvalds
  2010-04-05 21:54             ` Ingo Molnar
@ 2010-04-05 23:21             ` Andrea Arcangeli
  2010-04-06  0:26               ` Linus Torvalds
  2010-04-06  9:30               ` Mel Gorman
  1 sibling, 2 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-05 23:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Linus,

On Mon, Apr 05, 2010 at 01:58:57PM -0700, Linus Torvalds wrote:
> What I'm asking for is this thing called "Does it actually work in 
> REALITY". That's my point about "not just after a clean boot".
> 
> Just to really hit the issue home, here's my current machine:
> 
> 	[root@i5 ~]# free
> 	             total       used       free     shared    buffers     cached
> 	Mem:       8073864    1808488    6265376          0      75480    1018412
> 	-/+ buffers/cache:     714596    7359268
> 	Swap:     10207228      12848   10194380
> 
> Look, I have absolutely _sh*tloads_ of memory, and I'm not using it. 
> Really. I've got 8GB in that machine, it's just not been doing much more 
> than a few "git pull"s and "make allyesconfig" runs to check the current 
> kernel and so it's got over 6GB free. 
> 
> So I'm bound to have _tons_ of 2M pages, no?
> 
> No. Lookie here:
> 
> 	[344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
> 	[344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
> 	[344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB
> 
> just to help you parse that: this is a _lightly_ loaded machine. It's been 
> up for about four days. And look at it.
> 
> In case you can't read it, the relevant part is this part:
> 
> 	DMA: .. 1*2048kB 3*4096kB
> 	DMA32: .. 1*2048kB 0*4096kB
> 	Normal: .. 3*2048kB 0*4096kB
> 
> there is just a _small handful_ of 2MB pages. Seriously. On a machine with 
> 8 GB of RAM, and three quarters of it free, and there is just a couple of 
> contiguous 2MB regions. Note, that's _MB_, not GB.

What I can provide is my current status so far on my workstation:

$ free
             total       used       free     shared    buffers
             cached
Mem:       1923648    1410912     512736          0     332236
             391000
-/+ buffers/cache:     687676    1235972
Swap:      4200960      14204    4186756
$ cat /proc/buddyinfo 
Node 0, zone      DMA     46     34     30     12     16     11     10     5      0      1      0 
Node 0, zone    DMA32     33    355    352    129     46   1307    751   225      9      1      0 
$ uptime
 00:06:54 up 10 days,  5:10,  3 users,  load average: 0.00, 0.00, 0.00
$ grep Anon /proc/meminfo
AnonPages:         78036 kB
AnonHugePages:    100352 kB

And on my laptop:

$ free
             total       used       free     shared    buffers
             cached
Mem:       3076948    1964136    1112812          0      91920
             297212
-/+ buffers/cache:    1575004    1501944
Swap:      2939888      17668    2922220
$ cat /proc/buddyinfo
Node 0, zone      DMA     26      9      8      3      3      2      2     1      1      3      1
Node 0, zone    DMA32    840   2142   6455   5848   5156   2554    291    52     30      0      0
$ uptime
 00:08:21 up 17 days, 20:17,  5 users,  load average: 0.06, 0.01, 0.00
$ grep Anon /proc/meminfo 
AnonPages:        856332 kB
AnonHugePages:    272384 kB

this is with:

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
[yes] no

Currently the "defrag" sysfs control only toggles __GFP_WAIT from
on/off in huge_memory.c (details in the patch with subject
"transparent hugepage core" in the alloc_hugepage()
function). Toggling __GFP_WAIT is a joke right now.
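
(For readers not following the patchset, a minimal sketch of what that
toggle amounts to; the exact gfp mask and the HPAGE_PMD_ORDER name are
from my reading of the patch and may differ in detail:)

static inline struct page *alloc_hugepage(int defrag)
{
	/* start from the movable user mask, minus __GFP_WAIT */
	gfp_t gfp = (GFP_HIGHUSER_MOVABLE | __GFP_COMP | __GFP_NOWARN) &
		    ~__GFP_WAIT;

	if (defrag)
		gfp |= __GFP_WAIT;	/* allow synchronous reclaim */

	return alloc_pages(gfp, HPAGE_PMD_ORDER);	/* order 9 = 2M */
}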

The real deal to address your worry is first to run "hugeadm
--set-recommended-min_free_kbytes" and then to apply Mel's "memory
compaction" patches, which are a separate patchset.

I'm the consumer, Mel's the producer ;).

With virtual machines the host kernel doesn't need to live forever
(it has to be stable, but we can easily reboot it without the guest
noticing): we can migrate virtual machines to freshly booted new
hosts, voiding the whole producer issue. Furthermore, VMs are usually
started for the first time at host boot, and we want as much host
memory as possible backed by hugepages.

This is not to say that the producer isn't important or can't work;
Mel posted numbers that show it works, and we definitely want it to
work. I'm just trying to make the point that a good consumer of the
plentiful hugepages available at boot is useful even assuming the
producer never works or never gets merged (not the real-life case
we're dealing with!).

Initially we're going to take advantage of only the consumer in
production, exactly because it's already useful, even if we want to
take advantage of a smart runtime "producer" too as time goes on.
Migrating guests to produce hugepages certainly isn't the ideal way,
and I'm very confident that Mel's work is already filling the gap very
nicely.

The VM itself (regardless of whether the consumer is hugetlbfs or
transparent hugepage support) is evolving towards being able to
generate an endless amount of hugepages (of the 2M size; 1G is still
unthinkable because of the huge cost), as shown by the already-mainline
"hugeadm --set-recommended-min_free_kbytes". BTW, I think having this
10-liner algorithm in the userland hugeadm binary is wrong and it
should be a separate sysctl like "echo 1
>/sys/kernel/vm/set-recommended-min_free_kbytes", but that's offtopic
and an implementation detail... This is just to show they are already
addressing that stuff for hugetlbfs. So I just created a better
consumer of the stuff they make an effort to produce anyway (i.e. 2M
pages). The better the consumer we have of it in the kernel, the more
effort will be put into the producer.
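
(For reference, a reconstruction of roughly what that 10-liner
heuristic looks like as in-kernel code. This is my paraphrase of the
hugeadm logic, not a hunk from any posted patch, so the exact constants
and the setup_per_zone_wmarks() call are assumptions:)

/* Keep enough pageblocks free per zone that anti-fragmentation has
 * room to work: a reconstruction of the hugeadm heuristic. */
static void set_recommended_min_free_kbytes(void)
{
	struct zone *zone;
	unsigned long recommended_min;
	int nr_zones = 0;

	for_each_populated_zone(zone)
		nr_zones++;

	/* two free pageblocks per zone... */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

	/* ...plus slack so each migratetype can steal whole
	 * pageblocks instead of polluting foreign ones */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

	recommended_min <<= PAGE_SHIFT - 10;	/* pages -> kbytes */

	if (recommended_min > min_free_kbytes) {
		min_free_kbytes = recommended_min;
		setup_per_zone_wmarks();	/* recompute watermarks */
	}
}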

> And don't tell me that these things are easy to fix. Don't tell me that 
> the current VM is quite clean and can be harmlessly extended to deal with 
> this all. Just don't. Not when we currently have a totally unexplained 
> regression in the VM from the last scalability thing we did.

Well, the risk of regression with the consumer is little if it's
disabled with sysfs, so it'd be trivial to localize if it caused any
problem. About memory compaction, I think we should limit the
invocation of those new VM algorithms to hugetlbfs and transparent
hugepage support (and I already created the sysfs controls to
enable/disable those, so you can run transparent hugepage support with
or without the defrag feature). So all of this can be turned off at
runtime. You can run only the consumer, both consumer and producer, or
neither (and if neither, the risk of regression should be zero).
There's no point ever defragging if there is no consumer of 2M pages.
khugepaged should be able to invoke memory compaction comfortably in
its background defrag job if khugepaged/defrag is set to "yes".

I think worrying about the producer too much generates a
chicken-and-egg problem: without a heavy consumer in mainline, there's
little point for people to work on the producer. Note that creating a
good consumer wasn't an easy task; I did all I could to keep it
self-contained and I think I succeeded at that. My work as a result
created interest in improving the producer on Mel's side. I am sure
that if the consumer goes in, producing the stuff will also happen
without much problem.

My preferred merging path is to merge the consumer first, but I'm not
entirely against the other order either. Merging both at the same time
looks to me like unnecessary complexity going into the kernel at once,
and it'd make things less bisectable. But it wouldn't be impossible
either.

About the performance benefits, I posted some numbers on linux-mm, but
I'll collect them here (and this is after boot, with plenty of
hugepages). As a side note, in this first part please also note the
boost in the page fault rate (but this is really only for curiosity,
as it will only happen when hugepages are immediately available in the
buddy).

------------
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor
and the linux guest use this patch (though the guest will limit the
additional speedup to anonymous regions only for now...).  Even more
important is that the tlb miss handler is much slower on a NPT/EPT guest than
for a regular shadow paging or no-virtualization scenario. So maximizing the
amount of virtual memory cached by the TLB pays off significantly more with
NPT/EPT than without (even if there were no significant speedup in the
tlb-miss runtime).

[..]
Some performance result:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	if (!p) {
		perror("malloc");
		return 1;
	}

	/* first pass: page faults + allocation */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %llu\n",
	       (after.tv_sec-before.tv_sec)*1000000ULL +
	       after.tv_usec-before.tv_usec);

	/* later passes: memory already mapped, tlb misses only */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %llu\n",
	       (after.tv_sec-before.tv_sec)*1000000ULL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %llu\n",
	       (after.tv_sec-before.tv_sec)*1000000ULL +
	       after.tv_usec-before.tv_usec);

	/* touch one byte per 4k page: pure page-walk cost */
	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %llu\n",
	       (after.tv_sec-before.tv_sec)*1000000ULL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %llu\n",
	       (after.tv_sec-before.tv_sec)*1000000ULL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============
-------------
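
As an aside (my own sanity check, not part of the quoted changelog):
the 24 -> 19 -> 15 cacheline figures above match the usual cost model
for a 2-D nested page walk, where a guest walk of g levels under a
host walk of h levels touches g*(h+1) + h = (g+1)*(h+1) - 1
cachelines. A throwaway userland program to verify the arithmetic:

#include <stdio.h>

/* each of the g guest levels needs an h-level host walk plus the
 * entry read itself, and the final guest-physical address needs
 * one more host walk: g*(h+1) + h == (g+1)*(h+1) - 1 */
static int nested_walk_cachelines(int g, int h)
{
	return (g + 1) * (h + 1) - 1;
}

int main(void)
{
	printf("4+4 levels: %d\n", nested_walk_cachelines(4, 4)); /* 24 */
	printf("4+3 levels: %d\n", nested_walk_cachelines(4, 3)); /* 19: host 2M */
	printf("3+3 levels: %d\n", nested_walk_cachelines(3, 3)); /* 15: both 2M */
	return 0;
}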

This is a more interesting benchmark of kernel compile and some random
cpu bound dd command (not a microbenchmark like above):

-----------
This is a kernel build in a 2.6.31 guest, on a 2.6.34-rc1 host. KVM
run with "-drive cache=on,if=virtio,boot=on -smp 4 -m 2g -vnc :0"
(host has 4G of ram). CPU is Phenom (not II) with NPT (4 cores, 1
die). All reads are served from the host cache and the cpu overhead of
the I/O is reduced thanks to virtio. Workload is just "make clean
>/dev/null; time make -j20 >/dev/null". Results copied by hand because
I was logged in through vnc.

real 4m12.498s
user 14m28.106s
sys 1m26.721s

real 4m12.000s
user 14m27.850s
sys 1m25.729s

After the benchmark:

grep Anon /proc/meminfo 
AnonPages:        121300 kB
AnonHugePages:   1007616 kB
cat /debugfs/kvm/largepages 
2296

1.6G free in the guest and 1.5G free in the host.

Then on host:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled 

then I restart the VM and re-run the same workload:

real 4m25.040s
user 15m4.665s
sys 1m50.519s

real 4m29.653s
user 15m8.637s
sys 1m49.631s

(the guest kernel was not so recent and had no transparent hugepage
support: gcc normally won't take advantage of hugepages anyway
according to /proc/meminfo, so I made the comparison with a distro
guest kernel with my usual .config I use in kvm guests)

So the guest compiles the kernel 6% faster with hugepages, and the
results are trivially reproducible and stable enough (especially with
hugepages enabled; without them it varies from 4m24s to 4m30s, as I
tried a few more times without hugepages on NPT when userland wasn't
patched yet...).

Below is another test that takes advantage of hugepages in the guest
too, running the same 2.6.34-rc1 with transparent hugepage support in
both host and guest (this really shows the power of the KVM design: we
boost the hypervisor and we get a double boost for guest applications).

Workload: time dd if=/dev/zero of=/dev/null bs=128M count=100

Host hugepage no guest: 3.898
Host hugepage guest hugepage: 3.966 (-1.17%)
Host no hugepage no guest: 4.088 (-4.87%)
Host hugepage guest no hugepage: 4.312 (-10.1%)
Host no hugepage guest hugepage: 4.388 (-12.5%)
Host no hugepage guest no hugepage: 4.425 (-13.5%)

Workload: time dd if=/dev/zero of=/dev/null bs=4M count=1000

Host hugepage no guest: 1.207
Host hugepage guest hugepage: 1.245 (-3.14%)
Host no hugepage no guest: 1.261 (-4.47%)
Host hugepage guest no hugepage: 1.323 (-9.61%)
Host no hugepage guest hugepage: 1.371 (-13.5%)
Host no hugepage guest no hugepage: 1.398 (-15.8%)

I've no local EPT system to test on, so I may run these over vpn later
on some large EPT system (and surely there are better benchmarks than
a silly dd... but this is a start and shows even basic stuff gets the
boost).

The above is basically "home-workstation/laptop" coverage. I (partly)
intentionally ran these on a system with a ~$100 CPU and ~$50
motherboard, to show the absolute worst case, and to be sure that 100%
of home end users (running KVM) will take a measurable advantage from
this effort.

On huge systems the percentage boost is expected to be much bigger
than in the home-workstation test above, of course.
--------------


Again, gcc is kind of a worst case for this, but it still shows a
definite, significant and reproducible boost.

Also note that for non-virtualization usage (so outside of
MADV_HUGEPAGE), invoking memory compaction synchronously likely risks
losing CPU speed. khugepaged takes care of the long-lived allocations
of random tasks, and the only thing that should use memory compaction
synchronously is the page faults of regions marked MADV_HUGEPAGE. But
we may decide to invoke memory compaction only asynchronously and
never as a result of direct reclaim in process context, to avoid any
latency to guest operations. All that matters after boot is that
khugepaged can do its job; it's not urgent. When things are urgent,
migrating guests to a new cloud node is always possible.

I'd like to clarify that this whole work has been done without ever
making assumptions about virtual machines; I tried to make it as
universally useful as possible (and not just because we want the exact
same VM algorithms to trim one level of guest pagetables too, for a
cumulative boost fully exploiting the KVM design ;). I'm thrilled
Chris is going to run a host-only database test and I'm sure willing
to help with that.

Compacting everything that is "movable" is surely solvable from a
theoretical standpoint, and that includes all anonymous memory (huge
or not) and all cache. That alone accounts for a huge bulk of the
total memory of a system, so being able to mix it all will result in
the best behavior, which isn't possible to achieve with hugetlbfs
(memory that isn't allocated as anonymous memory can still be used as
cache for I/O). So in the very worst case, if everything else fails on
the producer front (again: not the case as far as I can tell!), what
should be reserved at boot is an amount of memory that limits the
unmovable parts, leaving the movable parts free to be allocated
dynamically without limitations, depending on the workload.

I'm quite sure Mel will be able to provide more details on his work,
which has already been reviewed in detail on linux-mm with lots of
positive feedback, which is why I expect zero problems on that side in
real life too (besides my theoretical argument in the previous
paragraph ;).

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 23:21             ` Andrea Arcangeli
@ 2010-04-06  0:26               ` Linus Torvalds
  2010-04-06  1:08                 ` [RFD] " Linus Torvalds
  2010-04-06  1:13                 ` Andrea Arcangeli
  2010-04-06  9:30               ` Mel Gorman
  1 sibling, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-06  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Tue, 6 Apr 2010, Andrea Arcangeli wrote:
>
> Some performance result:

Quite frankly, these "performance results" seem to be basically dishonest.

Judging by your numbers, the big win is apparently pre-populating the page 
tables; the "tlb miss" numbers you quote seem to be almost in the noise. 
IOW, we have

	memset page fault 1566023

vs

	memset page fault 2182476

looking like a major performance advantage, but then the actual usage is 
much less noticeable.

IOW, how much of the performance advantage would we get from a _much_ 
simpler patch to just much more aggressively pre-populate the page tables 
(especially for just anonymous pages, I assume) or even just fault pages 
in several at a time when you have lots of memory?

In particular, when you quote 6% improvement for a kernel compile, your 
own numbers make me seriously wonder how many percentage points you'd get 
from just faulting in 8 pages at a time when you have lots of memory free, 
and use a single 3-order allocation to get those eight pages?

Would that already shrink the difference between those "memset page 
faults" by a factor of eight?

See what I'm saying?  

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  0:26               ` Linus Torvalds
@ 2010-04-06  1:08                 ` Linus Torvalds
  2010-04-06  1:26                   ` Andrea Arcangeli
  2010-04-06  1:35                   ` Linus Torvalds
  2010-04-06  1:13                 ` Andrea Arcangeli
  1 sibling, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-06  1:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Mon, 5 Apr 2010, Linus Torvalds wrote:
> 
> In particular, when you quote 6% improvement for a kernel compile, your 
> own numbers make [me] seriously wonder how many percentage points you'd get 
> from just faulting in 8 pages at a time when you have lots of memory free, 
> and use a single 3-order allocation to get those eight pages?

THIS PATCH IS TOTALLY UNTESTED!

It's very very unlikely to work, but it compiles for me at least in one 
particular configuration. So it must be perfect. Ship it.

It basically tries to just fill in anonymous memory PTE entries roughly 
one cacheline at a time, avoiding extra page-faults and extra memory 
allocations.

It's probably buggy as hell, I don't dare try to actually boot the crap I 
write. It literally started out as a pseudo-code patch that I then ended 
up expanding until it compiled and then fixed up some corner cases in. 

IOW, it's not really a serious patch, although when I look at it, it 
doesn't really look all that horrible.

Now, I'm pretty sure that allocating the page with a single order-3 
allocation, and then treating it as 8 individual order-0 pages is broken 
and probably makes various things unhappy. That "make_single_page()" 
monstrosity may or may not be sufficient.

In other words, what I'm trying to say is: treat this patch as a request 
for discussion, rather than something that necessarily _works_. 

			Linus

---
 include/linux/gfp.h |    3 ++
 mm/memory.c         |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c      |    9 ++++++
 3 files changed, 81 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..2b8f42b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+#define GFP_USER_ORDER	(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -306,10 +307,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..7ad97cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2741,6 +2741,71 @@ out_release:
 	return ret;
 }
 
+static inline void make_single_page(struct page *page)
+{
+	set_page_count(page, 1);
+	set_page_private(page, 0);
+}
+
+/*
+ * See if we can optimistically fill eight pages at a time
+ */
+static int optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *bigpage;
+
+	/* Don't even bother if it's not writable */
+	if (!(vma->vm_flags & VM_WRITE))
+		return 0;
+
+	/* Are we ok wrt the vma boundaries? */
+	if ((address & (PAGE_MASK << 3)) < vma->vm_start)
+		return 0;
+	if ((address | ~(PAGE_MASK << 3)) > vma->vm_end)
+		return 0;
+
+	/*
+	 * Round down to an even 8-page boundary, and
+	 * optimistically (with no locking), check whether
+	 * it's all empty. Skip if we have it partly filled
+	 * in.
+	 *
+	 * 8 page table entries tends to be about a cacheline.
+	 */
+	page_table -= (address >> PAGE_SHIFT) & 7;
+	for (i = 0; i < 8; i++)
+		if (!pte_none(page_table[i]))
+			return 0;
+
+	/* Allocate the eight pages in one go, no warning or retrying */
+	bigpage = alloc_page_user_order(vma, address, 3);
+	if (!bigpage)
+		return 0;
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	for (i = 0; i < 8; i++) {
+		struct page *page = bigpage + i;
+
+		make_single_page(page);
+		if (pte_none(page_table[i])) {
+			pte_t pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_mkwrite(pte_mkdirty(pte));
+			set_pte_at(mm, address, page_table+i, pte);
+		} else {
+			__free_page(page);
+		}
+	}
+
+	/* The caller will unlock */
+	return 1;
+}
+
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2754,6 +2819,9 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t entry;
 
+	if (optimistic_fault(mm, vma, address, page_table, pmd))
+		goto update;
+
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
@@ -2790,6 +2858,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
+update:
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
 unlock:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..55a92bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
 }
 
+struct page *
+alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct zonelist *zl = policy_zonelist(GFP_USER_ORDER, pol);
+
+	return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl, pol);
+}
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  0:26               ` Linus Torvalds
  2010-04-06  1:08                 ` [RFD] " Linus Torvalds
@ 2010-04-06  1:13                 ` Andrea Arcangeli
  2010-04-06  1:38                   ` Linus Torvalds
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-06  1:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 05, 2010 at 05:26:15PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 6 Apr 2010, Andrea Arcangeli wrote:
> >
> > Some performance result:
> 
> Quite frankly, these "performance results" seem to be basically dishonest.
> 
> Judging by your numbers, the big win is apparently pre-populating the page 
> tables; the "tlb miss" numbers you quote seem to be almost in the noise. 
> IOW, we have
> 
> 	memset page fault 1566023
> 
> vs
> 
> 	memset page fault 2182476
> 
> looking like a major performance advantage, but then the actual usage is 
> much less noticeable.
> 
> IOW, how much of the performance advantage would we get from a _much_ 
> simpler patch to just much more aggressively pre-populate the page tables 
> (especially for just anonymous pages, I assume) or even just fault pages 
> in several at a time when you have lots of memory?

I had a prefaulting patch that also allocated a hugepage but only
mapped it with 2 ptes, 4 ptes, 8 ptes, up to 256 ptes using a sysctl,
until the memset faulted in the rest and that triggered another chunk
of prefaulting on the remaining hugepage. In the end it wasn't worth
it, so I went straight to a huge pmd immediately (even if initially I
worried about the more intensive clear-page in cow), which is hugely
simpler too and doesn't only provide a page-fault advantage.

> In particular, when you quote 6% improvement for a kernel compile, your 

The memset test you mention above was run on the host. The kernel
compile is run in a guest with an unmodified guest kernel, so the
kernel compile isn't mangling pagetables differently. The kernel
compile is run on two different host kernels, one with transparent
hugepages and one without; the guest kernel has no modifications at
all. No page fault ever happens on the host, only gcc runs in the
guest on an unmodified kernel that isn't using hugepages at all.

> own numbers make me seriously wonder how many percentage points you'd get 
> from just faulting in 8 pages at a time when you have lots of memory free, 
> and use a single 3-order allocation to get those eight pages?
> 
> Would that already shrink the difference between those "memset page 
> faults" by a factor of eight?
> 
> See what I'm saying?  

I see what you're saying, but that has nothing to do with the 6% boost.

In short: I first measured the page fault improvement on the host
(~+50% faster; sure, that has nothing to do with pmd_huge or the tlb
miss, and I said I mentioned it just for curiosity), then measured the
tlb miss improvement on the host (a few percent faster, as usual with
hugetlbfs), then measured the boost in the guest when the host uses
hugepages (with no guest kernel change at all, just the tlb miss going
faster in the guest, which boosts the guest kernel compile by 6%), and
then some other tests with dd in all combinations of host/guest using
hugepages or not, and also with dd run on bare metal with or without
hugepages.

As I said, gcc is a sort of worst case, so you can assume any guest
math will run 6% faster or more in the guest if the host runs with
transparent hugepages enabled (and there's memory compaction etc.).

The page fault speedup is a "nice addon" that has nothing to do with
the kernel compile improvement, because the compile was repeated many
times and the guest kernel memory was already faulted in. I only
pointed it out "for curiosity", as I wrote in the previous email.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  1:08                 ` [RFD] " Linus Torvalds
@ 2010-04-06  1:26                   ` Andrea Arcangeli
  2010-04-06  1:35                   ` Linus Torvalds
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-06  1:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 05, 2010 at 06:08:51PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> > 
> > In particular, when you quote 6% improvement for a kernel compile, your 
> > own numbers make [me] seriously wonder how many percentage points you'd get 
> > from just faulting in 8 pages at a time when you have lots of memory free, 
> > and use a single 3-order allocation to get those eight pages?
> 
> THIS PATCH IS TOTALLY UNTESTED!
> 
> It's very very unlikely to work, but it compiles for me at least in one 
> particular configuration. So it must be perfect. Ship it.
> 
> It basically tries to just fill in anonymous memory PTE entries roughly 
> one cacheline at a time, avoiding extra page-faults and extra memory 
> allocations.
> 
> It's probably buggy as hell, I don't dare try to actually boot the crap I 
> write. It literally started out as a pseudo-code patch that I then ended 
> up expanding until it compiled and then fixed up some corner cases in. 
> 
> IOW, it's not really a serious patch, although when I look at it, it 
> doesn't really look all that horrible.
> 
> Now, I'm pretty sure that allocating the page with a single order-3 
> allocation, and then treating it as 8 individual order-0 pages is broken 
> and probably makes various things unhappy. That "make_single_page()" 
> monstrosity may or may not be sufficient.
> 
> In other words, what I'm trying to say is: treat this patch as a request 
> for discussion, rather than something that necessarily _works_. 

This will provide a 0% speedup to a kernel compile in a guest where
transparent hugepage support (or hugetlbfs too) would provide a 6%
speedup.

I evaluated the prefault approach before I finalized my design, and
then generated a huge pmd once the whole hugepage was mapped. It's all
worthless complexity in my view.

In fact, except at boot time we'll likely not be interested in taking
advantage of this, as it is not a free optimization and it magnifies
the time it takes to clear-page/copy-page (which is why I tried to
only prefault a hugepage, and after benchmarking figured out it wasn't
worth it and would be hugely more complicated too). The only case
where it is worth mapping more than one 4k page is when we can take
advantage of the tlb-miss speedup and of the 2M tlb; otherwise it's
better to stick to 4k page faults, do a 4k clear-page/copy-page, and
not risk taking more than 4k of memory. And let khugepaged do the
rest.

I think I already mentioned it in the previous email, but seeing your
patch I feel obliged to re-post:

---------------
hugepages in the virtualization hypervisor (and also in the guest!)
are much more important than in a regular host not using
virtualization, because with NPT/EPT they decrease the tlb-miss
cacheline accesses from 24 to 19 in case only the hypervisor uses
transparent hugepages, and they decrease the tlb-miss cacheline
accesses from 19 to 15 in case both the linux hypervisor and the linux
guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...).  Even more important is
that the tlb miss handler is much slower on a NPT/EPT guest than for a
regular shadow paging or no-virtualization scenario. So maximizing the
amount of virtual memory cached by the TLB pays off significantly more
with NPT/EPT than without (even if there were no significant speedup
in the tlb-miss runtime).
----------------

This is in the changelog of the "transparent hugepage core" patch too
and here as well:

http://linux-mm.org/TransparentHugepage?action=AttachFile&do=get&target=transparent-hugepage.pdf

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [RFD] Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  1:08                 ` [RFD] " Linus Torvalds
  2010-04-06  1:26                   ` Andrea Arcangeli
@ 2010-04-06  1:35                   ` Linus Torvalds
  1 sibling, 0 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-06  1:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Mon, 5 Apr 2010, Linus Torvalds wrote:
> 
> THIS PATCH IS TOTALLY UNTESTED!

Ok, it was also crap. I tried to warn you. We actually have that 
"split_page()" function that does the right thing, I don't know why I 
didn't realize that.

And the lock was uninitialized for the optimistic case, because I had made 
that "clever optimization" to let the caller do the unlocking in the 
common path, but when I did that I didn't actually make sure that the 
caller had the right lock. Whee.

I'm a moron.

This is _still_ untested and probably horribly buggy, but at least it 
isn't *quite* as rough as the previous patch was.

		Linus

---
 include/linux/gfp.h |    3 ++
 mm/memory.c         |   65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c      |    9 +++++++
 3 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..2b8f42b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+#define GFP_USER_ORDER	(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -306,10 +307,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..4f1521e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2742,6 +2742,66 @@ out_release:
 }
 
 /*
+ * See if we can optimistically fill eight pages at a time
+ */
+static spinlock_t *optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *bigpage;
+
+	/* Don't even bother if it's not writable */
+	if (!(vma->vm_flags & VM_WRITE))
+		return NULL;
+
+	/* Are we ok wrt the vma boundaries? */
+	if ((address & (PAGE_MASK << 3)) < vma->vm_start)
+		return NULL;
+	if ((address | ~(PAGE_MASK << 3)) > vma->vm_end)
+		return NULL;
+
+	/*
+	 * Round down to an even 8-page boundary, and
+	 * optimistically (with no locking), check whether
+	 * it's all empty. Skip if we have it partly filled
+	 * in.
+	 *
+	 * 8 page table entries tends to be about a cacheline.
+	 */
+	page_table -= (address >> PAGE_SHIFT) & 7;
+	for (i = 0; i < 8; i++)
+		if (!pte_none(page_table[i]))
+			return NULL;
+
+	/* Allocate the eight pages in one go, no warning or retrying */
+	bigpage = alloc_page_user_order(vma, address, 3);
+	if (!bigpage)
+		return NULL;
+
+	split_page(bigpage, 3);
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	for (i = 0; i < 8; i++) {
+		struct page *page = bigpage + i;
+
+		if (pte_none(page_table[i])) {
+			pte_t pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_mkwrite(pte_mkdirty(pte));
+			set_pte_at(mm, address, page_table+i, pte);
+		} else {
+			__free_page(page);
+		}
+	}
+
+	/* The caller will unlock */
+	return ptl;
+}
+
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2754,6 +2814,10 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t entry;
 
+	ptl = optimistic_fault(mm, vma, address, page_table, pmd);
+	if (ptl)
+		goto update;
+
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
@@ -2790,6 +2854,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
+update:
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
 unlock:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..55a92bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
 }
 
+struct page *
+alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct zonelist *zl = policy_zonelist(GFP_USER_ORDER, pol);
+
+	return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl, pol);
+}
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  1:13                 ` Andrea Arcangeli
@ 2010-04-06  1:38                   ` Linus Torvalds
  2010-04-06  2:23                     ` Linus Torvalds
  0 siblings, 1 reply; 205+ messages in thread
From: Linus Torvalds @ 2010-04-06  1:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Tue, 6 Apr 2010, Andrea Arcangeli wrote:
>
> In short I first measured the page fault improvement in host (~+50%
> faster, sure that has nothing to do with pmd_huge or the tlb miss, I
> said I mentioned it just for curiosity in fact), then measured the tlb
> miss improvement in host (a few percent faster as usual with
> hugetlbfs) then measured the boost in guest if host uses hugepages
> (with no guest kernel change at all, just the tlb miss going faster in
> guest and that boosts the guest kernel compile 6%) and then some other
> test with dd with all combinations of host/guest using hugepages or
> not, and also with dd run on bare metal with or without hugepages.

Yeah, sorry. I misread your email - I noticed that 6% improvement for 
something that looked like a workload I might actually _care_ about, and 
didn't track the context enough to notice that it was just for the "host 
is using hugepages" case.

So I thought it was a more interesting load than it was. The 
virtualization "TLB miss is expensive" load I can't find it in myself to 
care about. "Get a better CPU" is my answer to that one,

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  1:38                   ` Linus Torvalds
@ 2010-04-06  2:23                     ` Linus Torvalds
  2010-04-06  5:25                       ` Nick Piggin
                                         ` (2 more replies)
  0 siblings, 3 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-06  2:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura



On Mon, 5 Apr 2010, Linus Torvalds wrote:
> 
> So I thought it was a more interesting load than it was. The 
> virtualization "TLB miss is expensive" load I can't find it in myself to 
> care about. "Get a better CPU" is my answer to that one,

[ Btw, I do realize that "better CPU" in this case may be "future CPU". I 
  just think that this is where better TLB's and using ASID's etc is 
  likely to be a much bigger deal than adding VM complexity. Kind of the 
  same way I think HIGHMEM was ultimately a failure, and the 4G:4G split 
  was an atrocity that should have been killed ]

Anyway. Since the prefaulting wasn't the point, I'm killing the patch. But 
since I actually tested it, and then I made it work, here's something that 
I will hereby throw away, but maybe somebody else would like to play with. 
It still gets the memcg accounting wrong, but it actually does seem to 
boot for me.

And it just might make page faults cheaper. We avoid the whole "drop the 
ptl and re-take it" for the optimistic case, for example. So maybe it is 
worth looking at, even though the 6% thing wasn't here.

		Linus

---
 include/linux/gfp.h |    4 ++
 mm/memory.c         |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c      |    9 +++++
 3 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..1b94d09 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,8 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
+#define GFP_USER_ORDER	(GFP_NOWAIT | __GFP_HARDWALL | __GFP_NOWARN | __GFP_NORETRY | \
+			 __GFP_HIGHMEM | __GFP_MOVABLE | __GFP_ZERO)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
 
 #ifdef CONFIG_NUMA
@@ -306,10 +308,12 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_user_order(struct vm_area_struct *, unsigned long, int);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_user_order(vma, addr, order) alloc_pages(GFP_USER_ORDER, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 1d2ea39..b2d5025 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2742,6 +2742,83 @@ out_release:
 }
 
 /*
+ * See if we can optimistically fill eight pages at a time
+ */
+static spinlock_t *optimistic_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *bigpage;
+
+	/* Don't even bother if it's not writable */
+	if (!(vma->vm_flags & VM_WRITE))
+		return NULL;
+
+	/*
+	 * The optimistic path doesn't want to drop the
+	 * page table map, so it can't allocate anon_vma's
+	 * etc.
+	 */
+	if (!vma->anon_vma)
+		return NULL;
+
+	/* Are we ok wrt the vma boundaries? */
+	if ((address & (PAGE_MASK << 3)) < vma->vm_start)
+		return NULL;
+	if ((address | ~(PAGE_MASK << 3)) > vma->vm_end)
+		return NULL;
+
+	/*
+	 * Round down to an even 8-page boundary, and
+	 * optimistically (with no locking), check whether
+	 * it's all empty. Skip if we have it partly filled
+	 * in.
+	 *
+	 * 8 page table entries tends to be about a cacheline.
+	 */
+	page_table -= (address >> PAGE_SHIFT) & 7;
+	for (i = 0; i < 8; i++)
+		if (!pte_none(page_table[i]))
+			return NULL;
+
+	/* Allocate the eight pages in one go, no warning or retrying */
+	bigpage = alloc_page_user_order(vma, address, 3);
+	if (!bigpage)
+		return NULL;
+
+	split_page(bigpage, 3);
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	address &= PAGE_MASK << 3;
+	for (i = 0; i < 8; i++) {
+		struct page *page = bigpage + i;
+
+		if (pte_none(page_table[i])) {
+			pte_t pte;
+
+			__SetPageUptodate(page);
+
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			page_add_new_anon_rmap(page, vma, address);
+
+			pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_mkwrite(pte_mkdirty(pte));
+			set_pte_at(mm, address, page_table+i, pte);
+		} else {
+			__free_page(page);
+		}
+		address += PAGE_SIZE;
+	}
+
+	/* The caller will unlock */
+	return ptl;
+}
+
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2754,6 +2831,10 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t entry;
 
+	ptl = optimistic_fault(mm, vma, address, page_table, pmd);
+	if (ptl)
+		goto update;
+
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
@@ -2790,6 +2871,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
+update:
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
 unlock:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..55a92bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1707,6 +1707,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
 }
 
+struct page *
+alloc_page_user_order(struct vm_area_struct *vma, unsigned long addr, int order)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	struct zonelist *zl = policy_zonelist(GFP_USER_ORDER, pol);
+
+	return __alloc_pages_nodemask(GFP_USER_ORDER, order, zl, pol);
+}
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  2:23                     ` Linus Torvalds
@ 2010-04-06  5:25                       ` Nick Piggin
  2010-04-06  9:08                       ` Ingo Molnar
  2010-04-06  9:55                       ` Avi Kivity
  2 siblings, 0 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-06  5:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 05, 2010 at 07:23:44PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> > 
> > So I thought it was a more interesting load than it was. The 
> > virtualization "TLB miss is expensive" load I can't find it in myself to 
> > care about. "Get a better CPU" is my answer to that one,
> 
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I 
>   just think that this is where better TLB's and using ASID's etc is 
>   likely to be a much bigger deal than adding VM complexity. Kind of the 
>   same way I think HIGHMEM was ultimately a failure, and the 4G:4G split 
>   was an atrocity that should have been killed ]

It's an interesting route to go down. With more and more virtualization,
we start to think about HV platforms as more legitimate targets for
large scale optimizations like this. On the other hand, hardware memory
virtualization is still quite young on x86 CPUs and there are still
hardware improvements down the line.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 21:18           ` Avi Kivity
  2010-04-05 21:33             ` Linus Torvalds
@ 2010-04-06  8:30             ` Mel Gorman
  2010-04-06 11:35               ` Chris Mason
  1 sibling, 1 reply; 205+ messages in thread
From: Mel Gorman @ 2010-04-06  8:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Mason, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 12:18:24AM +0300, Avi Kivity wrote:
> On 04/06/2010 12:01 AM, Chris Mason wrote:
>> On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote:
>>    
>>>
>>> On Mon, 5 Apr 2010, Pekka Enberg wrote:
>>>      
>>>> AFAIK, most modern GCs split memory in young and old generation
>>>> "zones" and _copy_ surviving objects from the former to the latter if
>>>> their lifetime exceeds some threshold. The JVM keeps scanning the
>>>> smaller young generation very aggressively which causes TLB pressure
>>>> and scans the larger old generation less often.
>>>>        
>>> .. my only input to this is: numbers talk, bullsh*t walks.
>>>
>>> I'm not interested in micro-benchmarks, either. I can show infinite TLB
>>> walk improvement in a microbenchmark.
>>>      
>> Ok, I'll bite.  I should be able to get some database workloads with
>> hugepages, transparent hugepages, and without any hugepages at all.
>>    
>
> Please run them in conjunction with Mel Gorman's memory compaction,  
> otherwise fragmentation may prevent huge pages from being instantiated.
>

Strictly speaking, compaction is not necessary to allocate huge pages.
What compaction gets you is

  o Lower latency and cost of huge page allocation
  o Works on swapless systems

What is important is that you run
hugeadm --set-recommended-min_free_kbytes
from the libhugetlbfs 2.8 package early in boot so that
anti-fragmentation is doing as good a job as possible. If one is very
curious, use the mm_page_alloc_extfrag tracepoint to trace how often
severe fragmentation-related events occur under default settings and
with min_free_kbytes set properly.

Without the compaction patches, allocating huge pages will occasionally be
*very* expensive, as a large number of pages will need to be reclaimed.
The most likely symptom is thrashing while the database starts up.
Allocation success rates will also be lower when under heavy load.

Running make -j16 at the same time is unlikely to make much of a
difference from a hugepage allocation point of view. The performance
figures will vary significantly of course as make competes with the
database for CPU time and other resources.

Finally, benchmarking with databases is not new as such -
http://lwn.net/Articles/378641/ . This was on fairly simple hardware
though as I didn't have access to hardware more suitable for database
workloads. If you are running with transparent huge pages though, be
sure to double check that huge pages are actually being used
transparently.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  2:23                     ` Linus Torvalds
  2010-04-06  5:25                       ` Nick Piggin
@ 2010-04-06  9:08                       ` Ingo Molnar
  2010-04-06  9:13                         ` Ingo Molnar
  2010-04-10 18:47                         ` Andrea Arcangeli
  2010-04-06  9:55                       ` Avi Kivity
  2 siblings, 2 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-06  9:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 5 Apr 2010, Linus Torvalds wrote:
> > 
> > So I thought it was a more interesting load than it was. The 
> > virtualization "TLB miss is expensive" load I can't find it in myself to 
> > care about. "Get a better CPU" is my answer to that one,
> 
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I 
>   just think that this is where better TLB's and using ASID's etc is 
>   likely to be a much bigger deal than adding VM complexity. Kind of the 
>   same way I think HIGHMEM was ultimately a failure, and the 4G:4G split 
>   was an atrocity that should have been killed ]

Both highmem and 4g:4g were failures (albeit highly practical failures, you 
have to admit) in the sense that their relevance faded over time, because 
they extended the practical limits of the steadily fading 32-bit world.

Both highmem and 4g:4g became less and less of an issue as hardware improved.

OTOH are you saying the same thing about huge pages? On what basis? Do you 
think it would be possible for hardware to 'discover' physically-contiguous 2M 
mappings and turn them into a huge TLB entry internally? [i'm not sure it's 
feasible even in future CPUs - and even if it is, the OS would still have to do 
the defrag and keep-them-2MB logic internally so there's not much difference.]

The numbers seem rather clear:

  http://lwn.net/Articles/378641/

Yes, some of it is benchmarketing (most benchmarks are), but a significant 
portion of it isn't: HPC processing, DB workloads and Java workloads.

Hugepages provide a 'final' performance boost in cases where there's no other 
software way left to speed up a given workload.

The goal of Andrea's and Mel's patch-set, to make this 'final performance 
boost' more practical seems like a valid technical goal.

We can still validly reject it all based on VM complexity (albeit the VM 
people wrote both the defrag part and the transparent usage part so all the 
patches are all real), but how can we legitimately reject the performance 
advantage?

I think the hugetlb situation is more similar to the block IO transition to 
larger sector sizes, or to the networking IO transition from 
host-side-everything to checksum-offload and then to TSO, than it is similar 
to highmem or 4g:4g.

In fact the whole maintenance thought process seems somewhat similar to the 
TSO situation: the networking folks first rejected TSO based on complexity 
arguments, but it was embraced after some time.

 	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  9:08                       ` Ingo Molnar
@ 2010-04-06  9:13                         ` Ingo Molnar
  2010-04-10 18:47                         ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-06  9:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Ingo Molnar <mingo@elte.hu> wrote:

> The numbers seem rather clear:
> 
>   http://lwn.net/Articles/378641/
> 
> Yes, some of it is benchmarketing (most benchmarks are), but a significant 
> portion of it isnt: HPC processing, DB workloads and Java workloads.

( I forgot to mention virtualization - but i guess we can leave that out of
  the list as uninteresting-for-now. )

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-05 23:21             ` Andrea Arcangeli
  2010-04-06  0:26               ` Linus Torvalds
@ 2010-04-06  9:30               ` Mel Gorman
  2010-04-06 10:32                 ` Theodore Tso
  1 sibling, 1 reply; 205+ messages in thread
From: Mel Gorman @ 2010-04-06  9:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 01:21:15AM +0200, Andrea Arcangeli wrote:
> Hi Linus,
> 
> On Mon, Apr 05, 2010 at 01:58:57PM -0700, Linus Torvalds wrote:
> > What I'm asking for is this thing called "Does it actually work in 
> > REALITY". That's my point about "not just after a clean boot".
> > 
> > Just to really hit the issue home, here's my current machine:
> > 
> > 	[root@i5 ~]# free
> > 	             total       used       free     shared    buffers     cached
> > 	Mem:       8073864    1808488    6265376          0      75480    1018412
> > 	-/+ buffers/cache:     714596    7359268
> > 	Swap:     10207228      12848   10194380
> > 
> > Look, I have absolutely _sh*tloads_ of memory, and I'm not using it. 
> > Really. I've got 8GB in that machine, it's just not been doing much more 
> > than a few "git pull"s and "make allyesconfig" runs to check the current 
> > kernel and so it's got over 6GB free. 
> > 
> > So I'm bound to have _tons_ of 2M pages, no?
> > 
> > No. Lookie here:
> > 
> > 	[344492.280001] DMA: 1*4kB 1*8kB 1*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15836kB
> > 	[344492.280020] DMA32: 17516*4kB 19497*8kB 18318*16kB 15195*32kB 10332*64kB 5163*128kB 1371*256kB 123*512kB 2*1024kB 1*2048kB 0*4096kB = 2745528kB
> > 	[344492.280027] Normal: 57295*4kB 66959*8kB 39639*16kB 29486*32kB 10483*64kB 2366*128kB 398*256kB 100*512kB 27*1024kB 3*2048kB 0*4096kB = 3503268kB
> > 
> > just to help you parse that: this is a _lightly_ loaded machine. It's been 
> > up for about four days. And look at it.
> > 
> > In case you can't read it, the relevant part is this part:
> > 
> > 	DMA: .. 1*2048kB 3*4096kB
> > 	DMA32: .. 1*2048kB 0*4096kB
> > 	Normal: .. 3*2048kB 0*4096kB
> > 
> > there is just a _small handful_ of 2MB pages. Seriously. On a machine with 
> > 8 GB of RAM, and three quarters of it free, and there is just a couple of 
> > contiguous 2MB regions. Note, that's _MB_, not GB.
> 

The kernel you are using is presumably fairly recent so it has
anti-fragmentation applied.

The point of anti-frag is not to keep fragmentation low at all times but to
have the system in a state where fragmentation can be dealt with.  Hence,
buddyinfo is rarely useful for figuring out "how many huge pages can I
allocate?" In the past when I was measuring fragmentation at a given time,
I used both buddyinfo and /proc/kpageflags to check the state of the system.
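
A rough way to read buddyinfo for this purpose (assuming x86-64, where order
9 is 2MB and buddyinfo lists orders 0-10 in the last 11 columns; the sample
output below just reuses the numbers you quoted):

  $ awk '/zone/ { print $2, $4, "free 2MB+ blocks:", $14 + 2 * $15 }' /proc/buddyinfo
  0, DMA free 2MB+ blocks: 7
  0, DMA32 free 2MB+ blocks: 1
  0, Normal free 2MB+ blocks: 3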

There is a good chance you could allocate a decent percentage of
memory as huge pages but as you are unlikely to have run hugeadm
--set-recommended-min_free_kbytes early in boot, it is also likely to thrash
heavily and the success rates will not be very impressive.

The min_free_kbytes value is really important. In the past I've used the
mm_page_alloc_extfrag tracepoint to measure its effect. With default settings,
under heavy loads, the event would trigger hundreds of thousands of times. With
set-recommended-min_free_kbytes, it would trigger tens or maybe hundreds of
times under the same situations and the bulk of those events were not severe.

> What I can provide is my current status so far on workstation:
> 
> $ free
>              total       used       free     shared    buffers
>              cached
> Mem:       1923648    1410912     512736          0     332236
>              391000
> -/+ buffers/cache:     687676    1235972
> Swap:      4200960      14204    4186756
> $ cat /proc/buddyinfo 
> Node 0, zone      DMA     46     34     30     12     16     11     10     5      0      1      0 
> Node 0, zone    DMA32     33    355    352    129     46   1307    751   225      9      1      0 
> $ uptime
>  00:06:54 up 10 days,  5:10,  3 users,  load average: 0.00, 0.00, 0.00
> $ grep Anon /proc/meminfo
> AnonPages:         78036 kB
> AnonHugePages:    100352 kB
> 
> And laptop:
> 
> $ free
>              total       used       free     shared    buffers
>              cached
> Mem:       3076948    1964136    1112812          0      91920
>              297212
> -/+ buffers/cache:    1575004    1501944
> Swap:      2939888      17668    2922220
> $ cat /proc/buddyinfo
> Node 0, zone      DMA     26      9      8      3      3      2      2     1      1      3      1
> Node 0, zone    DMA32    840   2142   6455   5848   5156   2554    291    52     30      0      0
> $ uptime
>  00:08:21 up 17 days, 20:17,  5 users,  load average: 0.06, 0.01, 0.00
> $ grep Anon /proc/meminfo 
> AnonPages:        856332 kB
> AnonHugePages:    272384 kB
> 
> this is with:
> 
> $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always madvise [never]
> $ cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
> [yes] no
> 
> Currently the "defrag" sysfs control only toggles __GFP_WAIT
> on/off in huge_memory.c (details in the patch with subject
> "transparent hugepage core" in the alloc_hugepage()
> function). Toggling __GFP_WAIT is a joke right now.
> 
> The real deal to address your worry is first to run "hugeadm
> --set-recommended-min_free_kbytes" and to apply Mel's patches called
> "memory compaction" which is a separate patchset.
> 

The former is critical; the latter is not strictly necessary but it will
reduce the cost of hugepage allocation significantly, increase the success
rates slightly when under load and work on swapless systems. It's worth
applying both but transparent hugepage support also stands on its own.
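
As an aside, the quoted "defrag" toggle amounts to something like the sketch
below; the function name is from the mail above, but HPAGE_PMD_ORDER and the
body are a plausible reconstruction rather than the actual patch code:

static inline struct page *alloc_hugepage(int defrag)
{
	/* GFP_HIGHUSER_MOVABLE includes __GFP_WAIT; mask it out when
	 * defrag is disabled so the allocation never blocks in reclaim. */
	gfp_t gfp_mask = (GFP_HIGHUSER_MOVABLE | __GFP_COMP) &
			 ~(defrag ? 0 : __GFP_WAIT);

	return alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
}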

> I'm the consumer, Mel's the producer ;).
> 
> With virtual machines the host kernel doesn't need to live forever (it
> has to be stable but we can easily reboot it without the guest noticing);
> we can migrate virtual machines to freshly booted new hosts, voiding the
> whole producer issue. Furthermore VMs are usually started for the first
> time at host boot time, and we want as much memory as possible
> backed by hugepages in the host.
> 
> This is not to say that the producer isn't important or can't work,
> Mel posted numbers that show it works, and we definitely want it to
> work, but I'm just trying to make a point that a good consumer of
> plenty of hugepages available at boot is useful even assuming the
> producer won't ever work or won't ever get it (not the real life case
> we're dealing with!).
> 

Most recent figures on huge page allocation under load are at
http://lkml.org/lkml/2010/4/2/146. It includes data on the hugepage
allocation latency on vanilla kernels and without compaction.

> Initially we're going to take advantage of only the consumer in
> production exactly because it's already useful, even if we want to
> take advantage of a smart runtime "producer" too later on as time goes
> on. Migrating guests to produce hugepages isn't the ideal way for sure
> and I'm very confident that Mel's work is already filling the gap very
> nicely.
> 
> The VM itself (regardless of whether the consumer is hugetlbfs or transparent
> hugepage support) is evolving towards being able to generate endless
> amounts of hugepages (in the 2M size; 1G is still unthinkable because of the
> huge cost) as shown by the already-mainline "hugeadm
> --set-recommended-min_free_kbytes". BTW, I think having this 10-liner
> algorithm in userland hugeadm binary is wrong and it should be a
> separate sysctl like "echo 1
> >/sys/kernel/vm/set-recommended-min_free_kbytes", but that's offtopic
> and an implementation detail... This is just to show they are already
> addressing that stuff for hugetlbfs. So I just created a better
> consumer for the stuff they make an effort to produce anyway (i.e. 2M
> pages). The better consumer we have of it in the kernel, the more
> effort will be put into the producer.
> 
> > And don't tell me that these things are easy to fix. Don't tell me that 
> > the current VM is quite clean and can be harmlessly extended to deal with 
> > this all. Just don't. Not when we currently have a totally unexplained 
> > regression in the VM from the last scalability thing we did.
> 
> Well the risk of regression with the consumer is small if it's disabled
> via sysfs, so it'd be trivial to localize if it caused any
> problem. About memory compaction I think we should limit the
> invocation of those new VM algorithms to hugetlbfs and transparent
> hugepage support (and I already created the sysfs controls to
> enable/disable those so you can run transparent hugepage support with
> or without defrag feature).

This effectively happens with the compaction patches as of V7. It only
triggers for orders > PAGE_ALLOC_COSTLY_ORDER which in practice is
mostly hugetlbfs with an occasional bit of madness from a very small
number of devices.
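
In allocator terms the gate is roughly the sketch below (PAGE_ALLOC_COSTLY_ORDER
is 3 in mainline; run_compaction is a hypothetical stand-in for the series'
real entry point):

	/* Only consider compaction for expensive, high-order allocations;
	 * a 2M huge page is order 9 on x86-64. */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		run_compaction(zonelist, order, gfp_mask);	/* hypothetical name */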

> So all of this can be turned off at
> runtime. You can run only the consumer, both consumer and producer, or
> none (and if none, risk of regression should be zero). There's no
> point to ever defrag if there is no consumer of 2M pages. khugepaged
> should be able to invoke memory compaction comfortably in the defrag
> job in the background if khugepaged/defrag is set to "yes".
> 
> I think worrying about the producer too much generates a chicken-and-egg
> problem: without a heavy consumer in mainline, there's little point
> for people to work on the producer.

The other producer I have in mind for compaction in particular is huge
page allocation at runtime on swapless systems. hugeadm has the feature
of temporarily adding swap while it resizes the pool and, while that works,
it's less than ideal because it still requires a local disk. KVM using
it for virtual guests would be a heavier user.

> Note that creating a good producer
> wasn't an easy task. I did all I could to keep it self-contained and I
> think I succeeded at that. My work as a result created interest in
> improving the producer on Mel's side. I am sure if the consumer goes
> in, producing the stuff will also happen without many problems.
> 
> My preferred merging path is to merge the consumer first. But then
> I'm not entirely against the other order too. Merging both at the same
> time looks to me like unnecessary complexity merged into the kernel at
> once, and it'd make things less bisectable. But it wouldn't be
> impossible either.
> 
> About the performance benefits I posted some numbers in linux-mm, but
> I'll collect them here (and this is after boot with plenty of
> hugepages). As a side note, in this first part please note also the
> boost in the page fault rate (but this is really only a curiosity, as
> this will only happen when hugepages are immediately available in the
> buddy).
> 
> ------------
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> [..]
> Some performance result:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
> -------------
> 
> This is a more interesting benchmark of kernel compile and some random
> cpu bound dd command (not a microbenchmark like above):
> 
> -----------
> This is a kernel build in a 2.6.31 guest, on a 2.6.34-rc1 host. KVM
> run with "-drive cache=on,if=virtio,boot=on and -smp 4 -m 2g -vnc :0"
> (host has 4G of ram). CPU is Phenom (not II) with NPT (4 cores, 1
> die). All reads are provided from host cache and cpu overhead of the
> I/O is reduced thanks to virtio. Workload is just a "make clean
> >/dev/null; time make -j20 >/dev/null". Results copied by hand because
> I logged through vnc.
> 
> real 4m12.498s
> user 14m28.106s
> sys 1m26.721s
> 
> real 4m12.000s
> user 14m27.850s
> sys 1m25.729s
> 
> After the benchmark:
> 
> grep Anon /proc/meminfo 
> AnonPages:        121300 kB
> AnonHugePages:   1007616 kB
> cat /debugfs/kvm/largepages 
> 2296
> 
> 1.6G free in guest and 1.5free in host.
> 
> Then on host:
> 
> # echo never > /sys//kernel/mm/transparent_hugepage/enabled 
> # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled 
> 
> then I restart the VM and re-run the same workload:
> 
> real 4m25.040s
> user 15m4.665s
> sys 1m50.519s
> 
> real 4m29.653s
> user 15m8.637s
> sys 1m49.631s
> 
> (guest kernel was not so recent and it had no transparent hugepage
> support because gcc normally won't take advantage of hugepages
> according to /proc/meminfo, so I made the comparison with a distro
> guest kernel with my usual .config I use in kvm guests)
> 
> So the guest compiles the kernel 6% faster with hugepages and the results
> are trivially reproducible and stable enough (especially with hugepages
> enabled; without them it varies from 4m24s to 4m30s as I tried a few times
> more without hugepages in NPT when userland wasn't patched yet...).
> 
> Below is another test that takes advantage of hugepages in the guest too,
> running the same 2.6.34-rc1 with transparent hugepage support in both
> host and guest. (this really shows the power of the KVM design: we boost
> the hypervisor and we get a double boost for guest applications)
> 
> Workload: time dd if=/dev/zero of=/dev/null bs=128M count=100
> 
> Host hugepage no guest: 3.898
> Host hugepage guest hugepage: 3.966 (-1.17%)
> Host no hugepage no guest: 4.088 (-4.87%)
> Host hugepage guest no hugepage: 4.312 (-10.1%)
> Host no hugepage guest hugepage: 4.388 (-12.5%)
> Host no hugepage guest no hugepage: 4.425 (-13.5%)
> 
> Workload: time dd if=/dev/zero of=/dev/null bs=4M count=1000
> 
> Host hugepage no guest: 1.207
> Host hugepage guest hugepage: 1.245 (-3.14%)
> Host no hugepage no guest: 1.261 (-4.47%)
> Host no hugepage guest no hugepage: 1.323 (-9.61%)
> Host no hugepage guest hugepage: 1.371 (-13.5%)
> Host no hugepage guest no hugepage: 1.398 (-15.8%)
> 
> I've no local EPT system to test so I may run them over vpn later on
> some large EPT system (and surely there are better benchmarks than a silly
> dd... but this is a start and shows even basic stuff gets the boost).
> 
> The above is basically "home-workstation/laptop" coverage. I
> (partly) intentionally ran these on a system that has a ~$100 CPU and
> ~$50 motherboard, to show the absolute worst case, to be sure that
> 100% of home end users (running KVM) will take a measurable advantage
> from this effort.
> 
> On huge systems the percentage boost is expected to be much bigger than
> in the home-workstation test above, of course.
> --------------
> 
> 
> Again gcc is a kind of worst case for it, but it still shows a
> definite, significant and reproducible boost.
> 
> Also note that for non-virtualization usage (so outside of
> MADV_HUGEPAGE), invoking memory compaction synchronously likely
> risks losing CPU speed. khugepaged takes care of long-lived
> allocations of random tasks and the only place to use memory
> compaction synchronously could be the page faults of regions marked
> MADV_HUGEPAGE. But we may decide to only invoke memory compaction
> asynchronously and never as a result of direct reclaim in process
> context, to avoid any latency to guest operations. All that matters after
> boot is that khugepaged can do its job; it's not urgent. When things
> are urgent migrating guests to a new cloud node is always possible.
> 
> I'd like to clarify that this whole work has been done without ever making
> assumptions about virtual machines; I tried to make this as
> universally useful as possible (and not just because we want the exact
> same VM algorithms to trim one level of guest pagetables too, to get a
> cumulative boost fully exploiting the KVM design ;). I'm thrilled
> Chris is going to run a host-only test for databases and I'm surely
> willing to help with that.
> 
> Compacting everything that is "movable" is surely solvable from a
> theoretical standpoint and that includes all anonymous memory (huge or
> not) and all cache.

Page migration as it is handles these cases. It can't handle slab, page
table pages or some kernel allocations but anti-fragmentation does a
good job of grouping these allocations into the same 2M pages already -
particularly when min_free_kbytes is configured correctly.

> That alone accounts for a huge bulk of the total
> memory of a system, so being able to mix it all will result in the
> best behavior, which isn't possible to achieve with hugetlbfs (so if
> the memory isn't allocated as anonymous memory it can still be used as
> cache for I/O). So in the very worst case, if everything else fails on
> the producer front (again: not the case as far as I can tell!) what
> should be reserved at boot is an amount of memory to limit the
> unmovable parts there.

This latter part is currently possible with the kernelcore=X boot parameter
so that the unmovable parts are limited to X amount of memory.  It shouldn't
be necessary to do this, but it is possible. If it is found that it is
required, I'd hope to receive a bug report on it.

> And to leave the movable parts free to be
> allocated dynamically without limitations depending on the workloads.
> 
> I'm quite sure Mel will be able to provide more details on his work
> that has been reviewed in detail already on linux-mm with lots of
> positive feedback which is why I expect zero problems on that side too
> in real life (besides my theoretical standpoint in previous chapter ;).
> 

The details of what I have to say on compaction are covered in the compaction
leader http://lkml.org/lkml/2010/4/2/146 including allocation success rates
under severe compile-based load and data on allocation latencies.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  2:23                     ` Linus Torvalds
  2010-04-06  5:25                       ` Nick Piggin
  2010-04-06  9:08                       ` Ingo Molnar
@ 2010-04-06  9:55                       ` Avi Kivity
  2010-04-06  9:57                         ` Avi Kivity
  2010-04-06 11:55                         ` Avi Kivity
  2 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-06  9:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 05:23 AM, Linus Torvalds wrote:
>
> On Mon, 5 Apr 2010, Linus Torvalds wrote:
>    
>> So I thought it was a more interesting load than it was. The
>> virtualization "TLB miss is expensive" load I can't find it in myself to
>> care about. "Get a better CPU" is my answer to that one,
>>      
> [ Btw, I do realize that "better CPU" in this case may be "future CPU". I
>    just think that this is where better TLB's and using ASID's etc is
>    likely to be a much bigger deal than adding VM complexity. Kind of the
>    

For virtualization the tlb miss cost comes from two parts. First, there 
are the 24 memory accesses needed for a tlb fill (instead of the usual 
4); these can indeed be improved by various intermediate tlbs (and 
current processors already do have those caches).  However, something 
that cannot be solved by the tlb is the accesses to the last level of 
the page table hierarchy - as soon as the page tables exceed the cache 
size, you take two cache misses for each tlb miss.

Note virtualization only increases the hit; it also shows up with 
non-virtualized loads, but there your cache utilization is halved and 
you only need one memory access for your last level page table.

Here is a microbenchmark demonstrating the hit (non-virtualized); it 
simulates a pointer-chasing application with a varying working set.  It 
is easy to see when the working set overflows the various caches, and 
later when the page tables overflow the caches.  For virtualization the 
hit will be a factor of 3 instead of 2, and will come earlier since the 
page tables are bigger.

  size   4k (ns)    2M (ns)
    4k       4.9        4.9
   16k       4.9        4.9
   64k       7.6        7.6
  256k      15.1        8.1
    1M      28.5       23.9
    4M      31.8       25.3
   16M      94.8       79.0
   64M     260.9      224.2
  256M     269.8      248.8
    1G     278.1      246.3
    4G     330.9      252.6
   16G     436.3      243.8
   64G     486.0      253.3
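
A pointer-chaser of this shape can be built along the following lines (a
sketch under my assumptions, not the benchmark that produced the table;
Sattolo's shuffle makes the stored pointers one cycle through all slots, and
each load depends on the previous one so the latency cannot be hidden):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : (64UL << 20);
	size_t n = size / sizeof(void *), i, steps = 1UL << 25;
	void **slot = malloc(n * sizeof(void *));
	struct timeval before, after;

	if (!slot)
		return 1;
	for (i = 0; i < n; i++)
		slot[i] = &slot[i];
	/* Sattolo's algorithm: the values form one cycle visiting every slot */
	for (i = n - 1; i > 0; i--) {
		size_t j = rand() % i;
		void *tmp = slot[i];
		slot[i] = slot[j];
		slot[j] = tmp;
	}
	void **p = (void **)slot[0];
	gettimeofday(&before, NULL);
	for (i = 0; i < steps; i++)	/* the dependent-load chase */
		p = (void **)*p;
	gettimeofday(&after, NULL);
	printf("%zu bytes: %.1f ns/read (%p)\n", size,
	       ((after.tv_sec - before.tv_sec) * 1e9 +
		(after.tv_usec - before.tv_usec) * 1e3) / steps, (void *)p);
	return 0;
}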

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  9:55                       ` Avi Kivity
@ 2010-04-06  9:57                         ` Avi Kivity
  2010-04-06 11:55                         ` Avi Kivity
  1 sibling, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-06  9:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 12:55 PM, Avi Kivity wrote:
>
> Here is a microbenchmark demonstrating the hit (non-virtualized); it 
> simulates a pointer-chasing application with a varying working set.  
> It is easy to see when the working set overflows the various caches, 
> and later when the page tables overflow the caches.  For 
> virtualization the hit will be a factor of 3 instead of 2, and will 
> come earlier since the page tables are bigger.
>
>  size   4k (ns)    2M (ns)
>    4k       4.9        4.9
>   16k       4.9        4.9
>   64k       7.6        7.6
>  256k      15.1        8.1
>    1M      28.5       23.9
>    4M      31.8       25.3
>   16M      94.8       79.0
>   64M     260.9      224.2
>  256M     269.8      248.8
>    1G     278.1      246.3
>    4G     330.9      252.6
>   16G     436.3      243.8
>   64G     486.0      253.3
>


(latencies are for a single read access)

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  9:30               ` Mel Gorman
@ 2010-04-06 10:32                 ` Theodore Tso
  2010-04-06 11:16                   ` Mel Gorman
  0 siblings, 1 reply; 205+ messages in thread
From: Theodore Tso @ 2010-04-06 10:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


On Apr 6, 2010, at 5:30 AM, Mel Gorman wrote:
> 
> There is a good chance you could allocate a decent percentage of
> memory as huge pages but as you are unlikely to have run hugeadm
> --set-recommended-min_free_kbytes early in boot, it is also likely to trash
> heavily and the success rates will not be very impressive.

Can you explain how hugeadm --set-recommended-min_free_kbytes works and how it achieves this magic?  Or can you send me a pointer to how this works?  I've tried doing some Google searches, and I found the LWN article "Huge pages part 3: administration", but it doesn't go into a lot of detail about how increasing vm.min_free_kbytes helps the anti-fragmentation code.

Thanks,

-- Ted


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 10:32                 ` Theodore Tso
@ 2010-04-06 11:16                   ` Mel Gorman
  2010-04-06 13:13                     ` Theodore Tso
  0 siblings, 1 reply; 205+ messages in thread
From: Mel Gorman @ 2010-04-06 11:16 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 06:32:28AM -0400, Theodore Tso wrote:
> 
> On Apr 6, 2010, at 5:30 AM, Mel Gorman wrote:
> > 
> > There is a good chance you could allocate a decent percentage of
> > memory as huge pages but as you are unlikely to have run hugeadm
> > --set-recommended-min_free_kbytes early in boot, it is also likely to trash
> > heavily and the success rates will not be very impressive.
> 

> Can you explain how hugeadm --set-recommended-min_free_kbytes works and
> how it achieves this magic?  Or can you send me a pointer to how this works?
> I've tried doing some Google searches, and I found the LWN article "Huge
> pages part 3: administration", but it doesn't go into a lot of detail about
> how increasing vm.min_free_kbytes helps the anti-fragmentation code.

Sure, the details of how and why it works are spread all over the place.
It's fairly simple really and related to how anti-fragmentation does its work.

Anti-frag divides up a zone into "arenas" where an arena is usually the
default huge page size - 2M on x86-64, 16M on ppc64 etc. Its objective is to
keep UNMOVABLE, RECLAIMABLE and MOVABLE pages within the same arenas using
multiple free lists. If a page within the desired arena is not available, it
falls back to using one of the other arenas. A fallback is a "fragmentation
event" as traced by the mm_page_alloc_extfrag event. A severe event is when a
small page is used; a benign event is when a large page (e.g. 2M) is moved
to the desired list. It's benign because pages of the same "migrate type"
continue to be allocated within the same arena.

How often these "fragmentation events" occur depends on pages of the
desired type being always available. This in turn depends on free pages
being available which is easiest to control by min_free_kbytes and is where
--set-recommended-min_free_kbytes comes in. By keeping a number of pages free,
the probability of a page of the desired type being available increases.

As there are three migrate-types we currently care about from an anti-frag
perspective, the recommended min_free_kbytes value depends on the number of
zones in the system and on having 3 arenas' worth of pages kept free per
zone. Once set, there will, in most cases, be a page free of the required
type at allocation time. It can be observed in practice by tracing
mm_page_alloc_extfrag.

The next part of min_free_kbytes is related to the "reserve" blocks which
are only important to high-order atomic allocations. There is a maximum of
two reserve blocks per zone. For example, on a flat-memory system with one
grouping of memory, there would be a maximum of two reserve arenas. On a
NUMA system with two nodes, there would be a maximum of four. With multiple
groupings of memory such as 32-bit X86 with DMA, Normal and Highmem groups of
free-lists, there might be five reserve pageblocks, two each for the Normal
and HighMem groupings and just one for DMA as it is only 16MB worth of pages.

The final part of the recommended min_free_kbytes value is a sum of the
reserve arenas and the migrate-type arenas to ensure that pages of the
required type are free.

The function that works this out in libhugetlbfs is

long recommended_minfreekbytes(void)
{
        FILE *f;
        char buf[ZONEINFO_LINEBUF];
        int nr_zones = 0;
        long recommended_min;
        long pageblock_kbytes = kernel_default_hugepage_size() / 1024;

        /* Detect the number of zones in the system */
        f = fopen(PROCZONEINFO, "r");
        if (f == NULL) {
                WARNING("Unable to open " PROCZONEINFO);
                return 0;
        }
        while (fgets(buf, ZONEINFO_LINEBUF, f) != NULL) {
                if (strncmp(buf, "Node ", 5) == 0)
                        nr_zones++;
        }
        fclose(f);

        /* Make sure at least 2 pageblocks are free for MIGRATE_RESERVE */
        recommended_min = pageblock_kbytes * nr_zones * 2;

        /*
         * Make sure that on average at least two pageblocks are almost free
         * of another type, one for a migratetype to fall back to and a
         * second to avoid subsequent fallbacks of other types There are 3
         * MIGRATE_TYPES we care about.
         */
        recommended_min += pageblock_kbytes * nr_zones * 3 * 3;
        return recommended_min;
}
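
As a worked example (my arithmetic, not libhugetlbfs output): on an x86-64
machine with 2048kB pageblocks and three zones (say DMA, DMA32 and Normal on
flat memory), this comes to 2048 * 3 * 2 + 2048 * 3 * 3 * 3 = 67584 kB, i.e.
roughly 66MB kept free.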

Does this clarify why min_free_kbytes helps and why the "recommended"
value is what it is?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  8:30             ` Mel Gorman
@ 2010-04-06 11:35               ` Chris Mason
  0 siblings, 0 replies; 205+ messages in thread
From: Chris Mason @ 2010-04-06 11:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Avi Kivity, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 09:30:28AM +0100, Mel Gorman wrote:
> On Tue, Apr 06, 2010 at 12:18:24AM +0300, Avi Kivity wrote:
> > On 04/06/2010 12:01 AM, Chris Mason wrote:
> >> On Mon, Apr 05, 2010 at 01:32:21PM -0700, Linus Torvalds wrote:
> >>    
> >>>
> >>> On Mon, 5 Apr 2010, Pekka Enberg wrote:
> >>>      
> >>>> AFAIK, most modern GCs split memory in young and old generation
> >>>> "zones" and _copy_ surviving objects from the former to the latter if
> >>>> their lifetime exceeds some threshold. The JVM keeps scanning the
> >>>> smaller young generation very aggressively which causes TLB pressure
> >>>> and scans the larger old generation less often.
> >>>>        
> >>> .. my only input to this is: numbers talk, bullsh*t walks.
> >>>
> >>> I'm not interested in micro-benchmarks, either. I can show infinite TLB
> >>> walk improvement in a microbenchmark.
> >>>      
> >> Ok, I'll bite.  I should be able to get some database workloads with
> >> hugepages, transparent hugepages, and without any hugepages at all.
> >>    
> >
> > Please run them in conjunction with Mel Gorman's memory compaction,  
> > otherwise fragmentation may prevent huge pages from being instantiated.
> >
> 
> Strictly speaking, compaction is not necessary to allocate huge pages.
> What compaction gets you is
> 
>   o Lower latency and cost of huge page allocation
>   o Works on swapless systems
> 
> What is important is that you run
> hugeadm --set-recommended-min_free_kbytes
> from the libhugetlbfs 2.8 package early in boot so that
> anti-fragmentation is doing as good a job as possible.

Great, I'll make sure to do this.

> If one is very
> curious, use the mm_page_alloc_extfrag tracepoint to trace how often severe
> fragmentation-related events occur under default settings and with
> min_free_kbytes set properly.
> 
> Without the compaction patches, allocating huge pages will occasionally be
> *very* expensive as a large number of pages will need to be reclaimed.
> The most likely symptom is thrashing while the database starts up. Allocation
> success rates will also be lower when under heavy load.
> 
> Running make -j16 at the same time is unlikely to make much of a
> difference from a hugepage allocation point of view. The performance
> figures will vary significantly of course as make competes with the
> database for CPU time and other resources.

Heh, Linus did actually say to run them concurrently with make -j16, but
I read it as make -j16 before the database run.  My goal will be to
fragment the ram, then get a db in ram and see how fast it all goes.

Fragmenting memory during the run is only interesting to test compaction;
I'd throw out the resulting db benchmark numbers and only count the
number of transparent hugepages we were able to allocate.

> 
> Finally, benchmarking with databases is not new as such -
> http://lwn.net/Articles/378641/ . This was on fairly simple hardware
> though as I didn't have access to hardware more suitable for database
> workloads. If you are running with transparent huge pages though, be
> sure to double check that huge pages are actually being used
> transparently.

Will do.  It'll take me a few days to get the machines setup and a
baseline measurement.
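
A quick check along the lines shown earlier in the thread (sample values from
Andrea's kernel-build run above; AnonHugePages should account for most of the
anonymous memory when the transparent path is working):

  $ grep Anon /proc/meminfo
  AnonPages:        121300 kB
  AnonHugePages:   1007616 kB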

-chris


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  9:55                       ` Avi Kivity
  2010-04-06  9:57                         ` Avi Kivity
@ 2010-04-06 11:55                         ` Avi Kivity
  2010-04-06 13:10                           ` Nick Piggin
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 11:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 12:55 PM, Avi Kivity wrote:
>
> Here is a microbenchmark demonstrating the hit (non-virtualized); it 
> simulates a pointer-chasing application with a varying working set.  
> It is easy to see when the working set overflows the various caches, 
> and later when the page tables overflow the caches.  For 
> virtualization the hit will be a factor of 3 instead of 2, and will 
> come earlier since the page tables are bigger.
>

And here is the same thing with guest latencies as well:

Random memory read latency, in nanoseconds, according to working
set and page size.


        ------- host ------  ------------- guest -----------
                             --- hpage=4k ---  -- hpage=2M -

  size        4k         2M     4k/4k   2M/4k   4k/2M  2M/2M
    4k       4.9        4.9       5.0     4.9     4.9    4.9
   16k       4.9        4.9       5.0     4.9     5.0    4.9
   64k       7.6        7.6       7.9     7.8     7.8    7.8
  256k      15.1        8.1      15.9    10.3    15.4    9.0
    1M      28.5       23.9      29.3    37.9    29.3   24.6
    4M      31.8       25.3      37.5    42.6    35.5   26.0
   16M      94.8       79.0     110.7   107.3    92.0   77.3
   64M     260.9      224.2     294.2   247.8   251.5  207.2
  256M     269.8      248.8     313.9   253.1   260.1  230.3
    1G     278.1      246.3     331.8   273.0   269.9  236.7
    4G     330.9      252.6     545.6   346.0   341.6  256.5
   16G     436.3      243.8     705.2   458.3   463.9  268.8
   64G     486.0      253.3     767.3   532.5   516.9  274.7


It's easy to see how cache effects dominate the tlb walk.  The only way 
hardware can reduce this is by increasing cache sizes dramatically.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 11:55                         ` Avi Kivity
@ 2010-04-06 13:10                           ` Nick Piggin
  2010-04-06 13:22                             ` Avi Kivity
                                               ` (2 more replies)
  0 siblings, 3 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-06 13:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrea Arcangeli, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 02:55:32PM +0300, Avi Kivity wrote:
> On 04/06/2010 12:55 PM, Avi Kivity wrote:
> >
> >Here is a microbenchmark demonstrating the hit (non-virtualized);
> >it simulates a pointer-chasing application with a varying working
> >set.  It is easy to see when the working set overflows the various
> >caches, and later when the page tables overflow the caches.  For
> >virtualization the hit will be a factor of 3 instead of 2, and
> >will come earlier since the page tables are bigger.
> >
> 
> And here is the same thing with guest latencies as well:
> 
> Random memory read latency, in nanoseconds, according to working
> set and page size.
> 
> 
>        ------- host ------  ------------- guest -----------
>                             --- hpage=4k ---  -- hpage=2M -
> 
>  size        4k         2M     4k/4k   2M/4k   4k/2M  2M/2M
>    4k       4.9        4.9       5.0     4.9     4.9    4.9
>   16k       4.9        4.9       5.0     4.9     5.0    4.9
>   64k       7.6        7.6       7.9     7.8     7.8    7.8
>  256k      15.1        8.1      15.9    10.3    15.4    9.0
>    1M      28.5       23.9      29.3    37.9    29.3   24.6
>    4M      31.8       25.3      37.5    42.6    35.5   26.0
>   16M      94.8       79.0     110.7   107.3    92.0   77.3
>   64M     260.9      224.2     294.2   247.8   251.5  207.2
>  256M     269.8      248.8     313.9   253.1   260.1  230.3
>    1G     278.1      246.3     331.8   273.0   269.9  236.7
>    4G     330.9      252.6     545.6   346.0   341.6  256.5
>   16G     436.3      243.8     705.2   458.3   463.9  268.8
>   64G     486.0      253.3     767.3   532.5   516.9  274.7
> 
> 
> It's easy to see how cache effects dominate the tlb walk.  The only
> way hardware can reduce this is by increasing cache sizes
> dramatically.

Well this is the best attainable speedup in a corner case where the
whole memory hierarchy is being actively defeated. The numbers are
not surprising. Actual workloads are infinitely more useful. And in
most cases, quite possibly hardware improvements like ASIDs will
be more useful.

I don't really agree with how the virtualization problem is characterised.
Xen's way of doing memory virtualization maps directly to normal
hardware page tables so there doesn't seem to be a fundamental
requirement for more memory accesses.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 11:16                   ` Mel Gorman
@ 2010-04-06 13:13                     ` Theodore Tso
  2010-04-06 14:55                       ` Mel Gorman
  2010-04-06 16:46                       ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Theodore Tso @ 2010-04-06 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote:

> 
> Does this clarify why min_free_kbytes helps and why the "recommended"
> value is what it is?

Thanks, this is really helpful.  I wonder if it might be a good idea to have a boot command-line option which automatically sets vm.min_free_kbytes to the right value?  Most administrators who are used to using hugepages are most familiar with needing to set boot command-line options, and this way they won't need to try to find this new userspace utility.  I was looking for hugeadm on Ubuntu, for example, and I couldn't find it.
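
In the meantime, one sketch of doing it without hugeadm is a plain sysctl
(the value is taken from the worked example above; adjust for your zone count):

  # echo 'vm.min_free_kbytes = 67584' >> /etc/sysctl.conf
  # sysctl -p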

Regards,

-- Ted


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:10                           ` Nick Piggin
@ 2010-04-06 13:22                             ` Avi Kivity
  2010-04-06 13:45                               ` Nick Piggin
  2010-04-06 14:44                             ` Rik van Riel
  2010-04-06 16:43                             ` Andrea Arcangeli
  2 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 13:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrea Arcangeli, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 04:10 PM, Nick Piggin wrote:
> On Tue, Apr 06, 2010 at 02:55:32PM +0300, Avi Kivity wrote:
>    
>> On 04/06/2010 12:55 PM, Avi Kivity wrote:
>>      
>>> Here is a microbenchmark demonstrating the hit (non-virtualized);
>>> it simulates a pointer-chasing application with a varying working
>>> set.  It is easy to see when the working set overflows the various
>>> caches, and later when the page tables overflow the caches.  For
>>> virtualization the hit will be a factor of 3 instead of 2, and
>>> will come earlier since the page tables are bigger.
>>>
>>>        
>> And here is the same thing with guest latencies as well:
>>
>> Random memory read latency, in nanoseconds, according to working
>> set and page size.
>>
>>
>>         ------- host ------  ------------- guest -----------
>>                              --- hpage=4k ---  -- hpage=2M -
>>
>>   size        4k         2M     4k/4k   2M/4k   4k/2M  2M/2M
>>     4k       4.9        4.9       5.0     4.9     4.9    4.9
>>    16k       4.9        4.9       5.0     4.9     5.0    4.9
>>    64k       7.6        7.6       7.9     7.8     7.8    7.8
>>   256k      15.1        8.1      15.9    10.3    15.4    9.0
>>     1M      28.5       23.9      29.3    37.9    29.3   24.6
>>     4M      31.8       25.3      37.5    42.6    35.5   26.0
>>    16M      94.8       79.0     110.7   107.3    92.0   77.3
>>    64M     260.9      224.2     294.2   247.8   251.5  207.2
>>   256M     269.8      248.8     313.9   253.1   260.1  230.3
>>     1G     278.1      246.3     331.8   273.0   269.9  236.7
>>     4G     330.9      252.6     545.6   346.0   341.6  256.5
>>    16G     436.3      243.8     705.2   458.3   463.9  268.8
>>    64G     486.0      253.3     767.3   532.5   516.9  274.7
>>
>>
>> It's easy to see how cache effects dominate the tlb walk.  The only
>> way hardware can reduce this is by increasing cache sizes
>> dramatically.
>>      
> Well this is the best attainable speedup in a corner case where the
> whole memory hierarchy is being actively defeated. The numbers are
> not surprising.

Of course this shows the absolute worst case and will never show up 
directly in any real workload.  The point wasn't that we expect a 3x 
speedup from large pages (far from it), but to show the problem is due 
to page tables overflowing the cache, not to any miss handler 
inefficiency.  It also shows that virtualization only increases the 
impact, but isn't the direct cause.  The real problem is large active 
working sets.

> Actual workloads are infinitely more useful. And in
> most cases, quite possibly hardware improvements like asids will
> be more useful.
>    

This already has ASIDs for the guest; and for the host they wouldn't 
help much since there's only one process running.  I don't see how 
hardware improvements can drastically change the numbers above; it's 
clear that for the 4k case the host takes a cache miss for the pte, and 
twice for the 4k/4k guest case.

> I don't really agree with how virtualization problem is characterised.
> Xen's way of doing memory virtualization maps directly to normal
> hardware page tables so there doesn't seem like a fundamental
> requirement for more memory accesses.
>    

The Xen pv case only works for modified guests (so no Windows), and 
doesn't support host memory management like swapping or ksm.  Xen hvm 
(which runs unmodified guests) has the same problems as kvm.

Note kvm can use a single layer of translation (and does on older 
hardware), so it would behave like the host, but that increases the cost 
of pte updates dramatically.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:22                             ` Avi Kivity
@ 2010-04-06 13:45                               ` Nick Piggin
  2010-04-06 13:57                                 ` Avi Kivity
                                                   ` (2 more replies)
  0 siblings, 3 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-06 13:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Andrea Arcangeli, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 04:22:37PM +0300, Avi Kivity wrote:
> On 04/06/2010 04:10 PM, Nick Piggin wrote:
> >Actual workloads are infinitely more useful. And in
> >most cases, quite possibly hardware improvements like asids will
> >be more useful.
> 
> This already has ASIDs for the guest; and for the host they wouldn't
> help much since there's only one process running.

I didn't realize these improvements were directed completely at the
virtualized case.


>  I don't see how
> hardware improvements can drastically change the numbers above, it's
> clear that for the 4k case the host takes a cache miss for the pte,
> and twice for the 4k/4k guest case.

It's because you're missing the point. You're taking the most
unrealistic and pessimal cases and then showing that they have fundamental
problems. Speedups like Linus is talking about would refer to ways to
speed up actual workloads, not ways to avoid fundamental limitations.

Prefetching, memory parallelism, caches. It's worked for 25 years :)

 
> >I don't really agree with how virtualization problem is characterised.
> >Xen's way of doing memory virtualization maps directly to normal
> >hardware page tables so there doesn't seem like a fundamental
> >requirement for more memory accesses.
> 
> The Xen pv case only works for modified guests (so no Windows), and
> doesn't support host memory management like swapping or ksm.  Xen
> hvm (which runs unmodified guests) has the same problems as kvm.
> 
> Note kvm can use a single layer of translation (and does on older
> hardware), so it would behave like the host, but that increases the
> cost of pte updates dramatically.

So it is fundamentally possible.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:45                               ` Nick Piggin
@ 2010-04-06 13:57                                 ` Avi Kivity
  2010-04-06 16:50                                 ` Andrea Arcangeli
  2010-04-06 18:47                                 ` Avi Kivity
  2 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 13:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrea Arcangeli, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 04:45 PM, Nick Piggin wrote:
> On Tue, Apr 06, 2010 at 04:22:37PM +0300, Avi Kivity wrote:
>    
>> On 04/06/2010 04:10 PM, Nick Piggin wrote:
>>      
>>> Actual workloads are infinitely more useful. And in
>>> most cases, quite possibly hardware improvements like asids will
>>> be more useful.
>>>        
>> This already has ASIDs for the guest; and for the host they wouldn't
>> help much since there's only one process running.
>>      
> I didn't realize these improvements were directed completely at the
> virtualized case.
>    

I've read somewhere that future x86 will get non-virtualization ASIDs, 
but for now they exist only on the virtualization side.  They've been 
present for the virtualized case for a few years now on AMD, and were 
introduced recently (with Nehalem) on Intel (known there as VPIDs).

>>   I don't see how
>> hardware improvements can drastically change the numbers above, it's
>> clear that for the 4k case the host takes a cache miss for the pte,
>> and twice for the 4k/4k guest case.
>>      
> It's because you're missing the point. You're taking the most
> unrealistic and pessimal cases and then showing that they have fundamental
> problems.

That's just a demonstration.  Again, I don't expect 3x speedups from 
large pages.

> Speedups like Linus is talking about would refer to ways to
> speed up actual workloads, not ways to avoid fundamental limitations.
>
> Prefetching, memory parallelism, caches. It's worked for 25 years :)
>    

Prefetching and memory parallelism are defeated by pointer chasing, 
which many workloads do.  It's no accident that Java is a large 
beneficiary of large pages, since Java programs consist of lots of small 
objects scattered around in memory.

Caches don't scale as fast as memory, and are shared with data and other 
cores anyway.

If you have 200 ns of honest work per pointer dereference, then a 64GB 
working set will still see 300 ns stalls with 4k pages vs 50 ns with 
large pages (both non-virtualized), i.e. about 500 ns vs 250 ns per 
object, a 2x gap.  And 200 ns is quite a bit of work per object.


>>> I don't really agree with how virtualization problem is characterised.
>>> Xen's way of doing memory virtualization maps directly to normal
>>> hardware page tables so there doesn't seem like a fundamental
>>> requirement for more memory accesses.
>>>        
>> The Xen pv case only works for modified guests (so no Windows), and
>> doesn't support host memory management like swapping or ksm.  Xen
>> hvm (which runs unmodified guests) has the same problems as kvm.
>>
>> Note kvm can use a single layer of translation (and does on older
>> hardware), so it would behave like the host, but that increases the
>> cost of pte updates dramatically.
>>      
> So it is fundamentally possible.
>    

The costs are much bigger than the gain, especially when scaling the 
number of vcpus.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:10                           ` Nick Piggin
  2010-04-06 13:22                             ` Avi Kivity
@ 2010-04-06 14:44                             ` Rik van Riel
  2010-04-06 16:43                             ` Andrea Arcangeli
  2 siblings, 0 replies; 205+ messages in thread
From: Rik van Riel @ 2010-04-06 14:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Linus Torvalds, Andrea Arcangeli, Pekka Enberg,
	Ingo Molnar, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 09:10 AM, Nick Piggin wrote:

> I don't really agree with how virtualization problem is characterised.
> Xen's way of doing memory virtualization maps directly to normal
> hardware page tables so there doesn't seem like a fundamental
> requirement for more memory accesses.

Xen also uses nested paging wherever possible, because shadow
page tables are even slower than nested page tables.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:13                     ` Theodore Tso
@ 2010-04-06 14:55                       ` Mel Gorman
  2010-04-06 16:46                       ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Mel Gorman @ 2010-04-06 14:55 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 09:13:20AM -0400, Theodore Tso wrote:
> 
> On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote:
> 
> > 
> > Does this clarify why min_free_kbytes helps and why the "recommended"
> > value is what it is?
> 
> Thanks, this is really helpful. I wonder if it might be a good idea to
> have a boot command-line option which automatically sets vm.min_free_kbytes
> to the right value? 

I considered automatically adjusting it the first time huge pages are used,
as well as a command-line option or even a magic value written to /proc.
Each option is trivial to implement; I just haven't gotten around to doing
it. There was less pressure once the tool existed.
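
(As an illustration of how small such an auto-tune would be; this is
not the set_recommended_min_free_kbytes from Andrea's patchset, and the
two-pageblocks-per-populated-zone sizing below is a made-up
placeholder, shown only for the late_initcall shape:)

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/init.h>

extern int min_free_kbytes;		/* defined in mm/page_alloc.c */

static int __init sketch_min_free_kbytes(void)
{
	struct zone *zone;
	unsigned long recommended_kb = 0;

	/* keep roughly two pageblocks (2 x 2M on x86-64) free per zone */
	for_each_populated_zone(zone)
		recommended_kb += 2 * pageblock_nr_pages << (PAGE_SHIFT - 10);

	if (recommended_kb > min_free_kbytes) {
		min_free_kbytes = recommended_kb;
		setup_per_zone_wmarks();	/* recompute the zone watermarks */
	}
	return 0;
}
late_initcall(sketch_min_free_kbytes);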

> Most administrators who are used to using hugepages,
> are most familiar with needing to set boot command-line options, and this way
> they won't need to try to find this new userspace utility. 

The utility covers a host of other use cases as well, e.g. it creates
mount points, sets quotas, sizes pools (both static and dynamic),
reports on the current state of the system, can auto-tune shmem
settings, etc.

> I was looking
> for hugeadm on Ubuntu, for example, and I couldn't find it.

It's relatively recent and there isn't Debian packaging for it (an old
one was sent to debian-mentors once upon a time but was never finished).
It's on the TODO list of infinite woe to finish that packaging and push
it through Debian so it ends up in Ubuntu eventually.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:10                           ` Nick Piggin
  2010-04-06 13:22                             ` Avi Kivity
  2010-04-06 14:44                             ` Rik van Riel
@ 2010-04-06 16:43                             ` Andrea Arcangeli
  2 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-06 16:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Nick,

On Tue, Apr 06, 2010 at 11:10:24PM +1000, Nick Piggin wrote:
> most cases, quite possibly hardware improvements like asids will
> be more useful.

ASIDs already exist; they do nothing to prevent a vmexit for every
tlb flush or, alternatively, for every guest pagetable update.

In short, NPT/EPT is to ASID what x86-64 is to PAE, not the other way
around. It simplifies things and speeds up server workloads
tremendously. ASIDs, if you want them, are something the guest OS has
to manage, or regular linux on the host, regardless of whether
virtualization is on or off.

Anyway hugetlbfs existed in linux long before virtualization, so I
guess we should keep the virtualization talk aside for now to make
everyone happy. I already said once in this thread that this whole
work has been done in a way that is not specific to virtualization, so
let's focus on applications that have a larger working set than
gcc/vi/make/git. Somebody should explain why exactly hugetlbfs is
included in the 2.6.34 kernel if the tlb miss cost doesn't matter, and
why so much work keeps going in the hugetlbfs direction, including the
1g page size; java runs on hugetlbfs, oracle runs on hugetlbfs,
etc... tons of apps are using libhugetlbfs, and hugetlbfs is growing
into its own VM that eventually will even be able to swap on its own.

> I don't really agree with how virtualization problem is characterised.
> Xen's way of doing memory virtualization maps directly to normal
> hardware page tables so there doesn't seem like a fundamental
> requirement for more memory accesses.

Xen also takes advantage of NPT/EPT, and when it does it surely has the
same hardware runtime cost as KVM without hugepages, unless Xen or the
guest or both are using hugepages somewhere and trimming the pte level
from the shadow or guest pagetables.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:13                     ` Theodore Tso
  2010-04-06 14:55                       ` Mel Gorman
@ 2010-04-06 16:46                       ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-06 16:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Mel Gorman, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 09:13:20AM -0400, Theodore Ts'o wrote:
> 
> On Apr 6, 2010, at 7:16 AM, Mel Gorman wrote:
> 
> > 
> > Does this clarify why min_free_kbytes helps and why the "recommended"
> > value is what it is?
> 
> Thanks, this is really helpful.  I wonder if it might be a good idea to
> have a boot command-line option which automatically sets vm.min_free_kbytes
> to the right value?  Most administrators who are used to using hugepages,
> are most familiar with needing to set boot command-line options, and this
> way they won't need to try to find this new userspace utility.  I was
> looking for hugeadm on Ubuntu, for example, and I couldn't find it.

It's part of libhugetlbfs. I also suggested in an earlier email that
this would be better done as
"echo 1 >/sys/kernel/vm/set-recommended-min_free_kbytes" or as
set-recommended-min_free_kbytes=1 at boot, considering it's a 10-line
piece of code that does the math to set it. But it's no big deal on my
side; the important thing is that we have the feature.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:45                               ` Nick Piggin
  2010-04-06 13:57                                 ` Avi Kivity
@ 2010-04-06 16:50                                 ` Andrea Arcangeli
  2010-04-06 17:31                                   ` Avi Kivity
  2010-04-06 18:47                                 ` Avi Kivity
  2 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-06 16:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote:
> problems. Speedups like Linus is talking about would refer to ways to
> speed up actual workloads, not ways to avoid fundamental limitations.
> 
> Prefetching, memory parallelism, caches. It's worked for 25 years :)

This will always give you, worst case, an additional 6% on top of all
other speedups of the actual workloads (gcc is a definitive worst
case); for server loads the boost is more likely >=15%. Not running
this is plain underclocking your CPU.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 16:50                                 ` Andrea Arcangeli
@ 2010-04-06 17:31                                   ` Avi Kivity
  2010-04-06 18:00                                     ` Christoph Lameter
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 17:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Linus Torvalds, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 07:50 PM, Andrea Arcangeli wrote:
> On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote:
>    
>> problems. Speedups like Linus is talking about would refer to ways to
>> speed up actual workloads, not ways to avoid fundamental limitations.
>>
>> Prefetching, memory parallelism, caches. It's worked for 25 years :)
>>      
> This will always give you, worst case, an additional 6% on top of all
> other speedups of the actual workloads (gcc is a definitive worst
> case); for server loads the boost is more likely >=15%. Not running
> this is plain underclocking your CPU.
>    

I don't think gcc is worst case.  Workloads that benefit from large 
pages are those with bloated working sets that do a lot of pointer 
chasing and do little computation in between.  gcc fits two out of three 
(just a partial score on the first).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 17:31                                   ` Avi Kivity
@ 2010-04-06 18:00                                     ` Christoph Lameter
  2010-04-06 18:04                                       ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Christoph Lameter @ 2010-04-06 18:00 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, Nick Piggin, Linus Torvalds, Pekka Enberg,
	Ingo Molnar, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, 6 Apr 2010, Avi Kivity wrote:

> On 04/06/2010 07:50 PM, Andrea Arcangeli wrote:
> > On Tue, Apr 06, 2010 at 11:45:39PM +1000, Nick Piggin wrote:
> >
> > > problems. Speedups like Linus is talking about would refer to ways to
> > > speed up actual workloads, not ways to avoid fundamental limitations.
> > >
> > > Prefetching, memory parallelism, caches. It's worked for 25 years :)
> > >
> > This will always give you, worst case, an additional 6% on top of all
> > other speedups of the actual workloads (gcc is a definitive worst
> > case); for server loads the boost is more likely >=15%. Not running
> > this is plain underclocking your CPU.
> >
>
> I don't think gcc is worst case.  Workloads that benefit from large pages are
> those with bloated working sets that do a lot of pointer chasing and do little
> computation in between.  gcc fits two out of three (just a partial score on
> the first).

Once you have huge pages you will likely start to optimize for locality.

Pointer chasing is bad even with huge pages if you go between multiple
huge pages and you are beyond the number of huge tlb entries supported by
the cpu.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 18:00                                     ` Christoph Lameter
@ 2010-04-06 18:04                                       ` Avi Kivity
  0 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 18:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Nick Piggin, Linus Torvalds, Pekka Enberg,
	Ingo Molnar, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 09:00 PM, Christoph Lameter wrote:
>
>> I don't think gcc is worst case.  Workloads that benefit from large pages are
>> those with bloated working sets that do a lot of pointer chasing and do little
>> computation in between.  gcc fits two out of three (just a partial score on
>> the first).
>>      
> Once you have huge pages you will likely start to optimize for locality.
>
> Pointer chasing is bad even with huge pages if you go between multiple
> huge pages and you are beyond the number of huge tlb entries supported by
> the cpu.
>    

A huge-page tlb miss is serviced from the L2 or L3 cache.  A small-page 
tlb miss is serviced from main memory.  The miss rate is important, but 
not nearly as important as fill latency.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06 13:45                               ` Nick Piggin
  2010-04-06 13:57                                 ` Avi Kivity
  2010-04-06 16:50                                 ` Andrea Arcangeli
@ 2010-04-06 18:47                                 ` Avi Kivity
  2 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-06 18:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrea Arcangeli, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/06/2010 04:45 PM, Nick Piggin wrote:
>
>>   I don't see how
>> hardware improvements can drastically change the numbers above, it's
>> clear that for the 4k case the host takes a cache miss for the pte,
>> and twice for the 4k/4k guest case.
>>      
> It's because you're missing the point. You're taking the most
> unrealistic and pessimal cases and then showing that it has fundamental
> problems. Speedups like Linus is talking about would refer to ways to
> speed up actual workloads, not ways to avoid fundamental limitations.
>
> Prefetching, memory parallelism, caches. It's worked for 25 years :)
>    

btw, a workload that's known to benefit greatly from large pages is the 
kernel itself.  It's very pointer-chasey and has a large working set 
(the whole of memory, in fact).  But once you run it in a guest you've 
turned it into the 2M/4k case in the table which is basically a slightly 
slower version of host 4k pages.

So, if we want good support for kernel intensive workloads in guests, or 
kernel-like workloads in the host (or kernel-like workloads in guest 
userspace), then we need good large page support.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-06  9:08                       ` Ingo Molnar
  2010-04-06  9:13                         ` Ingo Molnar
@ 2010-04-10 18:47                         ` Andrea Arcangeli
  2010-04-10 19:02                           ` Ingo Molnar
  2010-04-12 14:24                           ` Christoph Lameter
  1 sibling, 2 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-10 18:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Ingo,

On Tue, Apr 06, 2010 at 11:08:13AM +0200, Ingo Molnar wrote:
> The goal of Andrea's and Mel's patch-set, to make this 'final performance 
> boost' more practical seems like a valid technical goal.

The integration in my current git tree (#19+):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
(to update later: git fetch; git checkout -f origin/master)

is working great and runs rock solid after the last integration bugfix
in migrate.c, enjoy! ;)

This is on my workstation, after building a ton of packages (including
javac binaries and all sorts of other random stuff), lots of kernels,
mutt on large maildir folders, and running lots of ebuilds, which are
super heavy in vfs terms.

# free
             total       used       free     shared    buffers     cached
Mem:       3923408    2536380    1387028          0     482656    1194228
-/+ buffers/cache:     859496    3063912
Swap:      4200960        788    4200172
# uptime
 20:09:50 up 1 day, 13:19, 11 users,  load average: 0.00, 0.00, 0.00
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32  10402  32864  10477   3729   2154   1156    471    136     22     50     41
Node 0, zone   Normal    196    155     40     21     16      7      4      1      0      2      0
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.992
Node 0, zone      DMA 0.000 0.001 0.002 0.005 0.009 0.017 0.017 0.033 0.033 0.097 0.226
Node 0, zone    DMA32 0.000 0.030 0.223 0.347 0.434 0.536 0.644 0.733 0.784 0.801 0.876
Node 0, zone   Normal 0.000 0.072 0.185 0.244 0.306 0.400 0.482 0.576 0.623 0.623 1.000
# time echo 3 > /proc/sys/vm/drop_caches

real    0m0.989s
user    0m0.000s
sys     0m0.984s
# time echo > /proc/sys/vm/compact_memory

real    0m0.195s
user    0m0.000s
sys     0m0.124s
# cat /proc/buddyinfo /proc/extfrag_index /proc/unusable_index
Node 0, zone      DMA      4      2      3      2      2      0      1      0      1      1      3
Node 0, zone    DMA32   1632   1444   1336   1065    748    449    229    128     59     50    685
Node 0, zone   Normal   1046    783    552    367    261    176    116     82     50     43     15
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone      DMA 0.000 0.001 0.002 0.005 0.009 0.017 0.017 0.033 0.033 0.097 0.226
Node 0, zone    DMA32 0.000 0.001 0.005 0.012 0.022 0.037 0.054 0.072 0.092 0.111 0.142
Node 0, zone   Normal 0.000 0.012 0.030 0.056 0.090 0.139 0.205 0.291 0.414 0.563 0.820
# free  
             total       used       free     shared    buffers     cached
Mem:       3923408     295240    3628168          0       4636      23192
-/+ buffers/cache:     267412    3655996
Swap:      4200960        788    4200172
# grep Anon /proc/meminfo 
AnonPages:        210472 kB
AnonHugePages:    102400 kB

(AnonPages now includes AnonHugePages for backwards compatibility;
sorry about not having done it earlier. So ~50% of the anon ram is in
hugepages.)

MB of free hugepage-sized blocks before drop_caches+compact_memory
(the last two buddyinfo columns above are order-9 and order-10 free
blocks, i.e. 2MB and 4MB apiece):

>>> (41)*4+(52)*2
268

MB of free hugepage-sized blocks after drop_caches+compact_memory:

>>> (685+15)*4+(50+43)*2
2986

Total ram free: 3543 MB. 84% of the RAM is unaffected by unmovable
stuff after a huge vfs slab load running for about 2 days.

On the laptop I got a huge swap storm that killed kdeinit4 with the oom
killer while I was away (I found myself back at the kdm4 login when I
got back). That presumably split all hugepages, yet after a while I got
all the hugepages back:

# grep Anon /proc/meminfo 
AnonPages:        767680 kB
AnonHugePages:    395264 kB
# uptime
 20:33:33 up 1 day, 13:45,  9 users,  load average: 0.00, 0.00, 0.00
# dmesg|grep kill
Out of memory: kill process 8869 (kdeinit4) score 320362 or a child
# 


(50% of ram in hugepages and 400M more of hugepages immediately
available after invoking drop_caches/compact_memory manually with
the two sysctls)

And if this isn't enough, kernelcore= can also provide an even stronger
guarantee: it prevents unmovable stuff from spilling over, by forcing
freeable slab to start shrinking before it's too late.

The drop caches would be run by try_to_free_pages internally, which is
interleaved with the try_to_compact_pages calls of course; the above
just shows the full potential of set_recommended_min_free_kbytes (run
in-kernel automatically at late_initcall unless you boot with
transparent_hugepage=0) and memory compaction, on top of the already
compound-aware try_to_free_pages (in addition to the movable/unmovable
order fallback of set_recommended_min_free_kbytes). And all without
using kernelcore=, while allowing ebuild and other heavy unmovable
slab users to grow as much as they want, with only 3G of ram.

The sluggishness of invoking alloc_pages with __GFP_WAIT from hugepage
page faults (synchronously in direct reclaim) has also completely gone
away, after I tracked it down to lumpy reclaim, which I simply nuked.

This is already fully usable and works great; as Avi showed, it boosts
even a sort on the host by 6% (think about HPC applications). Soon I
hope to boost gcc on the host by 6% too (and by >15% in guest with
NPT/EPT) by extending vm_end in 2M chunks in glibc, at least for those
huge gcc builds taking >200M like translate.o of qemu-kvm... (so I
hope gcc running on a KVM guest, thanks to EPT/NPT, will soon run
faster than on a mainline kernel without transparent hugepages on bare
metal).

Now I'll add numa awareness by adding alloc_pages_vma and make a #20
release; that is the last relevant bit... Then we may want to extend
smaps to show hugepages per process instead of only globally in
/proc/meminfo.

The only tuning I might recommend to people benchmarking on top of
current aa.git, is to compare the workloads with:

echo always >/sys/kernel/mm/transparent_hugepage/defrag # default setting at boot
echo never >/sys/kernel/mm/transparent_hugepage/defrag

And also to speed up khugepaged by decreasing
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
(that works around vm_end not being extended in 2M chunks).

There's also one sysctl called /proc/sys/vm/extfrag_threshold that
allows tuning memory compaction aggressiveness, but I wouldn't twiddle
with it; supposedly it'll go away and be replaced by a future
exponential-backoff based logic that interleaves the
try_to_compact_pages/try_to_free_pages calls optimally and more
dynamically than the sysctl (under discussion on linux-mm). But it's
not a huge priority at the moment: it already works great like this,
it absolutely never becomes sluggish, and it's always responsive since
I nuked lumpy reclaim. The half-jiffy average wait time is definitely
not necessary, and it would be lost in the noise compared to
addressing the major problem we had in calling try_to_free_pages with
order = 9 and __GFP_WAIT.
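
(Purely to illustrate the backoff idea just mentioned; nothing like
this exists in the tree yet and every name below is invented:)

#include <linux/jiffies.h>
#include <linux/types.h>

static unsigned int compact_defer_shift;	/* grows on every failure */
static unsigned long compact_resume_jiffies;	/* no compaction before this */

static bool compaction_worth_trying(void)
{
	return time_after(jiffies, compact_resume_jiffies);
}

static void note_compaction_failure(void)
{
	if (compact_defer_shift < 6)		/* cap the backoff at 64*HZ */
		compact_defer_shift++;
	compact_resume_jiffies = jiffies + (HZ << compact_defer_shift);
}

static void note_compaction_success(void)
{
	compact_defer_shift = 0;
	compact_resume_jiffies = jiffies;	/* try again right away */
}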

> In fact the whole maintenance thought process seems somewhat similar to the 
> TSO situation: the networking folks first rejected TSO based on complexity 
> arguments, but then was embraced after some time.

Full agreement! I think everyone wants transparent hugepages. The only
complaint I ever heard so far is from Christoph, who has a slight
preference for not introducing split_huge_page and going fully
hugepage everywhere, with native gup support immediately, where GUP
only returns head pages and every caller has to check PageTransHuge on
them to see whether they're huge or not. Changing several hundred
drivers in one go, with native swapping on hugepage-backed swapcache
immediately (which means the pagecache also has to deal with hugepages
immediately), is possible too, but I think this more gradual approach
is easier to keep under control; Rome wasn't built in a day. Surely at
a later stage I want at least tmpfs backed by hugepages too. And maybe
pagecache, but it doesn't need to happen immediately. Also we've to
keep in mind that for huge systems PAGE_SIZE should eventually become
2M, and those will be able to take advantage of transparent hugepages
for the 1G pud_trans_huge, which will make HPC even faster. Anyway
nothing prevents taking Christoph's long term direction later, even
when starting self contained.

To me what is relevant is that everyone in the VM camp seems to want
transparent hugepages in some shape or form, because of the roughly
linear speedup they provide to everything running on them on bare
metal (and a more than linear cumulative speedup in the case of nested
pagetables, for obvious reasons), no matter what the design is.

Thanks,
Andrea


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 18:47                         ` Andrea Arcangeli
@ 2010-04-10 19:02                           ` Ingo Molnar
  2010-04-10 19:22                             ` Avi Kivity
  2010-04-12 14:24                           ` Christoph Lameter
  1 sibling, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-10 19:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> [...]
> 
> This is already fully usable and works great; as Avi showed, it boosts 
> even a sort on the host by 6% (think about HPC applications). Soon I 
> hope to boost gcc on the host by 6% too (and by >15% in guest with 
> NPT/EPT) by extending vm_end in 2M chunks in glibc, at least for those 
> huge gcc builds taking >200M like translate.o of qemu-kvm... (so I 
> hope gcc running on a KVM guest, thanks to EPT/NPT, will soon run 
> faster than on a mainline kernel without transparent hugepages on bare 
> metal).

I think what would be needed is some non-virtualization speedup example of a 
'non-special' workload, running on the native/host kernel. 'sort' is an 
interesting usecase - could it be patched to use hugepages if it has to sort 
through lots of data?

Is it practical to run something like a plain make -jN kernel compile all in 
hugepages, and see a small but measurable speedup?

Although it's not an ideal workload for computational speedups at all because 
a lot of the time we spend in a kernel build is really buildup/teardown of 
process state/context and similar 'administrative' overhead, while the true 
'compilation work' is just a burst of a few dozen milliseconds and then we 
tear down all the state again. (It's very inefficient really.)

Something like GIMP calculations would be a lot more representative of the 
speedup potential. Is it possible to run the GIMP with transparent hugepages 
enabled for it?

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 19:02                           ` Ingo Molnar
@ 2010-04-10 19:22                             ` Avi Kivity
  2010-04-10 19:47                               ` Ingo Molnar
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-10 19:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/10/2010 10:02 PM, Ingo Molnar wrote:
> * Andrea Arcangeli<aarcange@redhat.com>  wrote:
>
>    
>> [...]
>>
>> This is already fully usable and works great; as Avi showed, it boosts
>> even a sort on the host by 6% (think about HPC applications). Soon I
>> hope to boost gcc on the host by 6% too (and by >15% in guest with
>> NPT/EPT) by extending vm_end in 2M chunks in glibc, at least for those
>> huge gcc builds taking >200M like translate.o of qemu-kvm... (so I
>> hope gcc running on a KVM guest, thanks to EPT/NPT, will soon run
>> faster than on a mainline kernel without transparent hugepages on
>> bare metal).
>>      
> I think what would be needed is some non-virtualization speedup example of a
> 'non-special' workload, running on the native/host kernel. 'sort' is an
> interesting usecase - could it be patched to use hugepages if it has to sort
> through lots of data?
>    

In fact it works well unpatched, the 6% I measured was with the system sort.

Currently in order to use hugepages (with the 'always' option) the only 
requirement is that the application uses a few large vmas.

> Is it practical to run something like a plain make -jN kernel compile all in
> hugepages, and see a small but measurable speedup?
>    

I doubt it - kernel builds run in relatively little memory.  The link 
stage uses a lot of memory but is fairly fast (I guess due to the 
partial links before).  Building a template-heavy C++ application might 
show some gains.

> Although it's not an ideal workload for computational speedups at all because
> a lot of the time we spend in a kernel build is really buildup/teardown of
> process state/context and similar 'administrative' overhead, while the true
> 'compilation work' is just a burst of a few dozen milliseconds and then we
> tear down all the state again. (It's very inefficient really.)
>
> Something like GIMP calculations would be a lot more representative of the
> speedup potential. Is it possible to run the GIMP with transparent hugepages
> enabled for it?
>    

I thought of it, but raster work is too regular so speculative execution 
should hide the tlb fill latency.  It's also easy to code in a way which 
hides cache effects (no idea if it is actually coded that way).  Sort 
showed a speedup since it defeats branch prediction and thus the 
processor cannot pipeline the loop.

I thought ray tracers with large scenes should show a nice speedup, but 
setting this up is beyond my capabilities.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 19:22                             ` Avi Kivity
@ 2010-04-10 19:47                               ` Ingo Molnar
  2010-04-10 20:00                                 ` Andrea Arcangeli
  2010-04-10 20:24                                 ` Avi Kivity
  0 siblings, 2 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-10 19:47 UTC (permalink / raw)
  To: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser
  Cc: Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> > I think what would be needed is some non-virtualization speedup example of 
> > a 'non-special' workload, running on the native/host kernel. 'sort' is an 
> > interesting usecase - could it be patched to use hugepages if it has to 
> > sort through lots of data?
> 
> In fact it works well unpatched, the 6% I measured was with the system sort.

Yes - but you intentionally sorted something large - the question is, how big 
is the slowdown with small sizes (if there's a slowdown), where is the 
break-even point (if any)?

> > [...]
> >
> > Something like GIMP calculations would be a lot more representative of the 
> > speedup potential. Is it possible to run the GIMP with transparent 
> > hugepages enabled for it?
> 
> I thought of it, but raster work is too regular so speculative execution 
> should hide the tlb fill latency.  It's also easy to code in a way which 
> hides cache effects (no idea if it is actually coded that way).  Sort showed 
> a speedup since it defeats branch prediction and thus the processor cannot 
> pipeline the loop.

Would be nice to try because there's a lot of transformations within Gimp - 
and Gimp can be scripted. It's also a test for negatives: if there is an 
across-the-board _lack_ of speedups, it shows that it's not really general 
purpose but more specialistic.

If the optimization is specialistic, then that's somewhat of an argument 
against automatic/transparent handling. (even though even if the beneficiaries 
turn out to be only special workloads then transparency still has advantages.)

> I thought ray tracers with large scenes should show a nice speedup, but 
> setting this up is beyond my capabilities.

Oh, this tickled some memories: x264 compressed encoding can be very cache and 
TLB intense. Something like the encoding of a 350 MB video file:

  wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # NOTE: 350 MB!
  x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4

would be another thing worth trying with transparent-hugetlb enabled.

(i've Cc:-ed x264 benchmarking experts - in case i missed something)

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 19:47                               ` Ingo Molnar
@ 2010-04-10 20:00                                 ` Andrea Arcangeli
  2010-04-10 20:10                                   ` Andrea Arcangeli
  2010-04-10 20:21                                   ` Jason Garrett-Glaser
  2010-04-10 20:24                                 ` Avi Kivity
  1 sibling, 2 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-10 20:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sat, Apr 10, 2010 at 09:47:51PM +0200, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> > > I think what would be needed is some non-virtualization speedup example of 
> > > a 'non-special' workload, running on the native/host kernel. 'sort' is an 
> > > interesting usecase - could it be patched to use hugepages if it has to 
> > > sort through lots of data?
> > 
> > In fact it works well unpatched, the 6% I measured was with the system sort.
> 
> Yes - but you intentionally sorted something large - the question is, how big 
> is the slowdown with small sizes (if there's a slowdown), where is the 
> break-even point (if any)?

The only chance of a slowdown is if try_to_compact_pages or
try_to_free_pages takes longer and runs more frequently for order 9
allocations than try_to_free_pages would for an order 0 allocation.
That is only a problem for short-lived frequent allocations when
memory compaction fails to provide a hugepage (as it'll run multiple
times even if not needed, which is what the future exponential backoff
logic is about).

This is why I recommended running any "real life DB" benchmark with
transparent_hugepage/defrag set to both "always" and "never": "never"
will make any slowdown practically impossible to measure. The only
other case with a potential for a minor slowdown compared to 4k pages
is COW: the 2M copy will trash the cache, and we need it to use
non-temporal stores, but even that will be offset by the boost in TLB
terms, saving memory accesses on the ptes. Which is my reason for
avoiding any optimistic prefault and for only going huge when we get
the TLB benefit in return (not just the pagefault speedup; the
pagefault speedup is a double-edged sword, it trashes more caches, so
you need more than that for it to be worth it).
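
(For illustration, the userspace analogue of such a cache-avoiding
copy, using SSE2 non-temporal stores; a sketch only, not the kernel's
actual COW copy path:)

#include <emmintrin.h>	/* SSE2: _mm_stream_si128 */
#include <stddef.h>

/* Assumes 16-byte aligned buffers and a length multiple of 16. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++)		/* stores bypass the cache */
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
	_mm_sfence();				/* order the streaming stores */
}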

> Would be nice to try because there's a lot of transformations within Gimp - 
> and Gimp can be scripted. It's also a test for negatives: if there is an 
> across-the-board _lack_ of speedups, it shows that it's not really general 
> purpose but more specialistic.
> 
> If the optimization is specialistic, then that's somewhat of an argument 
> against automatic/transparent handling. (even though even if the beneficiaries 
> turn out to be only special workloads then transparency still has advantages.)
> 
> > I thought ray tracers with large scenes should show a nice speedup, but 
> > setting this up is beyond my capabilities.
> 
> Oh, this tickled some memories: x264 compressed encoding can be very cache and 
> TLB intense. Something like the encoding of a 350 MB video file:
> 
>   wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # NOTE: 350 MB!
>   x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4
> 
> would be another thing worth trying with transparent-hugetlb enabled.
> 
> (i've Cc:-ed x264 benchmarking experts - in case i missed something)

It's definitely worth trying... nice idea. But we need glibc to extend
vm_end in 2M-aligned chunks, otherwise we've to work around it in the
kernel, for short-lived allocations like gcc to take advantage of
this. I managed to get 200M of gcc's memory (of ~500M total) building
translate.o into hugepages with two glibc params, but I want it all in
transhuge before I measure it. I'm running it on the workstation that
has a day and a half of uptime; it's still building more packages as I
write this, and running large vfs loads in /usr and maildir.
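
(A guess at the kind of glibc tuning meant here; the two parameters are
not named above, so M_MMAP_THRESHOLD and M_TOP_PAD below are
assumptions, with arbitrary values:)

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
	/* keep big allocations on the brk heap instead of private mmaps */
	mallopt(M_MMAP_THRESHOLD, 256 * 1024 * 1024);
	/* pad every sbrk so the heap top grows in 2M steps */
	mallopt(M_TOP_PAD, 2 * 1024 * 1024);

	/* a large allocation now lands on the heap, where 2M-aligned spans
	   can end up backed by transparent hugepages */
	void *p = malloc(200UL * 1024 * 1024);
	free(p);
	return 0;
}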


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:00                                 ` Andrea Arcangeli
@ 2010-04-10 20:10                                   ` Andrea Arcangeli
  2010-04-10 20:21                                   ` Jason Garrett-Glaser
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-10 20:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sat, Apr 10, 2010 at 10:00:37PM +0200, Andrea Arcangeli wrote:
> and we need it to use non temporal stores, but even that will be

To clarify, I mean using non-temporal stores only on the CPUs with <8M
L2 caches; on some of the Xeons, preloading the cache may provide an
even further boost to the child with hugepages, in addition to the
longstanding benefits of hugetlb for long-lived allocations.

Furthermore there is also an option (only available when DEBUG_VM is
on, called transparent_hugepage/debug_cow) to COW with 4k copies
(exactly like we have to do if cow fails to allocate a hugepage; it's
the cow fallback) that already eliminates any chance of slowdown in
practice. But I don't recommend it at all, because while it may provide
a minor speedup immediately after the cow with an l2 cache <4M, it then
slows down the child forever and eliminates the more important
longstanding benefits.

This is all very nitpicky at this point, but I just wanted to cover,
for completeness, all the details I'm aware of on the subtopic you
mentioned.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:00                                 ` Andrea Arcangeli
  2010-04-10 20:10                                   ` Andrea Arcangeli
@ 2010-04-10 20:21                                   ` Jason Garrett-Glaser
  1 sibling, 0 replies; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-10 20:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

>> (i've Cc:-ed x264 benchmarking experts - in case i missed something)
>
> It's definitely worth trying... nice idea. But we need glibc to extend
> vm_end in 2M-aligned chunks, otherwise we've to work around it in the
> kernel, for short-lived allocations like gcc to take advantage of
> this. I managed to get 200M of gcc's memory (of ~500M total) building
> translate.o into hugepages with two glibc params, but I want it all in
> transhuge before I measure it. I'm running it on the workstation that
> has a day and a half of uptime; it's still building more packages as I
> write this, and running large vfs loads in /usr and maildir.
>

Just an FYI on this--if you're testing x264, it performs _all_ memory
allocation on init and never mallocs again, so it's a good testbed for
something that uses a lot of memory but doesn't malloc/free a lot.

Jason


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 19:47                               ` Ingo Molnar
  2010-04-10 20:00                                 ` Andrea Arcangeli
@ 2010-04-10 20:24                                 ` Avi Kivity
  2010-04-10 20:42                                   ` Avi Kivity
  2010-04-11 10:46                                   ` [PATCH 00 of 41] Transparent Hugepage Support #17 Ingo Molnar
  1 sibling, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-10 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/10/2010 10:47 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com>  wrote:
>
>    
>>> I think what would be needed is some non-virtualization speedup example of
>>> a 'non-special' workload, running on the native/host kernel. 'sort' is an
>>> interesting usecase - could it be patched to use hugepages if it has to
>>> sort through lots of data?
>>>        
>> In fact it works well unpatched, the 6% I measured was with the system sort.
>>      
> Yes - but you intentionally sorted something large - the question is, how big
> is the slowdown with small sizes (if there's a slowdown), where is the
> break-even point (if any)?
>    

There shouldn't be a slowdown as far as I can tell.  The danger IMO is 
pinning down unused pages inside a huge page and so increasing memory 
pressure artificially.

The point where this starts to win would be more or less when the page 
tables mapping the working set hit the size of the last-level cache, 
multiplied by some loading factor (guess: 0.5).  So if you have  a 4MB 
cache, the win should start at around 1GB working set.
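
Back-of-the-envelope, assuming 4KB pages and 8-byte ptes:

    page tables needed = 8 bytes per 4KB mapped = working set / 512
    break-even working set ~= 0.5 * 4MB cache * 512 = 1GB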


>>> Something like GIMP calculations would be a lot more representative of the
>>> speedup potential. Is it possible to run the GIMP with transparent
>>> hugepages enabled for it?
>>>        
>> I thought of it, but raster work is too regular so speculative execution
>> should hide the tlb fill latency.  It's also easy to code in a way which
>> hides cache effects (no idea if it is actually coded that way).  Sort showed
>> a speedup since it defeats branch prediction and thus the processor cannot
>> pipeline the loop.
>>      
> Would be nice to try because there's a lot of transformations within Gimp -
> and Gimp can be scripted. It's also a test for negatives: if there is an
> across-the-board _lack_ of speedups, it shows that it's not really general
> purpose but more specialistic.
>    

Right, but I don't think I can tell which transforms are likely to be 
sped up.  Also, do people manipulate 500MB images regularly?

A 20MB image won't see a significant improvement (40KB page tables, 
that's chickenfeed).

> If the optimization is specialistic, then that's somewhat of an argument
> against automatic/transparent handling. (even though even if the beneficiaries
> turn out to be only special workloads then transparency still has advantages.)
>    

Well, we know that databases, virtualization, and server-side java win 
from this.  (Oracle won't benefit from this implementation since it 
wants shared, not anonymous, memory, but other databases may).  I'm 
guessing large C++ compiles, and perhaps the new link-time optimization 
feature, will also see a nice speedup.

Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox 
RSS, probably not so far in the future.

>> I thought ray tracers with large scenes should show a nice speedup, but
>> setting this up is beyond my capabilities.
>>      
> Oh, this tickled some memories: x264 compressed encoding can be very cache and
> TLB intense. Something like the encoding of a 350 MB video file:
>
>    wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # NOTE: 350 MB!
>    x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4
>
> would be another thing worth trying with transparent-hugetlb enabled.
>
>    

I'll try it out.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:24                                 ` Avi Kivity
@ 2010-04-10 20:42                                   ` Avi Kivity
  2010-04-10 20:47                                     ` Andrea Arcangeli
  2010-04-10 20:49                                     ` Jason Garrett-Glaser
  2010-04-11 10:46                                   ` [PATCH 00 of 41] Transparent Hugepage Support #17 Ingo Molnar
  1 sibling, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-10 20:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/10/2010 11:24 PM, Avi Kivity wrote:
>> Oh, this tickled some memories: x264 compressed encoding can be very 
>> cache and
>> TLB intense. Something like the encoding of a 350 MB video file:
>>
>>    wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # 
>> NOTE: 350 MB!
>>    x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4
>>
>> would be another thing worth trying with transparent-hugetlb enabled.
>>
>
> I'll try it out.
>

3-5% improvement.  I had to tune khugepaged to scan more aggressively 
since the run is so short.  The working set is only ~100MB here though.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:42                                   ` Avi Kivity
@ 2010-04-10 20:47                                     ` Andrea Arcangeli
  2010-04-10 21:00                                       ` Avi Kivity
  2010-04-10 20:49                                     ` Jason Garrett-Glaser
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-10 20:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sat, Apr 10, 2010 at 11:42:44PM +0300, Avi Kivity wrote:
> 3-5% improvement.  I had to tune khugepaged to scan more aggressively 
> since the run is so short.  The working set is only ~100MB here though.

We need to either solve it with a kernel workaround or have an
environment var for glibc to do the right thing...

The best I've got so far with gcc is with the two settings below; about
half goes into hugepages, but it's not enough, as mallocs invoked from
libraries likely go into the heap, which is extended 1M at a time.

export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
export MALLOC_TOP_PAD_=$[1024*1024*1024]
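
The same two knobs can also be set from inside the process via glibc's
documented mallopt() tunables - a minimal sketch:

#include <malloc.h>

int main(void)
{
        /* Same effect as the environment variables above: keep large
         * mallocs on the main heap instead of in private mmaps, and
         * pad heap growth so brk is extended in big steps. */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024 * 1024);
        mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
        /* ... then allocate as usual ... */
        return 0;
}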

Whatever we do, it has to be possible to disable it, of course, with
malloc debug options or with Electric Fence; but it's not as if the
default 1M provides any benefit compared to growing the heap 2M-aligned
;) so it's quite an obvious thing to address in glibc in my view. Then,
if it takes too much RAM on small systems, echo madvise
>/sys/kernel/mm/transparent_hugepage/enabled will retain the
optimization for the qemu guest physical address space and other
regions that are guaranteed not to waste memory, and that are also a
must-have on embedded systems, which have even smaller L2 caches and
slower CPUs where every optimization matters.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:42                                   ` Avi Kivity
  2010-04-10 20:47                                     ` Andrea Arcangeli
@ 2010-04-10 20:49                                     ` Jason Garrett-Glaser
  2010-04-10 20:53                                       ` Avi Kivity
  1 sibling, 1 reply; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-10 20:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sat, Apr 10, 2010 at 1:42 PM, Avi Kivity <avi@redhat.com> wrote:
> On 04/10/2010 11:24 PM, Avi Kivity wrote:
>>>
>>> Oh, this tickled some memories: x264 compressed encoding can be very
>>> cache and
>>> TLB intense. Something like the encoding of a 350 MB video file:
>>>
>>>   wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m       # NOTE:
>>> 350 MB!
>>>   x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4
>>>
>>> would be another thing worth trying with transparent-hugetlb enabled.
>>>
>>
>> I'll try it out.
>>
>
> 3-5% improvement.  I had to tune khugepaged to scan more aggressively since
> the run is so short.  The working set is only ~100MB here though.

I'd try some longer runs with larger datasets to do more testing.

Some things to try:

1) Pick a 1080p or even 2160p sequence from http://media.xiph.org/video/derf/

2) Use --preset ultrafast or similar to do a ridiculously
memory-bandwidth-limited runthrough.

3) Use --preset veryslow or similar to do a very not-memory-limited runthrough.

Jason


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:49                                     ` Jason Garrett-Glaser
@ 2010-04-10 20:53                                       ` Avi Kivity
  2010-04-10 20:58                                         ` Jason Garrett-Glaser
  2010-04-11  9:29                                         ` Avi Kivity
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-10 20:53 UTC (permalink / raw)
  To: Jason Garrett-Glaser
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote:
>
>> 3-5% improvement.  I had to tune khugepaged to scan more aggressively since
>> the run is so short.  The working set is only ~100MB here though.
>>      
> I'd try some longer runs with larger datasets to do more testing.
>
> Some things to try:
>
> 1) Pick a 1080p or even 2160p sequence from http://media.xiph.org/video/derf/
>
>    

Ok, I'm downloading crowd_run 2160p, but it will take a while.

> 2) Use --preset ultrafast or similar to do a ridiculously
> memory-bandwidth-limited runthrough.
>
>    

Large pages improve random-access memory bandwidth but don't change 
sequential access.  Which of these does --preset ultrafast change?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:53                                       ` Avi Kivity
@ 2010-04-10 20:58                                         ` Jason Garrett-Glaser
  2010-04-11  9:29                                         ` Avi Kivity
  1 sibling, 0 replies; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-10 20:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sat, Apr 10, 2010 at 1:53 PM, Avi Kivity <avi@redhat.com> wrote:
> On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote:
>>
>>> 3-5% improvement.  I had to tune khugepaged to scan more aggressively
>>> since
>>> the run is so short.  The working set is only ~100MB here though.
>>>
>>
>> I'd try some longer runs with larger datasets to do more testing.
>>
>> Some things to try:
>>
>> 1) Pick a 1080p or even 2160p sequence from
>> http://media.xiph.org/video/derf/
>>
>>
>
> Ok, I'm downloading crowd_run 2160p, but it will take a while.

You can always cheat by synthesizing a fake sample like this:

ffmpeg -i input.y4m -s 3840x2160 output.y4m

Or something similar.

Do be careful though; extremely fast presets combined with large input
samples will be disk-bottlenecked, so make sure to keep it small
enough to fit in disk cache and "prime" the cache before testing.

>> 2) Use --preset ultrafast or similar to do a ridiculously
>> memory-bandwidth-limited runthrough.
>>
>>
>
> Large pages improve random-access memory bandwidth but don't change
> sequential access.  Which of these does --preset ultrafast change?

Hmm, I'm not quite sure.  The process is strictly sequential, but
there is clearly enough random access mixed in to cause some sort of
change given your previous test.  The main thing faster presets do is
decrease the amount of "work" done at each step, resulting in roughly
the same amount of memory bandwidth being required for each step--but
in a much shorter period of time.  Most "work" done at each step stays
well within the L2 cache.

Jason


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:47                                     ` Andrea Arcangeli
@ 2010-04-10 21:00                                       ` Avi Kivity
  2010-04-10 21:47                                         ` Andrea Arcangeli
  2010-04-11  1:05                                         ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-10 21:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/10/2010 11:47 PM, Andrea Arcangeli wrote:
> On Sat, Apr 10, 2010 at 11:42:44PM +0300, Avi Kivity wrote:
>    
>> 3-5% improvement.  I had to tune khugepaged to scan more aggressively
>> since the run is so short.  The working set is only ~100MB here though.
>>      
> We need to either solve it with a kernel workaround or have an
> environment var for glibc to do the right thing...
>
>    

IMO, both.  The kernel should align vmas on 2MB boundaries (good for 
small pages as well).  glibc should use 2MB increments.  Even on <2MB 
sized vmas, the kernel should reserve the large page frame for a while 
in the hope that the application will use it in a short while.
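
Until the kernel does that automatically, userland can approximate the
alignment by over-allocating and trimming - a hypothetical helper,
assuming len is a multiple of 2M:

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* mmap one huge page of slack, then unmap the misaligned head and
 * tail, so the region that is left starts (and, for a 2M-multiple
 * len, ends) on a 2M boundary. */
static void *mmap_2m_aligned(size_t len)
{
        size_t padded = len + HPAGE_SIZE;
        char *p = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uintptr_t start;
        size_t head, tail;

        if (p == MAP_FAILED)
                return NULL;
        start = ((uintptr_t)p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        head = start - (uintptr_t)p;
        tail = (uintptr_t)p + padded - (start + len);
        if (head)
                munmap(p, head);
        if (tail)
                munmap((void *)(start + len), tail);
        return (void *)start;
}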

> The best I've got so far with gcc is with the two settings below; about
> half goes into hugepages, but it's not enough, as mallocs invoked from
> libraries likely go into the heap, which is extended 1M at a time.
>    

There are also guard pages around stacks IIRC, we could make them 2MB on 
x86-64.

> export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
> export MALLOC_TOP_PAD_=$[1024*1024*1024]
>
> Whatever we do, it has to be possible to disable it, of course, with
> malloc debug options or with Electric Fence; but it's not as if the
> default 1M provides any benefit compared to growing the heap 2M-aligned
> ;) so it's quite an obvious thing to address in glibc in my view.

Well, but mapping a 2MB vma with a large page could be a considerable 
waste if the application doesn't eventually use it.  I'd like to map the 
pages with small pages (belonging to a large frame) and if the 
application actually uses the pages, switch to a large pte.

Something that can also improve small pages is to prefault the vma with 
small pages, but with the accessed and dirty bit cleared.  Later, we 
check those bits and reclaim the pages if they're unused, or coalesce 
them if they were used.  The nice thing is that we save tons of page 
faults in the common case where the pages are used.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 21:00                                       ` Avi Kivity
@ 2010-04-10 21:47                                         ` Andrea Arcangeli
  2010-04-11  1:05                                         ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-10 21:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sun, Apr 11, 2010 at 12:00:29AM +0300, Avi Kivity wrote:
> IMO, both.  The kernel should align vmas on 2MB boundaries (good for 
> small pages as well).  glibc should use 2MB increments.  Even on <2MB 

Agreed.

> sized vmas, the kernel should reserve the large page frame for a while 
> in the hope that the application will use it in a short while.

I don't see the need for this per-process; the buddy logic is already
doing exactly that for us... (even without the movable/unmovable
fallback logic).

> There are also guard pages around stacks IIRC, we could make them 2MB on 
> x86-64.

Agreed. That will provide little benefit though: stack usage is quite
local near the top, and few apps store bulk data there (it's hard to
reach even 512k). Firefox has a 300k stack, so it'll waste >1M per
process. If the stack grows and the application is long lived,
khugepaged takes care of it already. But personally I tend to like a
black/white approach as much as possible, so I agree with making the
vma large enough immediately if enabled = always.

> Well, but mapping a 2MB vma with a large page could be a considerable 
> waste if the application doesn't eventually use it.  I'd like to map the 
> pages with small pages (belonging to a large frame) and if the 
> application actually uses the pages, switch to a large pte.
>
> Something that can also improve small pages is to prefault the vma with 
> small pages, but with the accessed and dirty bit cleared.  Later, we 
> check those bits and reclaim the pages if they're unused, or coalesce 
> them if they were used.  The nice thing is that we save tons of page 
> faults in the common case where the pages are used.

Yeah we could do that. I'm not against it but it's not my preference
to do these things. Anything that introduces the risk of performance
regressions in corner cases frightens me. I prefer to pay with RAM
anytime.

Again, I like to keep the design as black/white as possible: if
somebody is RAM-constrained he shouldn't leave enabled=always but keep
enabled=madvise. That's the whole point of having added enabled=madvise:
for whoever is RAM-constrained but wants to run faster anyway with zero
risk of wasted RAM.

These days even desktop systems have more RAM than needed, so I don't
see the big deal: we should squeeze every possible CPU cycle out of the
RAM (even in the user stack, even if that's likely not significant and
just a RAM waste), and not waste CPU on pre-faulting, or on migrating
4k pages to 2M pages when vm_end grows and then having to find which
unmapped pages of a hugepage to reclaim after splitting it on the fly.

I want to reduce to the minimum the risk of regressions anywhere when
full transparency is enabled. This also has the benefit of keeping the
kernel code simpler, with fewer special cases ;).

It may not be ideal if you have a 1G desktop system and you want to run
faster when encoding a movie, but for that there's exactly
madvise(MADV_HUGEPAGE). qemu-kvm/transcode/ffmpeg can all use a little
madvise on their big chunks of memory. khugepaged should also learn to
prioritize those VM_HUGEPAGE vmas before scanning the rest (which it
doesn't do right now, to keep it a bit simpler, but obviously there's
room for improvement).
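
For reference, the application-side opt-in is tiny - a sketch, with the
MADV_HUGEPAGE value assumed from patch 01 of this series:

#include <sys/mman.h>
#include <stdlib.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14        /* assumed value, from patch 01 */
#endif

/* Hypothetical helper: hand out a big 2M-aligned buffer and mark it
 * as a hugepage candidate, so it is collapsed even with
 * enabled=madvise. */
static void *alloc_thp_buffer(size_t size)
{
        void *p;

        if (posix_memalign(&p, 2 * 1024 * 1024, size))
                return NULL;
        madvise(p, size, MADV_HUGEPAGE);        /* purely advisory */
        return p;
}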

Anyway, I think we can start by aligning to 2M size the vmas that don't
pad themselves against the previous vma, and by having the stack
aligned too, so the page faults will fill them automatically. Changing
glibc to grow the heap 2M instead of 1M at a time is a one-liner change
to a #define, and it'll also halve the number of mmap syscalls, so it's
quite a straightforward next step. I also need to make it NUMA-aware
with alloc_pages_vma. Both are simple enough that I can do them right
now without worries. Then we can rethink making the kernel more
complex. I don't mean it's a bad idea, just less obvious than paying
with RAM and staying simpler... I want to be sure this is rock solid
before we go ahead doing more complex stuff. There have been zero
problems so far: backing out the anon-vma changes solved the only bug
that triggered without memory compaction (it showed a skew between the
pmd_huge mappings and page_mapcount because of anon-vma errors), and
memory compaction also works great now with the last integration fix ;).


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 21:00                                       ` Avi Kivity
  2010-04-10 21:47                                         ` Andrea Arcangeli
@ 2010-04-11  1:05                                         ` Andrea Arcangeli
  2010-04-11 11:24                                           ` Ingo Molnar
  2010-04-25 19:27                                           ` Andrea Arcangeli
  1 sibling, 2 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-11  1:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

> > export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
> > export MALLOC_TOP_PAD_=$[1024*1024*1024]

With the above two params I get around 200M (around half) in
hugepages with gcc building translate.o:

$ rm translate.o ; time make translate.o
  CC    translate.o

real    0m22.900s
user    0m22.601s
sys     0m0.260s
$ rm translate.o ; time make translate.o
  CC    translate.o

real    0m22.405s
user    0m22.125s
sys     0m0.240s
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# exit
$ rm translate.o ; time make translate.o
  CC    translate.o

real    0m24.128s
user    0m23.725s
sys     0m0.376s
$ rm translate.o ; time make translate.o
  CC    translate.o

real    0m24.126s
user    0m23.725s
sys     0m0.376s
$ uptime
 02:36:07 up 1 day, 19:45,  5 users,  load average: 0.01, 0.12, 0.08

1 sec out of 24 means around 4% faster; hopefully when glibc fully
cooperates we'll get better results than the above with gcc...

I tried to emulate it with khugepaged running in a loop and I get
almost the whole gcc anon memory in hugepages this way (as expected):

# echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# exit
rm translate.o ; time make translate.o
  CC    translate.o

real    0m21.950s
user    0m21.481s
sys     0m0.292s
$ rm translate.o ; time make translate.o
  CC    translate.o

real    0m21.992s
user    0m21.529s
sys     0m0.288s
$ 

So this takes more than 2 seconds away from 24 seconds reproducibly,
and it means gcc now runs 8% faster. This requires running khugepaged
at 100% of one of the four cores, but with a slight change to glibc
we'll be able to reach the exact same 8% speedup (or more, because this
also involves copying ~200M and sending IPIs to unmap pages and stop
userland during the memory copy, which won't be necessary anymore).

BTW, the current default for khugepaged is to scan 8 pmds every 10
seconds, which means collapsing at most 16M every 10 seconds. Checking
8 pmd pointers every 10 seconds, with 6 wakeups per minute for a kernel
thread, is absolutely unmeasurable; but despite the unmeasurable
overhead, it provides a very nice behavior for long-lived allocations
that may have been swapped in fragmented.
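
That is:

    8 pmds/pass * 2M/pmd = 16M max per pass, one pass every 10s
    -> 6 wakeups/minute, at most ~96M collapsed per minute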

This is on a Phenom X4; I'd be interested if somebody could try it on other CPUs.

To get the environment of the test just:

git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
cd qemu-kvm
make
cd x86_64-softmmu

export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
export MALLOC_TOP_PAD_=$[1024*1024*1024]
rm translate.o; time make translate.o

Then you need to flip the above sysfs controls as I did.

Thanks,
Andrea


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:53                                       ` Avi Kivity
  2010-04-10 20:58                                         ` Jason Garrett-Glaser
@ 2010-04-11  9:29                                         ` Avi Kivity
  2010-04-11  9:37                                           ` Jason Garrett-Glaser
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11  9:29 UTC (permalink / raw)
  To: Jason Garrett-Glaser
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On 04/10/2010 11:53 PM, Avi Kivity wrote:
> On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote:
>>
>>> 3-5% improvement.  I had to tune khugepaged to scan more 
>>> aggressively since
>>> the run is so short.  The working set is only ~100MB here though.
>> I'd try some longer runs with larger datasets to do more testing.
>>
>> Some things to try:
>>
>> 1) Pick a 1080p or even 2160p sequence from 
>> http://media.xiph.org/video/derf/
>>
>
> Ok, I'm downloading crowd_run 2160p, but it will take a while.
>

# time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
yuv4mpeg: 3840x2160@50/1fps, 1:1

encoded 500 frames, 0.68 fps, 251812.80 kb/s

real    12m17.154s
user    20m39.151s
sys    0m11.727s

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
# time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
yuv4mpeg: 3840x2160@50/1fps, 1:1

encoded 500 frames, 0.66 fps, 251812.80 kb/s

real    12m37.962s
user    21m13.506s
sys    0m11.696s

Just 2.7%, even though the working set was much larger.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  9:29                                         ` Avi Kivity
@ 2010-04-11  9:37                                           ` Jason Garrett-Glaser
  2010-04-11  9:40                                             ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-11  9:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sun, Apr 11, 2010 at 2:29 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/10/2010 11:53 PM, Avi Kivity wrote:
>>
>> On 04/10/2010 11:49 PM, Jason Garrett-Glaser wrote:
>>>
>>>> 3-5% improvement.  I had to tune khugepaged to scan more aggressively
>>>> since
>>>> the run is so short.  The working set is only ~100MB here though.
>>>
>>> I'd try some longer runs with larger datasets to do more testing.
>>>
>>> Some things to try:
>>>
>>> 1) Pick a 1080p or even 2160p sequence from
>>> http://media.xiph.org/video/derf/
>>>
>>
>> Ok, I'm downloading crowd_run 2160p, but it will take a while.
>>
>
> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> yuv4mpeg: 3840x2160@50/1fps, 1:1
>
> encoded 500 frames, 0.68 fps, 251812.80 kb/s
>
> real    12m17.154s
> user    20m39.151s
> sys    0m11.727s
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo never > /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> yuv4mpeg: 3840x2160@50/1fps, 1:1
>
> encoded 500 frames, 0.66 fps, 251812.80 kb/s
>
> real    12m37.962s
> user    21m13.506s
> sys    0m11.696s
>
> Just 2.7%, even though the working set was much larger.

Did you make sure to check your stddev on those?

I'm also curious how it compares for --preset ultrafast and so forth.

Jason


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  9:37                                           ` Jason Garrett-Glaser
@ 2010-04-11  9:40                                             ` Avi Kivity
  2010-04-11 10:22                                               ` Jason Garrett-Glaser
  2010-04-11 11:00                                               ` Ingo Molnar
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-11  9:40 UTC (permalink / raw)
  To: Jason Garrett-Glaser
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote:
>
>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>
>> encoded 500 frames, 0.68 fps, 251812.80 kb/s
>>
>> real    12m17.154s
>> user    20m39.151s
>> sys    0m11.727s
>>
>> # echo never>  /sys/kernel/mm/transparent_hugepage/enabled
>> # echo never>  /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>
>> encoded 500 frames, 0.66 fps, 251812.80 kb/s
>>
>> real    12m37.962s
>> user    21m13.506s
>> sys    0m11.696s
>>
>> Just 2.7%, even though the working set was much larger.
>>      
> Did you make sure to check your stddev on those?
>    

I'm doing another run to look at variability.

> I'm also curious how it compares for --preset ultrafast and so forth.
>    

Is this something realistic or just a benchmark thing?

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  9:40                                             ` Avi Kivity
@ 2010-04-11 10:22                                               ` Jason Garrett-Glaser
  2010-04-11 11:00                                               ` Ingo Molnar
  1 sibling, 0 replies; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-11 10:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sun, Apr 11, 2010 at 2:40 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote:
>>
>>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>>
>>> encoded 500 frames, 0.68 fps, 251812.80 kb/s
>>>
>>> real    12m17.154s
>>> user    20m39.151s
>>> sys    0m11.727s
>>>
>>> # echo never>  /sys/kernel/mm/transparent_hugepage/enabled
>>> # echo never>  /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
>>> # time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
>>> yuv4mpeg: 3840x2160@50/1fps, 1:1
>>>
>>> encoded 500 frames, 0.66 fps, 251812.80 kb/s
>>>
>>> real    12m37.962s
>>> user    21m13.506s
>>> sys    0m11.696s
>>>
>>> Just 2.7%, even though the working set was much larger.
>>>
>>
>> Did you make sure to check your stddev on those?
>>
>
> I'm doing another run to look at variability.
>
>> I'm also curious how it compares for --preset ultrafast and so forth.
>>
>
> Is this something realistic or just a benchmark thing?

Well, at 2160p, we're already a bit beyond the bounds of ordinary
applications.  Ultrafast is generally an "unrealistically fast"
setting, getting stupid performance levels like 200fps 1080p encoding
(at the cost of incredibly bad compression).  "veryfast" is probably a
more realistic test case (I know many companies using similar levels
of performance).

Jason


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 20:24                                 ` Avi Kivity
  2010-04-10 20:42                                   ` Avi Kivity
@ 2010-04-11 10:46                                   ` Ingo Molnar
  2010-04-11 10:49                                     ` Ingo Molnar
  2010-04-11 11:30                                     ` Avi Kivity
  1 sibling, 2 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> On 04/10/2010 10:47 PM, Ingo Molnar wrote:
> >* Avi Kivity<avi@redhat.com>  wrote:
> >
> >>>I think what would be needed is some non-virtualization speedup example of
> >>>a 'non-special' workload, running on the native/host kernel. 'sort' is an
> >>>interesting usecase - could it be patched to use hugepages if it has to
> >>>sort through lots of data?
> >>In fact it works well unpatched, the 6% I measured was with the system sort.
> >Yes - but you intentionally sorted something large - the question is, how big
> >is the slowdown with small sizes (if there's a slowdown), where is the
> >break-even point (if any)?
> 
> There shouldn't be a slowdown as far as I can tell. [...]

It does not hurt to double check the before/after micro-cost precisely - it 
would be nice to see a result of:

  perf stat -e instructions --repeat 100 sort /etc/passwd > /dev/null

with and without hugetlb.

Linus is right in that the patches are intrusive, and the answer to that isn't 
to insist that it isn't so (it evidently is so); the correct reply is to 
broaden the utility of the patches and to demonstrate that the feature is 
useful on a much wider spectrum of workloads.

> > Would be nice to try because there's a lot of transformations within Gimp 
> > - and Gimp can be scripted. It's also a test for negatives: if there is an 
> > across-the-board _lack_ of speedups, it shows that it's not really general 
> > purpose but more specialistic.
> 
> Right, but I don't think I can tell which transforms are likely to be sped 
> up.  Also, do people manipulate 500MB images regularly?
> 
> A 20MB image won't see a significant improvement (40KB page tables, that's 
> chickenfeed).

> > If the optimization is specialistic, then that's somewhat of an argument 
> > against automatic/transparent handling. (even though even if the 
> > beneficiaries turn out to be only special workloads then transparency 
> > still has advantages.)
> 
> Well, we know that databases, virtualization, and server-side java win from 
> this.  (Oracle won't benefit from this implementation since it wants shared, 
> not anonymous, memory, but other databases may). I'm guessing large C++ 
> compiles, and perhaps the new link-time optimization feature, will also see 
> a nice speedup.
> 
> Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox 
> RSS, probably not so far in the future.

1-2GB firefox RSS is reality for me.

Btw., there's another workload that could be cache sensitive, 'git grep':

 aldebaran:~/linux> perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca >/dev/null

 Performance counter stats for 'git grep arca' (5 runs):

     1882712774  cycles                     ( +-   0.074% )
     1153649442  instructions             #      0.613 IPC     ( +-   0.005% )
      518815167  dTLB-loads                 ( +-   0.035% )
        3028951  dTLB-load-misses           ( +-   1.223% )

    0.597161428  seconds time elapsed   ( +-   0.065% )

At first sight, with 7 cycles per cold TLB miss there's about 1.12% of a 
speedup potential in that workload. With just 1 cycle it's 0.16%. The real 
speedup ought to be somewhere in between.
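
The arithmetic behind those two bounds:

    3028951 misses * 7 cycles = ~21.2M cycles / 1882.7M total = ~1.12%
    3028951 misses * 1 cycle  =  ~3.0M cycles / 1882.7M total = ~0.16%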

Btw., instead of throwing random numbers like '3-4%' into this thread it would 
be nice if you could send 'perf stat --repeat' numbers like i did above - they 
have an error bar, they show the TLB details, they show the cycles and 
instructions proportion and they are also far more precise than 'time' based 
results.

Thanks,

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 10:46                                   ` [PATCH 00 of 41] Transparent Hugepage Support #17 Ingo Molnar
@ 2010-04-11 10:49                                     ` Ingo Molnar
  2010-04-11 11:30                                     ` Avi Kivity
  1 sibling, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 10:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Ingo Molnar <mingo@elte.hu> wrote:

> > Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox 
> > RSS, probably not so far in the future.
> 
> 1-2GB firefox RSS is reality for me.
> 
> Btw., there's another workload that could be cache sensitive, 'git grep':
> 
>  aldebaran:~/linux> perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca >/dev/null
> 
>  Performance counter stats for 'git grep arca' (5 runs):
> 
>      1882712774  cycles                     ( +-   0.074% )
>      1153649442  instructions             #      0.613 IPC     ( +-   0.005% )
>       518815167  dTLB-loads                 ( +-   0.035% )
>         3028951  dTLB-load-misses           ( +-   1.223% )
> 
>     0.597161428  seconds time elapsed   ( +-   0.065% )

Sidenote: you might want to try the cool new threaded git grep from the 
upstream Git project:

 git clone git://git.kernel.org/pub/scm/git/git.git
 cd git
 make -j

Beyond being faster, it will also probably show a bigger hugetlb speedup, as 
the effective per-core (and per-hyperthread) cache set is smaller than for a 
single-threaded git grep.

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  9:40                                             ` Avi Kivity
  2010-04-11 10:22                                               ` Jason Garrett-Glaser
@ 2010-04-11 11:00                                               ` Ingo Molnar
  2010-04-11 11:19                                                 ` Avi Kivity
  1 sibling, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 11:00 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jason Garrett-Glaser, Mike Galbraith, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> On 04/11/2010 12:37 PM, Jason Garrett-Glaser wrote:
> >
> >># time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> >>yuv4mpeg: 3840x2160@50/1fps, 1:1
> >>
> >>encoded 500 frames, 0.68 fps, 251812.80 kb/s
> >>
> >>real    12m17.154s
> >>user    20m39.151s
> >>sys    0m11.727s
> >>
> >># echo never>  /sys/kernel/mm/transparent_hugepage/enabled
> >># echo never>  /sys/kernel/mm/transparent_hugepage/khugepaged/enabled
> >># time x264 --crf 20 --quiet crowd_run_2160p.y4m -o /dev/null --threads 2
> >>yuv4mpeg: 3840x2160@50/1fps, 1:1
> >>
> >>encoded 500 frames, 0.66 fps, 251812.80 kb/s
> >>
> >>real    12m37.962s
> >>user    21m13.506s
> >>sys    0m11.696s
> >>
> >>Just 2.7%, even though the working set was much larger.
> >Did you make sure to check your stddev on those?
> 
> I'm doing another run to look at variability.

Sigh. Could you please stop using stone-age tools like /usr/bin/time and 
instead use:

 perf stat --repeat 3 x264 ...

you can install it via:

 cd linux
 cd tools/perf/
 make -j install

That way you will see 'variability' (stddev/error bars/fuzz), and a whole lot 
of other CPU details on top of much more precise measurements:

 $ perf stat --repeat 3 x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 2
 yuv4mpeg: 704x576@60/1fps, 128:117

 encoded 2 frames, 23.47 fps, 39824.64 kb/s
 yuv4mpeg: 704x576@60/1fps, 128:117

 encoded 2 frames, 23.52 fps, 39824.64 kb/s
 yuv4mpeg: 704x576@60/1fps, 128:117

 encoded 2 frames, 23.45 fps, 39824.64 kb/s

 Performance counter stats for 'x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 2' (3 runs):

     130.624286  task-clock-msecs         #      1.496 CPUs    ( +-   0.081% )
             74  context-switches         #      0.001 M/sec   ( +-   7.151% )
              3  CPU-migrations           #      0.000 M/sec   ( +-  25.000% )
           2987  page-faults              #      0.023 M/sec   ( +-   0.162% )
      389234822  cycles                   #   2979.804 M/sec   ( +-   0.081% )
      481360693  instructions             #      1.237 IPC     ( +-   0.036% )
        4206296  cache-references         #     32.201 M/sec   ( +-   0.387% )
          55732  cache-misses             #      0.427 M/sec   ( +-   0.529% )

    0.087336553  seconds time elapsed   ( +-   0.100% )

Note that perf stat will run fine on older [pre-2.6.31] kernels too (it will 
measure elapsed time) and even there it will be much more precise than 
/usr/bin/time.

For more dTLB details, use something like:

 perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 x264 ...

Yes, I know we had a big flamewar about perf kvm, but IMHO that is no reason 
for you to pretend that this tool doesn't exist ;-)

> > I'm also curious how it compares for --preset ultrafast and so forth.
> 
> Is this something realistic or just a benchmark thing?

I'd suggest you use the default settings, to make it realistic. (Maybe 
also 'advanced/high-quality' settings that an advanced user would utilize.)

There is no doubt that benchmark advantages can be shown - the point of this 
exercise is to show that there are real-life speedups to various categories of 
non-server apps that hugetlb gives us.

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 11:00                                               ` Ingo Molnar
@ 2010-04-11 11:19                                                 ` Avi Kivity
  2010-04-11 11:30                                                   ` Jason Garrett-Glaser
  2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 11:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Garrett-Glaser, Mike Galbraith, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/11/2010 02:00 PM, Ingo Molnar wrote:
>>>
>>> Did you make sure to check your stddev on those?
>>>        
>> I'm doing another run to look at variability.
>>      
> Sigh. Could you please stop using stone-age tools like /usr/bin/time and
> instead use:
>    

I did one more run for each setting and got the same results (within a 
second).

> Yes, I know we had a big flamewar about perf kvm, but IMHO that is no reason
> for you to pretend that this tool doesn't exist ;-)
>    

I use it almost daily, not sure why you think I pretend it doesn't exist.

>> Is this something realistic or just a benchmark thing?
>>      
> I'd suggest you use the default settings, to make it realistic. (Maybe
> also 'advanced/high-quality' settings that an advanced user would utilize.)
>    

In fact I'm guessing --ultrafast would reduce the gain.  The lower the 
quality, the less time you spend looking at other frames to find 
commonality.  Like bzip2 -1/-9 memory footprint.

> There is no doubt that benchmark advantages can be shown - the point of this
> exercise is to show that there are real-life speedups to various categories of
> non-server apps that hugetlb gives us.
>    

I think hugetlb will mostly help server apps.  Desktop apps simply don't 
have working sets big enough to matter.  There will be exceptions, but 
as a rule, desktop apps won't benefit much from this.

-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  1:05                                         ` Andrea Arcangeli
@ 2010-04-11 11:24                                           ` Ingo Molnar
  2010-04-11 11:33                                             ` Avi Kivity
  2010-04-25 19:27                                           ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 11:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> So this takes more than 2 seconds away from 24 seconds reproducibly, and it 
> means gcc now runs 8% faster. [...]

That's fantastic if it's systematic... I'd give a limb for faster kbuild 
times in the >2% range.

Would be nice to see a precise before/after 'perf stat' comparison:

    perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 ...

that way we can see that the instruction count is roughly the same 
before/after, the cycle count goes down and we can also see the reduction in 
dTLB misses (and other advantages, if any).

Plus, here's a hugetlb usability feature request, if you don't mind me 
suggesting it.

This current usage (as root):

    echo never > /sys/kernel/mm/transparent_hugepage/enabled

is fine for testing, but it would also be nice to give fine-grained 
per-workload tunability for such details. It would be _very_ nice to have 
app-inheritable hugetlb attributes, plus a 'hugetlb' tool in tools/hugetlb/, 
which would allow per-workload tuning of hugetlb use. For example:

    hugetlb ctl --never ./my-workload.sh

would disable hugetlb usage in my-workload.sh (and all sub-processes). 
Running:

    hugetlb ctl --always ./my-workload.sh

would enable it. [or something like that - maybe there are better naming schemes]

Other commands:

    hugetlb stat

would show current allocation stats, etc.

Currently you have the 'hugetlbctl' app but IMO it limits the useful command 
space to 'control' ops only - it would be _much_ better to use the Git model: 
to name the tool in a much more generic way ('hugetlb' - the project name), 
and then let sub-commands be added like Git (and perf ;-) does.

Git has more than 70 subcommands currently, and the trend is growing. That 
command model scales and works well for smaller projects like perf (or 
hugetlb) as well.

Anyway, it was just a suggestion.

	Ingo


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 10:46                                   ` [PATCH 00 of 41] Transparent Hugepage Support #17 Ingo Molnar
  2010-04-11 10:49                                     ` Ingo Molnar
@ 2010-04-11 11:30                                     ` Avi Kivity
  2010-04-11 12:08                                       ` Ingo Molnar
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 11:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/11/2010 01:46 PM, Ingo Molnar wrote:
>
>> There shouldn't be a slowdown as far as I can tell. [...]
>>      
> It does not hurt to double check the before/after micro-cost precisely - it
> would be nice to see a result of:
>
>    perf stat -e instructions --repeat 100 sort /etc/passwd>  /dev/null
>
> with and without hugetlb.
>    

With:

         1036752  instructions             #      0.000 IPC     ( +-   0.092% )

Without:

         1036844  instructions             #      0.000 IPC     ( +-   0.100% )

> Linus is right in that the patches are intrusive, and the answer to that isnt
> to insist that it isnt so (it evidently is so),

No one is insisting the patches aren't intrusive.  We're insisting they 
bring a real benefit.  I think Linus' main objection was that hugetlb 
wouldn't work due to fragmentation, and I think we've demonstrated that 
antifrag/compaction do allow hugetlb to work even during a fragmenting 
workload running in parallel.

> the correct reply is to
> broaden the utility of the patches and to demonstrate that the feature is
> useful on a much wider spectrum of workloads.
>    

That's probably not the case.  I don't expect a significant improvement 
in desktop experience.  The benefit will be for workloads with large 
working sets and random access to memory.

>> Well, we know that databases, virtualization, and server-side java win from
>> this.  (Oracle won't benefit from this implementation since it wants shared,
>> not anonymous, memory, but other databases may). I'm guessing large C++
>> compiles, and perhaps the new link-time optimization feature, will also see
>> a nice speedup.
>>
>> Desktops will only benefit when they bloat to ~8GB RAM and 1-2GB firefox
>> RSS, probably not so far in the future.
>>      
> 1-2GB firefox RSS is reality for me.
>    

Mine usually crashes sooner...  interestingly, its vmas are heavily 
fragmented:

00007f97f1500000   2048K rw---    [ anon ]
00007f97f1800000   1024K rw---    [ anon ]
00007f97f1a00000   1024K rw---    [ anon ]
00007f97f1c00000   2048K rw---    [ anon ]
00007f97f1f00000   1024K rw---    [ anon ]
00007f97f2100000   1024K rw---    [ anon ]
00007f97f2300000   1024K rw---    [ anon ]
00007f97f2500000   1024K rw---    [ anon ]
00007f97f2700000   1024K rw---    [ anon ]
00007f97f2900000   1024K rw---    [ anon ]
00007f97f2b00000   2048K rw---    [ anon ]
00007f97f2e00000   2048K rw---    [ anon ]
00007f97f3100000   1024K rw---    [ anon ]
00007f97f3300000   1024K rw---    [ anon ]
00007f97f3500000   1024K rw---    [ anon ]
00007f97f3700000   1024K rw---    [ anon ]
00007f97f3900000   2048K rw---    [ anon ]
00007f97f3c00000   2048K rw---    [ anon ]
00007f97f3f00000   1024K rw---    [ anon ]

So hugetlb won't work out-of-the-box on firefox.

> Btw., there's another workload that could be cache sensitive, 'git grep':
>
>   aldebaran:~/linux>  perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 5 git grep arca > /dev/null
>
>   Performance counter stats for 'git grep arca' (5 runs):
>
>       1882712774  cycles                     ( +-   0.074% )
>       1153649442  instructions             #      0.613 IPC     ( +-   0.005% )
>        518815167  dTLB-loads                 ( +-   0.035% )
>          3028951  dTLB-load-misses           ( +-   1.223% )
>
>      0.597161428  seconds time elapsed   ( +-   0.065% )
>
> At first sight, with 7 cycles per cold TLB miss there's about 1.12% of a speedup
> potential in that workload. With just 1 cycle it's 0.16%. The real speedup
> ought to be somewhere in between.
>    

'git grep' is a pagecache workload, not anonymous memory, so it 
shouldn't see any improvement.  I imagine git will see a nice speedup if 
we get hugetlb for pagecache, at least for read-only workloads that 
don't hash all the time.

> Btw., instead of throwing random numbers like '3-4%' into this thread it would
> be nice if you could send 'perf stat --repeat' numbers like i did above - they
> have an error bar, they show the TLB details, they show the cycles and
> instructions proportion and they are also far more precise than 'time' based
> results.
>    

Sure.

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 11:19                                                 ` Avi Kivity
@ 2010-04-11 11:30                                                   ` Jason Garrett-Glaser
  2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
  1 sibling, 0 replies; 205+ messages in thread
From: Jason Garrett-Glaser @ 2010-04-11 11:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura

On Sun, Apr 11, 2010 at 4:19 AM, Avi Kivity <avi@redhat.com> wrote:
> On 04/11/2010 02:00 PM, Ingo Molnar wrote:
>>>>
>>>> Did you make sure to check your stddev on those?
>>>>
>>>
>>> I'm doing another run to look at variability.
>>>
>>
>> Sigh. Could you please stop using stone-age tools like /usr/bin/time and
>> instead use:
>>
>
> I did one more run for each setting and got the same results (within a
> second).
>
>> Yes, i know we had a big flamewar about perf kvm, but IMHO that is no
>> reason
>> for you to pretend that this tool doesn't exist ;-)
>>
>
> I use it almost daily, not sure why you think I pretend it doesn't exist.
>
>>> Is this something realistic or just a benchmark thing?
>>>
>>
>> I'd suggest for you to use the default settings, to make it realistic.
>> (Maybe
>> also 'advanced/high-quality' settings that an advanced user would
>> utilize.)
>>
>
> In fact I'm guessing --ultrafast would reduce the gain.  The lower the
> quality, the less time you spend looking at other frames to find
> commonality.  Like bzip2 -1/-9 memory footprint.

The main thing that controls how much obnoxious fetching of past
frames you're doing is --ref.  This is 3 by default, 1 at all the
faster settings, and goes as high as 16 on the very slow ones.  Do
also note that at very slow settings, the lookahead eats up a
phenomenal amount of memory and bandwidth due to its O(--bframes^2 *
--rc-lookahead) viterbi analysis.

Just for reference, since you're looking at practical applications,
here's approximate presets used by various companies I work with that
care a lot about performance and run Linux:

The Criterion Collection (encoding web versions of films, blu-ray
authoring): Veryslow
Zencoder (high-quality web transcoding service): Slow
Facebook (fast-turnaround web video): Medium
Avail Media (live, realtime HD television broadcast): Fast
Gaikai (interactive, ultra-low-latency, web video): Veryfast

Jason

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 11:24                                           ` Ingo Molnar
@ 2010-04-11 11:33                                             ` Avi Kivity
  2010-04-11 12:11                                               ` Ingo Molnar
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 11:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/11/2010 02:24 PM, Ingo Molnar wrote:
> * Andrea Arcangeli<aarcange@redhat.com>  wrote:
>
>    
>> So this takes more than 2 seconds away from 24 seconds reproducibly, and it
>> means gcc now runs 8% faster. [...]
>>      
> That's fantastic if systematic ... i'd give a limb for faster kbuild times in
> the >2% range.
>
> Would be nice to see a precise before/after 'perf stat' comparison:
>
>      perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 ...
>
> that way we can see that the instruction count is roughly the same
> before/after, the cycle count goes down and we can also see the reduction in
> dTLB misses (and other advantages, if any).
>
> Plus, here's a hugetlb usability feature request if you dont mind me
> suggesting it.
>
> This current usage (as root):
>
>      echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> is fine for testing but it would also be nice to give fine-grained per-workload 
> tunability to such details. It would be _very_ nice to have app-inheritable 
> hugetlb attributes plus have a 'hugetlb' tool in tools/hugetlb/, which would
> allow per-workload tuning of hugetlb use. For example:
>
>      hugetlb ctl --never ./my-workload.sh
>
> would disable hugetlb usage in my-workload.sh (and all sub-processes).
> Running:
>
>      hugetlb ctl --always ./my-workload.sh
>
> would enable it. [or something like that - maybe there are better naming schemes]
>    

I would like to see transparent hugetlb enabled by default for all 
workloads, and good enough so that users don't need to tweak it at all.  
May not happen for the initial merge, but certainly later.


-- 
error compiling committee.c: too many arguments to function

* hugepages will matter more in the future
  2010-04-11 11:19                                                 ` Avi Kivity
  2010-04-11 11:30                                                   ` Jason Garrett-Glaser
@ 2010-04-11 11:52                                                   ` Ingo Molnar
  2010-04-11 12:01                                                     ` Avi Kivity
                                                                       ` (2 more replies)
  1 sibling, 3 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 11:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jason Garrett-Glaser, Mike Galbraith, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven


* Avi Kivity <avi@redhat.com> wrote:

> > There is no doubt that benchmark advantages can be shown - the point of this 
> > exercise is to show that there are real-life speedups to various 
> > categories of non-server apps that hugetlb gives us.
> 
> I think hugetlb will mostly help server apps.  Desktop apps simply don't 
> have working sets big enough to matter.  There will be exceptions, but as a 
> rule, desktop apps won't benefit much from this.

Xorg, xterms and firefox all have rather huge RSSs on my boxes. (Even a 
phone these days easily has more than 512 MB RAM.) Andrea measured a 
multi-percent improvement in gcc performance. I think it's real.

Also note that IMO hugetlbs will matter _more_ in the future, even if CPU 
designers do a perfect job and CPU caches stay well-balanced to typical 
working sets: because RAM size is increasing somewhat faster than CPU cache 
size, due to the different physical constraints that CPUs face.

A quick back-of-the-envelope estimation: 20 years ago the high-end desktop had 
4MB of RAM and 64K of cache [a 1:64 proportion]; today it has 16 GB of RAM and 
8 MB of L2 cache on the CPU [a 1:2048 proportion].

App working sets track typical RAM sizes [it is their primary limit], not 
typical CPU cache sizes.

So while RAM size is exploding, CPU cache sizes cannot grow that fast and 
there's an increasing 'gap' between the pagetable size of higher-end 
RAM-filling workloads and CPU cache sizes - which gap the CPU itself cannot 
possibly close or mitigate in the future.

Also, the proportion of 4K:2MB is a fixed constant, and CPUs don't grow their 
TLB caches as fast as typical RAM size grows: they'll grow them according to the 
_mean_ working set size - while the 'max' working set gets larger and larger 
due to the increasing [proportional] gap to RAM size.

Put in a different way: this slow, gradual physical process causes data-cache 
misses to become 'colder and colder': in essence a portion of the worst-case 
TLB miss cost gets added to the average data-cache miss cost on more and more 
workloads. (Even without any nested-pagetables or other virtualization 
considerations.) The CPU can do nothing about this - even if it stays in a 
golden balance with typical workloads.
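
To put rough numbers on that (a toy model, figures purely illustrative):

    /* toy model: a fraction of data-cache misses also pays a TLB walk */
    #include <stdio.h>

    int main(void)
    {
            double dcache_miss = 100.0;  /* cycles: memory access on a miss */
            double tlb_walk    = 30.0;   /* cycles: page-table walk */
            double p_walk      = 0.25;   /* misses that also miss the TLB */

            /* 100 + 0.25 * 30 = 107.5 cycles per miss - and p_walk only
             * grows as working sets outpace TLB reach */
            printf("effective miss cost: %.1f cycles\n",
                   dcache_miss + p_walk * tlb_walk);
            return 0;
    }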

Hugetlbs were ridiculous 10 years ago, but are IMO real today. My prediction 
is that in 5-10 years we'll be thinking about 1GB pages for certain HPC apps 
and 2MB pages will be common on the desktop.

This is why i think we should think about hugetlb support today and this is 
why i think we should consider elevating hugetlbs to the next level of 
built-in Linux VM support.

	Ingo

* Re: hugepages will matter more in the future
  2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
@ 2010-04-11 12:01                                                     ` Avi Kivity
  2010-04-11 12:35                                                       ` Ingo Molnar
  2010-04-11 15:22                                                     ` Linus Torvalds
  2010-04-12 11:22                                                     ` Arjan van de Ven
  2 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 12:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Garrett-Glaser, Mike Galbraith, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven

On 04/11/2010 02:52 PM, Ingo Molnar wrote:
>
> Put in a different way: this slow, gradual phsyical process causes data-cache
> misses to become 'colder and colder': in essence a portion of the worst-case
> TLB miss cost gets added to the average data-cache miss cost on more and more
> workloads. (Even without any nested-pagetables or other virtualization
> considerations.) The CPU can do nothing about this - even if it stays in a
> golden balance with typical workloads.
>    

This is the essence, and it is why we really need transparent 
hugetlb.  Both the TLB and the caches are way too small to handle the 
millions of pages that are common now.

> This is why i think we should think about hugetlb support today and this is
> why i think we should consider elevating hugetlbs to the next level of
> built-in Linux VM support.
>    

Agreed, with s/today/yesterday/.

-- 
error compiling committee.c: too many arguments to function

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 11:30                                     ` Avi Kivity
@ 2010-04-11 12:08                                       ` Ingo Molnar
  2010-04-11 12:24                                         ` Avi Kivity
  2010-04-12  6:09                                         ` Nick Piggin
  0 siblings, 2 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 12:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> On 04/11/2010 01:46 PM, Ingo Molnar wrote:
> >
> >>There shouldn't be a slowdown as far as I can tell. [...]
> >It does not hurt to double check the before/after micro-cost precisely - it
> >would be nice to see a result of:
> >
> >   perf stat -e instructions --repeat 100 sort /etc/passwd > /dev/null
> >
> >with and without hugetlb.
> 
> With:
> 
>         1036752  instructions             #      0.000 IPC     ( +- 0.092% )
> 
> Without:
> 
>         1036844  instructions             #      0.000 IPC     ( +- 0.100% )
> 
> > Linus is right in that the patches are intrusive, and the answer to that 
> > isn't to insist that it isn't so (it evidently is so),
> 
> No one is insisting the patches aren't intrusive.  We're insisting they 
> bring a real benefit.  I think Linus' main objection was that hugetlb 
> wouldn't work due to fragmentation, and I think we've demonstrated that 
> antifrag/compaction do allow hugetlb to work even during a fragmenting 
> workload running in parallel.

As i understood it i think Linus had three main objections:

 1- the improvements were only shown in specialized environments
    (virtualization, servers)

 2- complexity

 3- futility: defrag is hard and theoretically impossible

1) numbers were too specialized

I think if some more numbers are gathered and if hugetlb/nohugetlb is made a 
bit more configurable (on a per-workload basis) then this concern is fairly 
addressed.

2) complexity

There's probably not much to be done about this. It's a cost/benefit tradeoff 
decision, i.e. depends on the other two factors.

3) futility

I think Andrea and Mel and you demonstrated that while defrag is futile in 
theory (we can always fill up all of RAM with dentries and there's no 2MB 
allocation possible), it seems rather usable in practice.

> > the correct reply is to broaden the utility of the patches and to 
> > demonstrate that the feature is useful on a much wider spectrum of 
> > workloads.
> 
> That's probably not the case.  I don't expect a significant improvement in 
> desktop experience.  The benefit will be for workloads with large working 
> sets and random access to memory.

See my previous mail about the 'RAM gap' - i think it matters more than you 
think.

The important thing to realize is that the working set of the 'desktop' is 
_not_ independent of RAM size: it just fills up RAM to the 'typical average 
RAM size'. That is around 2 GB today. In 5-10 years it will be at 16 GB.

Applications will just bloat up to that natural size. They'll use finer 
default resolutions, larger internal caches, etc. etc.

So IMO it all matters to the desktop too and is not just a server feature. We 
saw this again and again: today's server scalability limitation is tomorrow's 
desktop scalability limitation.

> Mine usually crashes sooner...  interestingly, its vmas are heavily
> fragmented:
> 
> 00007f97f1500000   2048K rw---    [ anon ]
> 00007f97f1800000   1024K rw---    [ anon ]
> 00007f97f1a00000   1024K rw---    [ anon ]
> 00007f97f1c00000   2048K rw---    [ anon ]
> 00007f97f1f00000   1024K rw---    [ anon ]
> 00007f97f2100000   1024K rw---    [ anon ]
> 00007f97f2300000   1024K rw---    [ anon ]
> 00007f97f2500000   1024K rw---    [ anon ]
> 00007f97f2700000   1024K rw---    [ anon ]
> 00007f97f2900000   1024K rw---    [ anon ]
> 00007f97f2b00000   2048K rw---    [ anon ]
> 00007f97f2e00000   2048K rw---    [ anon ]
> 00007f97f3100000   1024K rw---    [ anon ]
> 00007f97f3300000   1024K rw---    [ anon ]
> 00007f97f3500000   1024K rw---    [ anon ]
> 00007f97f3700000   1024K rw---    [ anon ]
> 00007f97f3900000   2048K rw---    [ anon ]
> 00007f97f3c00000   2048K rw---    [ anon ]
> 00007f97f3f00000   1024K rw---    [ anon ]
> 
> So hugetlb won't work out-of-the-box on firefox.

Hm, seems to have 1MB holes between them.

Half of them are 2MB in size, but half of them are not properly aligned. So 
about 33% of firefox's anon memory is hugepage-able straight away - still 
nonzero.

(Plus maybe if this comes from glibc then it could be handled by patching 
glibc.)
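
(Such a glibc tweak is mostly about alignment. A sketch - not actual glibc
code - of handing back 2MB-aligned chunks by over-allocating and trimming:)

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20)  /* 2MB */

    /* return a 2MB-aligned anonymous chunk so a huge pmd can map it */
    static void *alloc_hugepage_aligned(size_t len)
    {
            size_t maplen = len + HPAGE_SIZE;
            char *p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            uintptr_t start, end;

            if (p == MAP_FAILED)
                    return NULL;
            start = ((uintptr_t)p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
            end   = (uintptr_t)p + maplen;
            if (start > (uintptr_t)p)       /* trim the unaligned head */
                    munmap(p, start - (uintptr_t)p);
            if (end > start + len)          /* trim the tail */
                    munmap((void *)(start + len), end - (start + len));
            return (void *)start;
    }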

> 'git grep' is a pagecache workload, not anonymous memory, so it shouldn't 
> see any improvement. [...]

Indeed, git grep is read() based.

> [...]  I imagine git will see a nice speedup if we get hugetlb for 
> pagecache, at least for read-only workloads that don't hash all the time.

Shouldn't that already be the case today? The pagecache is in the kernel where 
we have things 2MB mapped. Git read()s it into the same [small] buffer again 
and again, so the only 'wide' address space access it does is within the 
kernel, to the 2MB mapped pagecache pages.

	Ingo

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 11:33                                             ` Avi Kivity
@ 2010-04-11 12:11                                               ` Ingo Molnar
  0 siblings, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 12:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> I would like to see transparent hugetlb enabled by default for all 
> workloads, and good enough so that users don't need to tweak it at all.  May 
> not happen for the initial merge, but certainly later.

Definitely agreed with that - the feature doesn't make sense without that kind 
of automatic default. Either it _can_ handle being the default just fine and 
give us advantages on a broad basis, or if not then it's not worth merging.

Nevertheless, allowing an opt-out on a fine-grained basis would still be nice in 
general. The default is powerful enough of a force in itself - a fine-grained 
opt-out does not hurt that advantage, it only improves utility.

Thanks,

	Ingo

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 12:08                                       ` Ingo Molnar
@ 2010-04-11 12:24                                         ` Avi Kivity
  2010-04-11 12:46                                           ` Ingo Molnar
  2010-04-12  6:09                                         ` Nick Piggin
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 12:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/11/2010 03:08 PM, Ingo Molnar wrote:
>
>> No one is insisting the patches aren't intrusive.  We're insisting they
>> bring a real benefit.  I think Linus' main objection was that hugetlb
>> wouldn't work due to fragmentation, and I think we've demonstrated that
>> antifrag/compaction do allow hugetlb to work even during a fragmenting
>> workload running in parallel.
>>      
> As i understood it i think Linus had three main objections:
>
>   1- the improvements were only shown in specialized environments
>      (virtualization, servers)
>    

Servers are not specialized workloads, and neither is virtualization.  
If we had to justify everything based on the desktop experience we'd 
have no 4096-core support, no Fibre Channel or 10GbE drivers, no zillion 
architectures, etc.

>   2- complexity
>    

No arguing with that.

> The important thing to realize is that the working set of the 'desktop' is
> _not_ independent of RAM size: it just fills up RAM to the 'typical average
> RAM size'. That is around 2 GB today. In 5-10 years it will be at 16 GB.
>
> Applications will just bloat up to that natural size. They'll use finer
> default resolutions, larger internal caches, etc. etc.
>    

Well, if this happens we'll be ready.

>> 'git grep' is a pagecache workload, not anonymous memory, so it shouldn't
>> see any improvement. [...]
>>      
> Indeed, git grep is read() based.
>    

Right.

>> [...]  I imagine git will see a nice speedup if we get hugetlb for
>> pagecache, at least for read-only workloads that don't hash all the time.
>>      
> Shouldn't that already be the case today? The pagecache is in the kernel where
> we have things 2MB mapped. Git read()s it into the same [small] buffer again
> and again, so the only 'wide' address space access it does is within the
> kernel, to the 2MB mapped pagecache pages.
>    

If you 'git grep pattern $commit' instead, you'll be reading out of 
mmap()ed git packs.  Much of git memory access goes through that.  To 
get the benefit of hugetlb there, we'd need to run khugepaged on 
pagecache, and align file vmas on 2MB boundaries.

We'll also get executables and shared objects mapped via large pages 
this way, the ELF ABI is already set up to align sections on 2MB boundaries.

-- 
error compiling committee.c: too many arguments to function

* Re: hugepages will matter more in the future
  2010-04-11 12:01                                                     ` Avi Kivity
@ 2010-04-11 12:35                                                       ` Ingo Molnar
  0 siblings, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 12:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jason Garrett-Glaser, Mike Galbraith, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven


* Avi Kivity <avi@redhat.com> wrote:

> On 04/11/2010 02:52 PM, Ingo Molnar wrote:
> >
> > Put in a different way: this slow, gradual physical process causes 
> > data-cache misses to become 'colder and colder': in essence a portion of 
> > the worst-case TLB miss cost gets added to the average data-cache miss 
> > cost on more and more workloads. (Even without any nested-pagetables or 
> > other virtualization considerations.) The CPU can do nothing about this - 
> > even if it stays in a golden balance with typical workloads.
> 
> This is the essence and which is why we really need transparent hugetlb.  
> Both the tlb and the caches are way to small to handle the millions of pages 
> that are common now.
>
> > This is why i think we should think about hugetlb support today and this 
> > is why i think we should consider elevating hugetlbs to the next level of 
> > built-in Linux VM support.
> 
> Agreed, with s/today/yesterday/.

Well, yes - with the caveat that i think yesterday's hugetlb patches were 
nowhere close to being mergeable (and were nowhere close to addressing the 
problems to begin with).

Andrea's patches are IMHO a game changer because they are the first thing that 
has the chance to improve a large category of workloads.

We saw that the 10-year-old hugetlbfs and libhugetlbfs experiments alone 
helped very little: a Linux-only opt-in performance feature that takes effort 
[and admin-space configuration ...] on the app side will almost never be taken 
advantage of widely enough to make a visible difference to the end result - it 
simply doesn't scale as a development and deployment model.

The most important thing the past 10 years of kernel development have taught 
us is that transparent, always-available, zero-app-effort kernel features are 
king. The rest barely exists.

	Ingo

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 12:24                                         ` Avi Kivity
@ 2010-04-11 12:46                                           ` Ingo Molnar
  0 siblings, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-11 12:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mike Galbraith, Jason Garrett-Glaser, Andrea Arcangeli,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> On 04/11/2010 03:08 PM, Ingo Molnar wrote:
> >
> >> No one is insisting the patches aren't intrusive.  We're insisting they 
> >> bring a real benefit.  I think Linus' main objection was that hugetlb 
> >> wouldn't work due to fragmentation, and I think we've demonstrated that 
> >> antifrag/compaction do allow hugetlb to work even during a fragmenting 
> >> workload running in parallel.
> >
> > As i understood it i think Linus had three main objections:
> >
> >  1- the improvements were only shown in specialized environments
> >     (virtualization, servers)
> 
> Servers are not specialized workloads, and neither is virtualization. [...]

As far as kernel development goes they are. ( In fact in the past few years 
virtualization has grown the nasty habit of sometimes _hindering_ upstream 
kernel development ... I hope that will change. )

> > Applications will just bloat up to that natural size. They'll use finer 
> > default resolutions, larger internal caches, etc. etc.
> 
> Well, if this happens we'll be ready.

That's what happened in the past 20 years, and i can see no signs of that 
process stopping anytime soon.

[ Note, 'apps bloat up to natural RAM size' is a heavy simplification with a
  somewhat derogatory undertone: in reality what happens is that apps just
  grow along what are basically random vectors, and if a vector hits across
  the RAM limit [causing a visible slowdown due to bloat] there is a 
  _pushback_ from developers/testers/users.

  The end result is that app working sets are clipped to somewhat below the 
  typical desktop RAM size, but rarely are they debloated to much below that 
  practical average threshold. So in essence 'apps fill up available RAM'. ]

Just like car traffic 'fills up' available road capacity. If there's enough 
road capacity [and fuel prices are not too high] then families (and 
businesses) will have second and third cars and won't bother optimizing their 
driving patterns.

	Ingo

* Re: hugepages will matter more in the future
  2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
  2010-04-11 12:01                                                     ` Avi Kivity
@ 2010-04-11 15:22                                                     ` Linus Torvalds
  2010-04-11 15:43                                                       ` Avi Kivity
  2010-04-11 19:40                                                       ` Andrea Arcangeli
  2010-04-12 11:22                                                     ` Arjan van de Ven
  2 siblings, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-11 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven



On Sun, 11 Apr 2010, Ingo Molnar wrote:
> 
> Both Xorg, xterms and firefox have rather huge RSS's on my boxes. (Even a 
> phone these days easily has more than 512 MB RAM.) Andrea measured 
> multi-percent improvement in gcc performance. I think it's real.

Reality check: he got multiple percent with 

 - one huge badly written file being compiled that took 22s because it's 
   such a horrible monster.

 - magic libc malloc flags that are totally and utterly unrealistic in 
   anything but a benchmark

 - by basically keeping one CPU totally busy doing defragmentation.

Quite frankly, that kind of "performance analysis" makes me _less_ 
interested rather than more. Because all it shows is that you're willing 
to do anything at all to get better numbers, regardless of whether it is 
_realistic_ or not.

Seriously, guys.  Get a grip. If you start talking about special malloc 
algorithms, you have ALREADY LOST. Google for memory fragmentation with 
various malloc implementations in multi-threaded applications. Thinking 
that you can just allocate in 2MB chunks is so _fundamentally_ broken that 
this whole thread should have been laughed out of the room.

Instead, you guys egg each other on.

Stop the f*cking circle-jerk already.

		Linus

* Re: hugepages will matter more in the future
  2010-04-11 15:22                                                     ` Linus Torvalds
@ 2010-04-11 15:43                                                       ` Avi Kivity
  2010-04-11 15:52                                                         ` Linus Torvalds
  2010-04-11 19:40                                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven

On 04/11/2010 06:22 PM, Linus Torvalds wrote:
>
> On Sun, 11 Apr 2010, Ingo Molnar wrote:
>    
>> Both Xorg, xterms and firefox have rather huge RSS's on my boxes. (Even a
>> phone these days easily has more than 512 MB RAM.) Andrea measured
>> multi-percent improvement in gcc performance. I think it's real.
>>      
> Reality check: he got multiple percent with
>
>   - one huge badly written file being compiled that took 22s because it's
>     such a horrible monster.
>    

Not everything is a kernel build.  Template heavy C++ code will also 
allocate tons of memory.  gcc -flto will also want lots of memory.

>   - magic libc malloc flags that are totally and utterly unrealistic in
>     anything but a benchmark
>    

Having glibc allocate in chunks of 2MB instead of 1MB is not 
unrealistic.  I agree about MMAP_THRESHOLD.
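
(For reference, the 'malloc flags' in question are just the standard glibc
mallopt() knobs. The values below are the kind of thing a benchmark might
pick, not a general recommendation:)

    #include <malloc.h>

    int main(void)
    {
            /* serve requests of 2MB and up via mmap() ... */
            mallopt(M_MMAP_THRESHOLD, 2 * 1024 * 1024);
            /* ... and pad brk heap growth by 2MB per sbrk call */
            mallopt(M_TOP_PAD, 2 * 1024 * 1024);
            /* allocation-heavy workload goes here */
            return 0;
    }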

>   - by basically keeping one CPU totally busy doing defragmentation.
>    

I never saw khugepaged take any significant amount of cpu.

> Quite frankly, that kind of "performance analysis" makes me _less_
> interested rather than more. Because all it shows is that you're willing
> to do anything at all to get better numbers, regardless of whether it is
> _realistic_ or not.
>
> Seriously, guys.  Get a grip. If you start talking about special malloc
> algorithms, you have ALREADY LOST. Google for memory fragmentation with
> various malloc implementations in multi-threaded applications. Thinking
> that you can just allocate in 2MB chunks is so _fundamnetally_ broken that
> this whole thread should have been laughed out of the room.
>    

And yet Oracle and java have options to use large pages, and we know 
google and HPC like 'em.  Maybe they just haven't noticed the 
fundamental brokenness yet.

-- 
error compiling committee.c: too many arguments to function

* Re: hugepages will matter more in the future
  2010-04-11 15:43                                                       ` Avi Kivity
@ 2010-04-11 15:52                                                         ` Linus Torvalds
  2010-04-11 16:04                                                           ` Avi Kivity
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-11 15:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven



On Sun, 11 Apr 2010, Avi Kivity wrote:
> 
> And yet Oracle and java have options to use large pages, and we know google
> and HPC like 'em.  Maybe they just haven't noticed the fundamental brokenness
> yet.

The thing is, what you are advocating is what traditional UNIX did. 
Prioritizing the special cases rather than the generic workloads.

And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead 
exactly _because_ it prioritized those kinds of loads.

I'm perfectly happy to take specialized workloads into account, but it 
needs to help the _normal_ case too. Somebody mentioned 4k CPU support as 
an example, and that's a good example. The only reason we support 4k CPU's 
is that the code was made clean enough to work with them and actually 
help clean up the SMP code in general.

I've also seen Andrea talk about how it's all rock solid. We _know_ that 
is wrong, because the anon_vma bug is not solved. That bug apparently 
happens under low-memory situations, so clearly nobody has really stressed 
the low-memory case.

So here's the deal: make the code cleaner, and it's fine. And stop trying 
to sell it with _crap_.

			Linus

* Re: hugepages will matter more in the future
  2010-04-11 15:52                                                         ` Linus Torvalds
@ 2010-04-11 16:04                                                           ` Avi Kivity
  2010-04-12  7:45                                                             ` Ingo Molnar
  2010-04-11 19:35                                                           ` Andrea Arcangeli
  2010-04-12 16:20                                                           ` Rik van Riel
  2 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-11 16:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven

On 04/11/2010 06:52 PM, Linus Torvalds wrote:
>
> On Sun, 11 Apr 2010, Avi Kivity wrote:
>    
>> And yet Oracle and java have options to use large pages, and we know google
>> and HPC like 'em.  Maybe they just haven't noticed the fundamental brokenness
>> yet.
>>      
> The thing is, what you are advocating is what traditional UNIX did.
> Prioritizing the special cases rather than the generic workloads.
>
> And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead
> exactly _because_ it prioritized those kinds of loads.
>    

This is not a specialized workload.  Plenty of sites are running java, 
plenty of sites are running Oracle (though that won't benefit from 
anonymous hugepages), and plenty of sites are running virtualization.  
Not everyone does two kernel builds before breakfast.

> I'm perfectly happy to take specialized workloads into account, but it
> needs to help the _normal_ case too. Somebody mentioned 4k CPU support as
> an example, and that's a good example. The only reason we support 4k CPUs
> is that the code was made clean enough to work with them and it actually
> helped clean up the SMP code in general.
>
> I've also seen Andrea talk about how it's all rock solid. We _know_ that
> is wrong, because the anon_vma bug is not solved. That bug apparently
> happens under low-memory situations, so clearly nobody has really stressed
> the low-memory case.
>    

Well, nothing is rock solid until it's had a few months in the hands of 
users.

> So here's the deal: make the code cleaner, and it's fine. And stop trying
> to sell it with _crap_.
>    

That's perfectly reasonable.

-- 
error compiling committee.c: too many arguments to function

* Re: hugepages will matter more in the future
  2010-04-11 15:52                                                         ` Linus Torvalds
  2010-04-11 16:04                                                           ` Avi Kivity
@ 2010-04-11 19:35                                                           ` Andrea Arcangeli
  2010-04-12 16:20                                                           ` Rik van Riel
  2 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-11 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Arjan van de Ven

On Sun, Apr 11, 2010 at 08:52:10AM -0700, Linus Torvalds wrote:
> is wrong, because the anon_vma bug is not solved. That bug apparently 

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=2acc64e8da017045039f30b926efac1f5c4bd82a

* Re: hugepages will matter more in the future
  2010-04-11 15:22                                                     ` Linus Torvalds
  2010-04-11 15:43                                                       ` Avi Kivity
@ 2010-04-11 19:40                                                       ` Andrea Arcangeli
  2010-04-12 15:41                                                         ` Linus Torvalds
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-11 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Avi Kivity, Jason Garrett-Glaser, Mike Galbraith,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Arjan van de Ven

On Sun, Apr 11, 2010 at 08:22:04AM -0700, Linus Torvalds wrote:
>  - magic libc malloc flags tghat are totally and utterly unrealistic in 
>    anything but a benchmark
> 
>  - by basically keeping one CPU totally busy doing defragmentation.

This is a red herring. This is the last thing we want, and we'd run
even faster if we could make current glibc binaries cooperate. But
this is a new feature and it'll require changing glibc slightly.

Future glibc will be optimal and it won't require khugepaged, don't
worry.

I got crashes in split_huge_page (page_mapcount not matching the number
of huge pmds mapping the page) because of the anon-vma bug, so I had to
back it out; this is why it's stable now.

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11 12:08                                       ` Ingo Molnar
  2010-04-11 12:24                                         ` Avi Kivity
@ 2010-04-12  6:09                                         ` Nick Piggin
  2010-04-12  6:18                                           ` Pekka Enberg
                                                             ` (3 more replies)
  1 sibling, 4 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  6:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> 3) futility
> 
> I think Andrea and Mel and you demonstrated that while defrag is futile in 
> theory (we can always fill up all of RAM with dentries and there's no 2MB 
> allocation possible), it seems rather usable in practice.

One problem is that you need to keep a lot more memory free in order
for it to be reasonably effective. Another thing is that the problem
of fragmentation breakdown is not just a one-shot event that fills
memory with pinned objects. It is a slow degradation.

Especially when you use something like SLUB as the memory allocator
which requires higher order allocations for objects which are pinned
in kernel memory.

Just running a few minutes of testing with a kernel compile in the
background does not show the full picture. You really need a box that
has been up for days running a proper workload before you are likely
to see any breakdown.

I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
get slower after X days of uptime. It's better to have consistent
performance really, for anything except pure benchmark setups.

Defrag is not futile in theory: you just have to either have a reserve
of movable pages (and never allow pinned kernel pages in there), or
allocate pinned kernel memory in units of the chunk-size goal (which
just gives you different types of fragmentation problems), or do
non-linear kernel mappings so you can defrag pinned kernel memory
(with *lots* of other problems of course). So you just have a lot of
downsides.
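
(The first option is roughly what the anti-frag machinery approximates with
its migratetypes. A stripped-down sketch of the invariant - names are
illustrative, this is not the mainline code:)

    /* pageblocks are typed; pinned (unmovable) allocations must never
     * spill into a movable reserve, so movable blocks can always be
     * reclaimed or compacted back into a contiguous 2MB chunk */
    enum block_type { BLOCK_MOVABLE, BLOCK_PINNED };

    struct pageblock {
            enum block_type type;
            unsigned long nr_free;
    };

    static int can_allocate_from(const struct pageblock *b,
                                 enum block_type want)
    {
            if (want == BLOCK_PINNED)
                    return b->type == BLOCK_PINNED; /* keep movable blocks clean */
            return 1;       /* movable data can go anywhere */
    }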


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:09                                         ` Nick Piggin
@ 2010-04-12  6:18                                           ` Pekka Enberg
  2010-04-12  6:48                                             ` Nick Piggin
  2010-04-12 14:29                                             ` Christoph Lameter
  2010-04-12  6:36                                           ` Avi Kivity
                                                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 205+ messages in thread
From: Pekka Enberg @ 2010-04-12  6:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 9:09 AM, Nick Piggin <npiggin@suse.de> wrote:
>> I think Andrea and Mel and you demonstrated that while defrag is futile in
>> theory (we can always fill up all of RAM with dentries and there's no 2MB
>> allocation possible), it seems rather usable in practice.
>
> One problem is that you need to keep a lot more memory free in order
> for it to be reasonably effective. Another thing is that the problem
> of fragmentation breakdown is not just a one-shot event that fills
> memory with pinned objects. It is a slow degredation.
>
> Especially when you use something like SLUB as the memory allocator
> which requires higher order allocations for objects which are pinned
> in kernel memory.

I guess we'd need to merge the SLUB defragmentation patches to fix that?

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:09                                         ` Nick Piggin
  2010-04-12  6:18                                           ` Pekka Enberg
@ 2010-04-12  6:36                                           ` Avi Kivity
  2010-04-12  6:55                                             ` Ingo Molnar
                                                               ` (2 more replies)
  2010-04-12  6:49                                           ` Ingo Molnar
  2010-04-12  7:08                                           ` Andrea Arcangeli
  3 siblings, 3 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12  6:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 09:09 AM, Nick Piggin wrote:
> On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote:
>    
>> * Avi Kivity<avi@redhat.com>  wrote:
>>
>> 3) futility
>>
>> I think Andrea and Mel and you demonstrated that while defrag is futile in
>> theory (we can always fill up all of RAM with dentries and there's no 2MB
>> allocation possible), it seems rather usable in practice.
>>      
> One problem is that you need to keep a lot more memory free in order
> for it to be reasonably effective.

It's the usual space-time tradeoff.  You don't want to do it on a 
netbook, but it's worth it on a 16GB server, which is already not very 
high end.

> Another thing is that the problem
> of fragmentation breakdown is not just a one-shot event that fills
> memory with pinned objects. It is a slow degradation.
>
> Especially when you use something like SLUB as the memory allocator
> which requires higher order allocations for objects which are pinned
> in kernel memory.
>    

Won't the usual antifrag tactics apply?  Try to allocate those objects 
from the same block.

> Just running a few minutes of testing with a kernel compile in the
> background does not show the full picture. You really need a box that
> has been up for days running a proper workload before you are likely
> to see any breakdown.
>    

I'm sure we'll be able to generate worst-case scenarios.  I'm also 
reasonably sure we'll be able to deal with them.  I hope we won't need 
to, but it's even possible to move dentries around.

> I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> get slower after X days of uptime. It's better to have consistent
> performance really, for anything except pure benchmark setups.
>    

If that were the case we'd disable caches everywhere.  General purpose 
computing is a best effort thing, we try to be fast on the common case 
but we'll be slow on the uncommon case.  Access to a bit of memory can 
take 3 ns if it's in cache, 100 ns if not, and 3 ms if it's on disk.

Here, the uncommon case will be really uncommon, most applications (that 
can benefit from large pages) I'm aware of don't switch from large 
anonymous working sets to a dcache load of many tiny files.  They tend 
to keep doing the same thing over and over again.

I'm not saying we don't need to adapt to changing conditions (we do, 
especially for kvm, that's what khugepaged is for), but as long as we 
have a graceful fallback, we don't need to worry too much about failure 
in extreme conditions.

> Defrag is not futile in theory, you just have to either have a reserve
> of movable pages (and never allow pinned kernel pages in there), or
> you need to allocate pinned kernel memory in units of the chunk size
> goal (which just gives you different types of fragmentation problems)
> or you need to do non-linear kernel mappings so you can defrag pinned
> kernel memory (with *lots* of other problems of course). So you just
> have a lot of downsides.
>    

Non-linear kernel mapping moves the small page problem from userspace 
back to the kernel, a really unhappy solution.

Very large (object count, not object size) kernel caches can be 
addressed by compacting them, but I hope we won't need to do that.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:18                                           ` Pekka Enberg
@ 2010-04-12  6:48                                             ` Nick Piggin
  2010-04-12 14:29                                             ` Christoph Lameter
  1 sibling, 0 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  6:48 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 09:18:56AM +0300, Pekka Enberg wrote:
> On Mon, Apr 12, 2010 at 9:09 AM, Nick Piggin <npiggin@suse.de> wrote:
> >> I think Andrea and Mel and you demonstrated that while defrag is futile in
> >> theory (we can always fill up all of RAM with dentries and there's no 2MB
> >> allocation possible), it seems rather usable in practice.
> >
> > One problem is that you need to keep a lot more memory free in order
> > for it to be reasonably effective. Another thing is that the problem
> > of fragmentation breakdown is not just a one-shot event that fills
> > memory with pinned objects. It is a slow degradation.
> >
> > Especially when you use something like SLUB as the memory allocator
> > which requires higher order allocations for objects which are pinned
> > in kernel memory.
> 
> I guess we'd need to merge the SLUB defragmentation patches to fix that?

No that's a different problem. And SLUB 'defragmentation' isn't really
defragmentation, it is just selective reclaim.

Reclaimable slab memory allocations are not the problem. The problem is
the allocations that you can't reclaim. It goes like this:

- Memory gets fragmented by allocation of pinned pages within larger
  ranges so that we cannot allocate that large range.

- Anti-frag improves this by putting pinned pages in some ranges and
  unpinned pages in others. So the ranges of unpinned pages can be
  reclaimed to recover a larger contiguous range.

- However there is still an underlying problem of pinned pages causing
  fragmentation within their ranges.

- If you require higher order allocations for pinned pages especially,
  then you will end up with your pinned ranges becoming fragmented and
  unable to satisfy the higher order allocation. So you must expand your
  pinned ranges into unpinned.

If you only do 4K slab allocations, then things get better; however, it
can of course still break down if the pinned allocation requirement
grows large. It's really hard to control this because it includes
anything from open files to radix tree nodes to page tables and anything
that any driver or subsystem allocates with kmalloc.
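
To put rough numbers on that (an illustrative back-of-the-envelope
program, not code from any patch in this series): one pinned 4K page
per 2MB block is enough, so a tiny amount of adversarially placed
pinned memory can defeat every hugepage-sized allocation:

#include <stdio.h>

int main(void)
{
	unsigned long ram    = 4UL << 30;   /* a 4GB box */
	unsigned long block  = 2UL << 20;   /* 2MB pageblock */
	unsigned long page   = 4UL << 10;   /* 4K base page */
	unsigned long blocks = ram / block; /* 2048 pageblocks */

	/* one pinned base page per block ruins every 2MB allocation */
	printf("%lu pinned pages = %lu MB out of %lu MB\n",
	       blocks, blocks * page >> 20, ram >> 20);
	return 0;
}

(8MB of pinned memory, placed badly enough, is all it takes on 4GB.)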

Basically, if you were going to add another level of indirection to
solve that, you may as well just go ahead and do nonlinear mappings of
the kernel memory with page tables, so you'd only have to fix up places
that require translated addresses rather than everything that touches
KVA. This would still be a big headache.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:09                                         ` Nick Piggin
  2010-04-12  6:18                                           ` Pekka Enberg
  2010-04-12  6:36                                           ` Avi Kivity
@ 2010-04-12  6:49                                           ` Ingo Molnar
  2010-04-12  7:35                                             ` Andrea Arcangeli
  2010-04-12  7:08                                           ` Andrea Arcangeli
  3 siblings, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  6:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura


* Nick Piggin <npiggin@suse.de> wrote:

> [...]
> 
> Just running a few minutes of testing with a kernel compile in the 
> background does not show the full picture. You really need a box that has 
> been up for days running a proper workload before you are likely to see any 
> breakdown.

AFAIK that's what Andrea has done as a test - but yes, i agree that 
fragmentation is the main design worry.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:36                                           ` Avi Kivity
@ 2010-04-12  6:55                                             ` Ingo Molnar
  2010-04-12  7:15                                             ` Nick Piggin
  2010-04-12  7:18                                             ` Andrea Arcangeli
  2 siblings, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  6:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> > Defrag is not futile in theory, you just have to either have a reserve of 
> > movable pages (and never allow pinned kernel pages in there), or you need 
> > to allocate pinned kernel memory in units of the chunk size goal (which 
> > just gives you different types of fragmentation problems) or you need to 
> > do non-linear kernel mappings so you can defrag pinned kernel memory (with 
> > *lots* of other problems of course). So you just have a lot of downsides.
> 
> Non-linear kernel mapping moves the small page problem from userspace back 
> to the kernel, a really unhappy solution.

Note that in a theoretical sense a specific variant of non-linear kernel 
mappings is already implemented and productized today: it's called 
virtualization.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:09                                         ` Nick Piggin
                                                             ` (2 preceding siblings ...)
  2010-04-12  6:49                                           ` Ingo Molnar
@ 2010-04-12  7:08                                           ` Andrea Arcangeli
  2010-04-12  7:21                                             ` Nick Piggin
  3 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  7:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote:
> One problem is that you need to keep a lot more memory free in order
> for it to be reasonably effective. Another thing is that the problem
> of fragmentation breakdown is not just a one-shot event that fills
> memory with pinned objects. It is a slow degradation.

set_recommended_min_free_kbytes doesn't seem to scale with RAM size;
60MB isn't such a big deal.
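
For reference, a standalone sketch of where a number in that ballpark
comes from (my arithmetic, loosely modelled on the patchset's
late_initcall logic; the zone count and pageblock size are assumptions
for a typical x86-64 box):

#include <stdio.h>

int main(void)
{
	unsigned long pageblock_kb = 2048; /* one 2MB pageblock, in KB */
	int nr_zones = 3;                  /* e.g. DMA, DMA32, Normal */
	int pcptypes = 3;                  /* unmovable, reclaimable, movable */
	unsigned long min_kb;

	/* keep at least two hugepage-sized blocks free per zone... */
	min_kb = pageblock_kb * nr_zones * 2;
	/* ...plus headroom so each migratetype can fall back cleanly */
	min_kb += pageblock_kb * nr_zones * pcptypes * pcptypes;

	printf("recommended min_free_kbytes: %lu (~%lu MB)\n",
	       min_kb, min_kb >> 10); /* ~66MB with these assumptions */
	return 0;
}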

> Especially when you use something like SLUB as the memory allocator
> which requires higher order allocations for objects which are pinned
> in kernel memory.
> 
> Just running a few minutes of testing with a kernel compile in the
> background does not show the full picture. You really need a box that
> has been up for days running a proper workload before you are likely
> to see any breakdown.
> 
> I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> get slower after X days of uptime. It's better to have consistent
> performance really, for anything except pure benchmark setups.

All data I provided is very real: in addition to building a ton of
packages and running emerge on /usr/portage I've been running all my
real loads. The only problem is that I ran it for just a day and a
half, but the load I kept it under was significant (surely a much
bigger inode/dentry load than any hypervisor usage would ever
generate).

> Defrag is not futile in theory, you just have to either have a reserve
> of movable pages (and never allow pinned kernel pages in there), or
> you need to allocate pinned kernel memory in units of the chunk size
> goal (which just gives you different types of fragmentation problems)
> or you need to do non-linear kernel mappings so you can defrag pinned
> kernel memory (with *lots* of other problems of course). So you just
> have a lot of downsides.

That's what the kernelcore= option does, no? Isn't that a good enough
math guarantee? Probably we should use it in hypervisor products just
in case, to be math-guaranteed to never have to fall back on VM
migration as the last-resort (but definitive) defrag algorithm.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:36                                           ` Avi Kivity
  2010-04-12  6:55                                             ` Ingo Molnar
@ 2010-04-12  7:15                                             ` Nick Piggin
  2010-04-12  7:45                                               ` Avi Kivity
  2010-04-12  7:51                                               ` Ingo Molnar
  2010-04-12  7:18                                             ` Andrea Arcangeli
  2 siblings, 2 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  7:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 09:36:23AM +0300, Avi Kivity wrote:
> On 04/12/2010 09:09 AM, Nick Piggin wrote:
> >On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote:
> >>* Avi Kivity<avi@redhat.com>  wrote:
> >>
> >>3) futility
> >>
> >>I think Andrea and Mel and you demonstrated that while defrag is futile in
> >>theory (we can always fill up all of RAM with dentries and there's no 2MB
> >>allocation possible), it seems rather usable in practice.
> >One problem is that you need to keep a lot more memory free in order
> >for it to be reasonably effective.
> 
> It's the usual space-time tradeoff.  You don't want to do it on a
> netbook, but it's worth it on a 16GB server, which is already not
> very high end.

Possibly.

 
> >Another thing is that the problem
> >of fragmentation breakdown is not just a one-shot event that fills
> >memory with pinned objects. It is a slow degradation.
> >
> >Especially when you use something like SLUB as the memory allocator
> >which requires higher order allocations for objects which are pinned
> >in kernel memory.
> 
> Won't the usual antifrag tactics apply?  Try to allocate those
> objects from the same block.

"try" is the key point.

 
> >Just running a few minutes of testing with a kernel compile in the
> >background does not show the full picture. You really need a box that
> >has been up for days running a proper workload before you are likely
> >to see any breakdown.
> 
> I'm sure we'll be able to generate worst-case scenarios.  I'm also
> reasonably sure we'll be able to deal with them.  I hope we won't
> need to, but it's even possible to move dentries around.

Pinned dentries? (which are the problem) That would be insane.

 
> >I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> >get slower after X days of uptime. It's better to have consistent
> >performance really, for anything except pure benchmark setups.
> 
> If that were the case we'd disable caches everywhere.  General

No we wouldn't. You can have consistent, predictable performance with
caches.

> purpose computing is a best effort thing, we try to be fast on the
> common case but we'll be slow on the uncommon case.  Access to a bit

Sure. And the common case for production systems like VM or database
servers that are up for hundreds of days is when they are running with
a lot of uptime. Common case is not a fresh reboot into a 3 hour
benchmark setup.


> of memory can take 3 ns if it's in cache, 100 ns if not, and 3 ms if
> it's on disk.
> 
> Here, the uncommon case will be really uncommon, most applications
> (that can benefit from large pages) I'm aware of don't switch from
> large anonymous working sets to a dcache load of many tiny files.
> They tend to keep doing the same thing over and over again.
> 
> I'm not saying we don't need to adapt to changing conditions (we do,
> especially for kvm, that's what khugepaged is for), but as long as
> we have a graceful fallback, we don't need to worry too much about
> failure in extreme conditions.
> 
> >Defrag is not futile in theory, you just have to either have a reserve
> >of movable pages (and never allow pinned kernel pages in there), or
> >you need to allocate pinned kernel memory in units of the chunk size
> >goal (which just gives you different types of fragmentation problems)
> >or you need to do non-linear kernel mappings so you can defrag pinned
> >kernel memory (with *lots* of other problems of course). So you just
> >have a lot of downsides.
> 
> Non-linear kernel mapping moves the small page problem from
> userspace back to the kernel, a really unhappy solution.

Not unhappy for userspace intensive workloads. And user working sets
I'm sure are growing faster than kernel working set. Also there would
be nothing against compacting and merging kernel memory into larger
pages.


> Very large (object count, not object size) kernel caches can be
> addressed by compacting them, but I hope we won't need to do that.

You can't say that fragmentation is not a fundamental problem.  And
adding things like indirect pointers or other weird crap that adds
complexity to code dealing with KVA is, IMO, not acceptable. So you
can't just assert that you can "address" the problem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:36                                           ` Avi Kivity
  2010-04-12  6:55                                             ` Ingo Molnar
  2010-04-12  7:15                                             ` Nick Piggin
@ 2010-04-12  7:18                                             ` Andrea Arcangeli
  2 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  7:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 09:36:23AM +0300, Avi Kivity wrote:
> On 04/12/2010 09:09 AM, Nick Piggin wrote:
> > On Sun, Apr 11, 2010 at 02:08:00PM +0200, Ingo Molnar wrote:
> >    
> >> * Avi Kivity<avi@redhat.com>  wrote:
> >>
> >> 3) futility
> >>
> >> I think Andrea and Mel and you demonstrated that while defrag is futile in
> >> theory (we can always fill up all of RAM with dentries and there's no 2MB
> >> allocation possible), it seems rather usable in practice.
> >>      
> > One problem is that you need to keep a lot more memory free in order
> > for it to be reasonably effective.
> 
> It's the usual space-time tradeoff.  You don't want to do it on a 
> netbook, but it's worth it on a 16GB server, which is already not very 
> high end.

Agreed. BTW, when booting with transparent_hugepage=0, the in-kernel
set_recommended_min_free_kbytes logic won't run automatically during
the late_initcall invocation.

> Non-linear kernel mapping moves the small page problem from userspace 
> back to the kernel, a really unhappy solution.

Yeah, so we have hugepages in userland but we lose them in the kernel
;) and kmalloc runs as slow as vmalloc ;). I think kernelcore= is the
answer here when somebody asks for the math guarantee. We should just
focus on providing a math guarantee with kernelcore= and be done with
it.

Limiting the unmovable caches to a certain amount of RAM is orders of
magnitude more flexible and transparent (and absolutely unnoticeable)
than having to limit only hugepages to a certain amount at boot (so
they're unusable as regular anon memory, regular pagecache, or any
other movable entity), plus not being able to swap them, having to
mount filesystems, using LD_PRELOAD tricks etc... Furthermore with
hypervisor usage the unmovable stuff really isn't a big deal (1G is
more than enough for that even on monster servers) and we'll never
care about or risk hitting the limit. All we need is for the movable
memory to grow freely and dynamically, and to be able to spread all
over the RAM of the system automatically as needed.
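
For reference, the reservation being discussed is just a boot-time
kernel parameter; an illustrative (arbitrarily sized) example:

	linux ... kernelcore=1G

i.e. cap what the kernel may use for unmovable allocations at 1G and
put the remaining RAM into ZONE_MOVABLE, where everything stays
migratable and therefore compactable into hugepages.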

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:08                                           ` Andrea Arcangeli
@ 2010-04-12  7:21                                             ` Nick Piggin
  2010-04-12  7:50                                               ` Avi Kivity
  2010-04-12  8:06                                               ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  7:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote:
> On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote:
> > One problem is that you need to keep a lot more memory free in order
> > for it to be reasonably effective. Another thing is that the problem
> > of fragmentation breakdown is not just a one-shot event that fills
> > memory with pinned objects. It is a slow degradation.
> 
> set_recommended_min_free_kbytes doesn't seem to scale with RAM size;
> 60MB isn't such a big deal.
> 
> > Especially when you use something like SLUB as the memory allocator
> > which requires higher order allocations for objects which are pinned
> > in kernel memory.
> > 
> > Just running a few minutes of testing with a kernel compile in the
> > background does not show the full picture. You really need a box that
> > has been up for days running a proper workload before you are likely
> > to see any breakdown.
> > 
> > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> > get slower after X days of uptime. It's better to have consistent
> > performance really, for anything except pure benchmark setups.
> 
> All data I provided is very real: in addition to building a ton of
> packages and running emerge on /usr/portage I've been running all my
> real loads. The only problem is that I ran it for just a day and a
> half, but the load I kept it under was significant (surely a much
> bigger inode/dentry load than any hypervisor usage would ever
> generate).

OK, but for some kinds of very specific and already highly optimized
applications like RDBMS, HPC, hypervisor or JVM, they could just be
using hugepages themselves, couldn't they?

It seems more interesting as a more general speedup for applications
that can't afford such optimizations? (eg. the common case for
most people)

 
> > Defrag is not futile in theory, you just have to either have a reserve
> > of movable pages (and never allow pinned kernel pages in there), or
> > you need to allocate pinned kernel memory in units of the chunk size
> > goal (which just gives you different types of fragmentation problems)
> > or you need to do non-linear kernel mappings so you can defrag pinned
> > kernel memory (with *lots* of other problems of course). So you just
> > have a lot of downsides.
> 
> That's what the kernelcore= option does, no? Isn't that a good enough
> math guarantee? Probably we should use it in hypervisor products just
> in case, to be math-guaranteed to never have to fall back on VM
> migration as the last-resort (but definitive) defrag algorithm.

Yes we do have the option to reserve pages and as far as I know it
should work, although I can't remember whether it deals with mlock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:49                                           ` Ingo Molnar
@ 2010-04-12  7:35                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  7:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 08:49:40AM +0200, Ingo Molnar wrote:
> AFAIK that's what Andrea has done as a test - but yes, i agree that 
> fragmentation is the main design worry.

Well, I didn't only run a kernel compile for a couple of minutes to
show how memory compaction + the in-kernel
set_recommended_min_free_kbytes behaved on my system. I can't claim my
numbers are conclusive as it only ran for a day and a half, but there
was some real unmovable load on it. Plus uptime isn't the only
variable: if you use the kernel to create a hypervisor product, you
can leave it running VMs for much longer than a day, and I guess it
won't ever generate the amount of unmovable load that I generated in a
day and a half.

I built a ton of packages including gcc and bison (which in javac
triggered the anon-vma bug before I backed it out), plus quite some
other stuff that comes as a regular update with a couple of emerge
world runs, like kvirc and the like. There was mutt on the lkml and
linux-mm maildirs with some hundred thousand inodes for the email, and
a dozen kernel builds and git checkouts to verify my aa.git tree.
That's what I can recall. After a day and a half I still had ~80% of
the unallocated RAM in order 9 and maybe ~75% (from memory; it could
have been more or less, I don't remember exactly, but I posted the
exact buddyinfo so you can calculate it yourself if curious) in order
10 == MAX_ORDER. The vast majority of the free RAM was in order 10
after echo 3 >drop_caches and echo >compact_memory, which simulates
the maximum ability of the VM to generate hugepages dynamically (of
course it won't ever create such a totally compacted buddyinfo at
runtime, as we don't want to shrink or compact stuff unless it's
really needed). Likely if I had killed mutt and the other running apps
and run drop_caches and memory compaction again, I would have gotten
an even higher ratio as a result of more memory being freeable.
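
(For the record, the knobs used in that simulation, roughly - assuming
the /proc interface from Mel's compaction patches; any write to
compact_memory triggers it:)

	sync
	echo 3 > /proc/sys/vm/drop_caches    # drop clean pagecache + dentries/inodes
	echo 1 > /proc/sys/vm/compact_memory # compact all zones
	cat /proc/buddyinfo                  # free pages per order, per zone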

A day and a half isn't enough, but it was initial data, and then I had
to reboot into a new #20 release to test a memleak fix I did in
do_huge_pmd_wp_page_fallback... I'll try to run it for a longer time
now. I guess I'll be rebuilding quite some glibc on my system as we
optimize it for the kernel.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:15                                             ` Nick Piggin
@ 2010-04-12  7:45                                               ` Avi Kivity
  2010-04-12  8:28                                                 ` Nick Piggin
  2010-04-12  7:51                                               ` Ingo Molnar
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-12  7:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 10:15 AM, Nick Piggin wrote:
>
>>> Another thing is that the problem
>>> of fragmentation breakdown is not just a one-shot event that fills
>>> memory with pinned objects. It is a slow degradation.
>>>
>>> Especially when you use something like SLUB as the memory allocator
>>> which requires higher order allocations for objects which are pinned
>>> in kernel memory.
>>>        
>> Won't the usual antifrag tactics apply?  Try to allocate those
>> objects from the same block.
>>      
> "try" is the key point.
>    

We use the "try" tactic extensively.  So long as there's a reasonable 
chance of success, and a reasonable fallback on failure, it's fine.

Do you think we won't have reasonable success rates?  Why?

>
>    
>>> Just running a few minutes of testing with a kernel compile in the
>>> background does not show the full picture. You really need a box that
>>> has been up for days running a proper workload before you are likely
>>> to see any breakdown.
>>>        
>> I'm sure we'll be able to generate worst-case scenarios.  I'm also
>> reasonably sure we'll be able to deal with them.  I hope we won't
>> need to, but it's even possible to move dentries around.
>>      
> Pinned dentries? (which are the problem) That would be insane.
>    

Why?  If you can isolate all the pointers into the dentry, allocate the 
new dentry, make the old one point into the new one, hash it, move the 
pointers, drop the old dentry.

Difficult, yes, but insane?
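
Roughly the sequence would be (purely hypothetical pseudocode - none
of these helpers or fields exist, and the locking that would make it
safe is exactly the hard part being glossed over):

/* hypothetical sketch of relocating a pinned dentry */
struct dentry *relocate_dentry(struct dentry *old)
{
	struct dentry *new = alloc_dentry_copy(old); /* allocate the new dentry */

	block_lookups_on(old);        /* isolate all pointers into the old one */
	old->d_moved_to = new;        /* make the old one point into the new one */
	rehash_dentry(new);           /* hash it */
	repoint_references(old, new); /* move the pointers */
	dput_final(old);              /* drop the old dentry */
	return new;
}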

>>> I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
>>> get slower after X days of uptime. It's better to have consistent
>>> performance really, for anything except pure benchmark setups.
>>>        
>> If that were the case we'd disable caches everywhere.  General
>>      
> No we wouldn't. You can have consistent, predictable performance with
> caches.
>    

Caches have statistical performance.  In the long run they average out.  
In the short run they can behave badly.  Same thing with large pages, 
except the runs are longer and the wins are smaller.

>> purpose computing is a best effort thing, we try to be fast on the
>> common case but we'll be slow on the uncommon case.  Access to a bit
>>      
> Sure. And the common case for production systems like VM or database
> servers that are up for hundreds of days is when they are running with
> a lot of uptime. Common case is not a fresh reboot into a 3 hour
> benchmark setup.
>    

Databases are the easiest case: they allocate memory up front and don't
give it up.  We'll coalesce their memory immediately and they'll run 
happily ever after.

Virtualization will fragment on overcommit, but the load is all 
anonymous memory, so it's easy to defragment.  Very little dcache on the 
host.

>> Non-linear kernel mapping moves the small page problem from
>> userspace back to the kernel, a really unhappy solution.
>>      
> Not unhappy for userspace intensive workloads. And user working sets
> I'm sure are growing faster than kernel working set. Also there would
> be nothing against compacting and merging kernel memory into larger
> pages.
>    

Well, I'm not against it, but that would be a much more intrusive change 
than what this thread is about.  Also, you'd need 4K dentries etc, no?

>> Very large (object count, not object size) kernel caches can be
>> addressed by compacting them, but I hope we won't need to do that.
>>      
> You can't say that fragmentation is not a fundamental problem.  And
> adding things like indirect pointers or other weird crap that adds
> complexity to code dealing with KVA is, IMO, not acceptable. So you
> can't just assert that you can "address" the problem.
>    

Mostly we need a way of identifying pointers into a data structure, like 
rmap (after all that's what makes transparent hugepages work).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-11 16:04                                                           ` Avi Kivity
@ 2010-04-12  7:45                                                             ` Ingo Molnar
  2010-04-12  8:14                                                               ` Nick Piggin
  0 siblings, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  7:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Arjan van de Ven


* Avi Kivity <avi@redhat.com> wrote:

> On 04/11/2010 06:52 PM, Linus Torvalds wrote:
> >
> >On Sun, 11 Apr 2010, Avi Kivity wrote:
> >>
> >> And yet Oracle and java have options to use large pages, and we know 
> >> google and HPC like 'em.  Maybe they just haven't noticed the fundamental 
> >> brokenness yet.

( Add Firefox to the mix too - it too allocates in 1MB/2MB chunks. Perhaps 
  Xorg as well. )

> > The thing is, what you are advocating is what traditional UNIX did. 
> > Prioritizing the special cases rather than the generic workloads.
> >
> > And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead 
> > exactly _because_ it prioritized those kinds of loads.
> 
> This is not a specialized workload.  Plenty of sites are running java, 
> plenty of sites are running Oracle (though that won't benefit from anonymous 
> hugepages), and plenty of sites are running virtualization.  Not everyone 
> does two kernel builds before breakfast.

Java/virtualization/DBs and, in a certain sense, Firefox have basically become 
meta-kernels: they offer their own intermediate APIs to their own style of 
apps - and those apps generally have no direct access to the native Linux 
kernel.

And just like the native kernel has been enjoying the benefits of 2MB pages 
for more than a decade, these other entities want to enjoy similar benefits 
as well. Fair is fair.

Like it or not, combined end-user attention/work spent in these meta-kernels 
is rising steadily, while apps written in raw C are becoming the exception.

So IMHO we really have roughly three logical choices:

 1) either we accept that the situation is the fault of our technology and 
    subsequently we reform and modernize the Linux syscall ABIs to be more 
    friendly to apps (offer built-in GC and perhaps JIT concepts, perhaps 
    offer a compiler, offer a wider range of libraries with better 
    integration, etc.)

 2) or we accept the fact that the application space is shifting to the
    meta-kernels - and then we should aggressively optimize Linux for those
    meta-kernels and not pretend that they are 'specialized'. They literally
    represent tens of thousands of applications apiece.

 3) or we should continue to muddle through somewhere in the middle, hoping 
    that the 'pure C apps' win in the end (despite 10 years of a decline) and
    pretend that the meta-kernels are just 'specialized' workloads.

Right now we are doing 3) and i think it's delusive and a mistake. I think we 
should be doing 1) - but failing that we have to be honest and do 2).

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:21                                             ` Nick Piggin
@ 2010-04-12  7:50                                               ` Avi Kivity
  2010-04-12  8:07                                                 ` Ingo Molnar
  2010-04-12  8:18                                                 ` Andrea Arcangeli
  2010-04-12  8:06                                               ` Andrea Arcangeli
  1 sibling, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12  7:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Ingo Molnar, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/12/2010 10:21 AM, Nick Piggin wrote:
>>
>> All data I provided is very real: in addition to building a ton of
>> packages and running emerge on /usr/portage I've been running all my
>> real loads. The only problem is that I ran it for just a day and a
>> half, but the load I kept it under was significant (surely a much
>> bigger inode/dentry load than any hypervisor usage would ever generate).
>>      
> OK, but for some kinds of very specific and already highly optimized
> applications like RDBMS, HPC, hypervisor or JVM, they could just be
> using hugepages themselves, couldn't they?
>
> It seems more interesting as a more general speedup for applications
> that can't afford such optimizations? (eg. the common case for
> most people)
>    

The problem with hugetlbfs is that you need to commit upfront to using 
it, and that you need to be the admin.  For virtualization, you want to 
use hugepages when there is no memory pressure, but you want to use ksm, 
ballooning, and swapping when there is (and then go back to large pages 
when pressure is relieved, e.g. by live migration).

HPC and databases can probably live with hugetlbfs.  JVM is somewhere in 
the middle, they do allocate memory dynamically.
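
For contrast, the upfront commitment looks roughly like this
(illustrative commands; the page count and mount point are arbitrary):

	# as root, ideally near boot before memory fragments:
	echo 1024 > /proc/sys/vm/nr_hugepages   # reserve 1024 x 2MB pages
	mount -t hugetlbfs none /mnt/huge       # expose them to applications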


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:15                                             ` Nick Piggin
  2010-04-12  7:45                                               ` Avi Kivity
@ 2010-04-12  7:51                                               ` Ingo Molnar
  1 sibling, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  7:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura


* Nick Piggin <npiggin@suse.de> wrote:

> [...] Common case is not a fresh reboot into a 3 hour benchmark setup.

Again - that's not what Andrea has done as a test: he has tested an atypically 
intense workload for more than a day.

Which, if it's true, is good enough as far as i'm concerned - even if we 
assume that it deteriorates after 2 days of uptime. If after a day of intense 
uptime it's still usable then a few seconds of a dcache compaction run (spread 
out over a day) doesn't look unrealistic.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:21                                             ` Nick Piggin
  2010-04-12  7:50                                               ` Avi Kivity
@ 2010-04-12  8:06                                               ` Andrea Arcangeli
  2010-04-12 10:44                                                 ` Mel Gorman
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  8:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Avi Kivity, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 05:21:44PM +1000, Nick Piggin wrote:
> On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote:
> > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote:
> > > One problem is that you need to keep a lot more memory free in order
> > > for it to be reasonably effective. Another thing is that the problem
> > > of fragmentation breakdown is not just a one-shot event that fills
> > > memory with pinned objects. It is a slow degradation.
> > 
> > set_recommended_min_free_kbytes doesn't seem to scale with RAM size;
> > 60MB isn't such a big deal.
> > 
> > > Especially when you use something like SLUB as the memory allocator
> > > which requires higher order allocations for objects which are pinned
> > > in kernel memory.
> > > 
> > > Just running a few minutes of testing with a kernel compile in the
> > > background does not show the full picture. You really need a box that
> > > has been up for days running a proper workload before you are likely
> > > to see any breakdown.
> > > 
> > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> > > get slower after X days of uptime. It's better to have consistent
> > > performance really, for anything except pure benchmark setups.
> > 
> > All data I provided is very real: in addition to building a ton of
> > packages and running emerge on /usr/portage I've been running all my
> > real loads. The only problem is that I ran it for just a day and a
> > half, but the load I kept it under was significant (surely a much
> > bigger inode/dentry load than any hypervisor usage would ever generate).
> 
> OK, but for some kinds of very specific and already highly optimized
> applications like RDBMS, HPC, hypervisor or JVM, they could just be
> using hugepages themselves, couldn't they?
>
> It seems more interesting as a more general speedup for applications
> that can't afford such optimizations? (eg. the common case for
> most people)

The reality is that very few are using hugetlbfs. I guess maybe 0.1%
of KVM instances on Phenom/Nehalem chips are running on hugetlbfs, for
example (hugetlbfs boot reservation doesn't fit the cloud, where you
need all RAM available in hugetlbfs and you still need 100% of the
unused RAM as host pagecache for VDI), despite the fact that it would
provide a >=6% boost to all VMs no matter what's running in the guest.
Same goes for the JVM: maybe 0.1% of those run on hugetlbfs. The
commercial DBMSs are the exception, probably closer to 99% running on
hugetlbfs (and they'll have to keep using hugetlbfs until we move
transparent hugepages into tmpfs).

So there's a ton of wasted energy in my view. Like Ingo said, the
faster they make the chips and the cheaper the RAM becomes, the more
energy is wasted as a result of not using hugepages. The gap between
cache sizes and RAM sizes keeps growing, and so does the gap between
cache speeds and RAM speeds. I don't see this trend ending, and I
can't see a better CPU coming that would make hugepage support
worthless and unselectable at kernel configure time on the x86 arch
(if you build without generic).

And I don't think it's feasible to ship a distro where 99% of apps
that can benefit from hugepages are running with
LD_PRELOAD=libhugetlbfs.so. It has to be transparent if we want to
stop the waste.
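
(The LD_PRELOAD trick in question, for reference - an illustrative
invocation assuming libhugetlbfs's morecore override:)

	# back the heap with hugepages without modifying the binary
	LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./app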

The main reason I've always been skeptical about transparent hugepages
before I started working on this is the mess they generate across the
whole kernel. So my priority of course has been to keep it as
self-contained as possible. It kept spilling over and over until I
managed to confine it to anonymous pages and fix whole mm/*.c files
with just a one-liner (even the hugepage-aware implementation that
Johannes did still takes advantage of split_huge_page_pmd if the
mprotect start/end isn't 2M naturally aligned, just to show how
complex it would be to do it all at once). This will allow us to reach
a solid base, and then later move to tmpfs and maybe later to
pagecache and swapcache too. Expecting the whole kernel to become
hugepage aware at once is a total mess: gup would need to return only
head pages for example, breaking hundreds of drivers in just that one
change. The compound_lock can be removed after you fix all those
hundreds of drivers and subsystems using gup... No big deal to remove
it later, kind of like the big kernel lock is being removed these
days, 14 years after it was introduced.

Plus I did all I could to try to keep it as black and white as
possible. I think other OSes are more gray in their approaches; my
priority has been to pay with RAM anywhere I could if you set
enabled=always, and to decrease as much as I could any risk of
performance regressions in any workload. These days we can afford to
lose 1G without much worry if it speeds up the workload by 8%, so I
think the other designs are better suited to RAM-constrained old
hardware and aren't very relevant today. On embedded, with my patchset
one should set enabled=madvise. Ingo suggested a per-process tweak to
enable it selectively on certain apps; that is feasible too in the
future (so people won't be forced to modify binaries to add madvise if
they can't leave enabled=always).
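
For completeness, the madvise opt-in is a one-liner in the
application; a minimal sketch (patch 01 of this series carries the
real MADV_HUGEPAGE definition; the numeric fallback and mmap sizing
below are illustrative):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14 /* illustrative; see patch 01 for the real definition */
#endif

int main(void)
{
	size_t len = 64UL << 20; /* 64MB: room for many 2MB-aligned hugepages */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* hint: back this range with transparent hugepages when possible */
	madvise(p, len, MADV_HUGEPAGE);
	return 0;
}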

> Yes we do have the option to reserve pages and as far as I know it
> should work, although I can't remember whether it deals with mlock.

I think that is the right route to take for whoever needs the math
guarantees, and for many products enforcing the math guarantee won't
even be noticeable. It's kind of like overcommit: some prefer the = 2
version and maybe don't even notice that it allows them to allocate
less memory. Others prefer to be able to allocate RAM without
accounting for the unused virtual regions, despite the bigger chance
of running into the OOM killer (and I'm in the latter camp for both
the overcommit sysctl and kernelcore= ;).
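
(The "= 2 version" refers to the strict-accounting overcommit mode;
for reference:)

	sysctl vm.overcommit_memory=2   # strict: refuse allocations past the commit limit
	sysctl vm.overcommit_memory=0   # the default heuristic overcommit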

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:50                                               ` Avi Kivity
@ 2010-04-12  8:07                                                 ` Ingo Molnar
  2010-04-12  8:21                                                   ` Andrea Arcangeli
  2010-04-12 10:27                                                   ` Mel Gorman
  2010-04-12  8:18                                                 ` Andrea Arcangeli
  1 sibling, 2 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  8:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Andrea Arcangeli, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Avi Kivity <avi@redhat.com> wrote:

> On 04/12/2010 10:21 AM, Nick Piggin wrote:
> >>
> >>All data I provided is very real: in addition to building a ton of
> >>packages and running emerge on /usr/portage I've been running all my
> >>real loads. The only problem is that I ran it for just a day and a
> >>half, but the load I kept it under was significant (surely a much
> >>bigger inode/dentry load than any hypervisor usage would ever generate).
> >OK, but for some kinds of very specific and already highly optimized
> >applications like RDBMS, HPC, hypervisor or JVM, they could just be
> >using hugepages themselves, couldn't they?
> >
> > It seems more interesting as a more general speedup for applications that 
> > can't afford such optimizations? (eg. the common case for most people)
> 
> The problem with hugetlbfs is that you need to commit upfront to using it, 
> and that you need to be the admin.  For virtualization, you want to use 
> hugepages when there is no memory pressure, but you want to use ksm, 
> ballooning, and swapping when there is (and then go back to large pages when 
> pressure is relieved, e.g. by live migration).
> 
> HPC and databases can probably live with hugetlbfs.  JVM is somewhere in the 
> middle, they do allocate memory dynamically.

Even for HPC hugetlbfs is often not good enough: if the data is being 
constantly acquired and put into a file and if it needs to be in persistent 
storage then you don't want to (and cannot) copy it to hugetlbfs (on a poweroff 
you would lose the file).

Furthermore there's also the deployment barrier of marginal improvements: not 
many apps are willing to change for a +0.1% improvement - or even for a +0.9% 
improvement - _especially_ if that improvement also needs admin access and per 
distribution hackery. (each distribution tends to have their own slightly 
different way of handling filesystems and other permission/configuration 
matters)

We've seen that with sendfile() and splice() and it's no different with 
hugetlbfs either.

hugetlbfs is basically a non-default poor-man's solution for something that 
the kernel should be providing transparently. It's a bad hack that is good 
enough to prototype that something works, but it has serious deployment, 
configuration and usage limitations. Only a kernel hacker detached from 
everyday application development and packaging constraints can believe that 
it's a high-quality technical solution.

Transparent hugepages eliminate most of the app-visible disadvantages by 
shuffling the problems into the kernel [and no doubt causing follow-on 
headaches there] and by utilizing the 'power of the default' - and thus 
opening up hugetlbs to far more apps. [*]

It's a really simple mechanism.

Thanks,

	Ingo

[*] Note, it would be even better if the kernel provided the C library [a.k.a. 
    klibc] and if hugetlbs could be utilized via malloc() et al more 
    transparently by us changing the user-space library in the kernel repo and 
    deploying it to apps via a new kernel that provides an updated C library. 
    We dont do that so we are stuck with crappier solutions and slower 
    propagation of changes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-12  7:45                                                             ` Ingo Molnar
@ 2010-04-12  8:14                                                               ` Nick Piggin
  2010-04-12  8:22                                                                 ` Ingo Molnar
  2010-04-12  8:45                                                                 ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  8:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Linus Torvalds, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On Mon, Apr 12, 2010 at 09:45:57AM +0200, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> > On 04/11/2010 06:52 PM, Linus Torvalds wrote:
> > >
> > >On Sun, 11 Apr 2010, Avi Kivity wrote:
> > >>
> > >> And yet Oracle and java have options to use large pages, and we know 
> > >> google and HPC like 'em.  Maybe they just haven't noticed the fundamental 
> > >> brokenness yet.
> 
> ( Add Firefox to the mix too - it too allocates in 1MB/2MB chunks. Perhaps 
>   Xorg as well. )
> 
> > > The thing is, what you are advocating is what traditional UNIX did. 
> > > Prioritizing the special cases rather than the generic workloads.
> > >
> > > And I'm telling you, it's wrong. Traditional Unix is dead, and it's dead 
> > > exactly _because_ it prioritized those kinds of loads.
> > 
> > This is not a specialized workload.  Plenty of sites are running java, 
> > plenty of sites are running Oracle (though that won't benefit from anonymous 
> > hugepages), and plenty of sites are running virtualization.  Not everyone 
> > does two kernel builds before breakfast.
> 
> Java/virtualization/DBs and, in a certain sense, Firefox have basically become 
> meta-kernels: they offer their own intermediate APIs to their own style of 
> apps - and those apps generally have no direct access to the native Linux 
> kernel.
> 
> And just like the native kernel has been enjoying the benefits of 2MB pages 
> for more than a decade, these other entities want to enjoy similar benefits 
> as well. Fair is fair.
> 
> Like it or not, combined end-user attention/work spent in these meta-kernels 
> is rising steadily, while apps written in raw C are becoming the exception.
> 
> So IMHO we really have roughly three logical choices:

I don't see how these are the logical choices. I don't really see how
they are even logical in some ways. Let's say that Andrea's patches
offer 5% improvement in best-cases (that are not stupid microbenchmarks)
and 0% in worst cases, and X% "on average" (whatever that means). Then
it is simply a set of things to weigh against the added complexity (both
in terms of code and performance characteristics of the system) that it
introduces.

I don't really see how it is fundamentally different to any other patch
that speeds things up.

 
>  1) either we accept that the situation is the fault of our technology and 
>     subsequently we reform and modernize the Linux syscall ABIs to be more 
>     friendly to apps (offer built-in GC and perhaps JIT concepts, perhaps 
>     offer a compiler, offer a wider range of libraries with better 
>     integration, etc.)

I don't see how this would bring transparent hugepages to userspace. We
may offload some services to the kernel, but the *memory mappings* that
get used by userspace obviously still go through TLBs.

 
>  2) or we accept the fact that the application space is shifting to the
>     meta-kernels - and then we should aggressively optimize Linux for those
>     meta-kernels and not pretend that they are 'specialized'. They literally
>     represent tens of thousands of applications apiece.

And if meta-kernels (or whatever you want to call a common or important
workload) see some speedup that is deemed to be worth the cost of the
patch, then it will probably get merged. Same as anything else.

 
>  3) or we should continue to muddle through somewhere in the middle, hoping 
>     that the 'pure C apps' win in the end (despite 10 years of a decline) and
>     pretend that the meta-kernels are just 'specialized' workloads.

'pure C apps' (I don't know what you mean by this, but just non-GC
memory?) can still see benefits from using hugepages.

And I wouldn't say we're muddling through. Linux has been one of the
most successful OS kernels of the last 10 years, if not the most
successful, and not because of muddling. IMO in large part it is
because we haven't been forced to tick boxes for marketing idiots or
be pressured by special interests to the detriment of the common
cases.


> Right now we are doing 3) and i think it's delusive and a mistake. I think we 
> should be doing 1) - but failing that we have to be honest and do 2).

Nothing wrong with carefully evaluating a performance improvement, but
there is no urgent or fundamental reason to lose our heads and be
irrational about it. If the world was coming to an end without
hugepages, then we'd see more than a 5% improvement, I would have
thought.

Fact is that computing is based on locality of reference, and
performance has continued to scale long past the big bad "memory wall"
because real working set sizes (on the scale of CPU instructions, not on
the scale of page reclaim) have not grown linearly with RAM sizes.
Probably logarithmically or something. Sure there are some pointer
chasing apps that will always (and ~have always) suck. We are also
irreversibly getting into explicit parallelism (like multi core and
multi threading) to work around all sorts of fundamental limits to
single thread performance, not just TLB filling.

So let's not be melodramatic about this :)


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:50                                               ` Avi Kivity
  2010-04-12  8:07                                                 ` Ingo Molnar
@ 2010-04-12  8:18                                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  8:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 10:50:33AM +0300, Avi Kivity wrote:
> The problem with hugetlbfs is that you need to commit upfront to using 
> it, and that you need to be the admin.  For virtualization, you want to 
> use hugepages when there is no memory pressure, but you want to use ksm, 
> ballooning, and swapping when there is (and then go back to large pages 
> when pressure is relieved, e.g. by live migration).
> 
> HPC and databases can probably live with hugetlbfs.  JVM is somewhere in 
> the middle, they do allocate memory dynamically.

I guess lots of the recent work on hugetlbfs has been meant exactly to
make hugetlbfs more palatable to things like the JVM; the end result
is that it's growing into its own parallel VM, but one still very
crippled compared to the real kernel VM.

I see very long term value in hugetlbfs, for example for CPUs that
can't mix different page sizes in the same VMA, or for the 1G page
reservation (no way we're going to slow down everything by increasing
MAX_ORDER so much by default, even if fragmentation issues didn't grow
exponentially with the order), but I think hugetlbfs should remain
simple and cover these use cases optimally, without trying to expand
into the dynamic area of transparent usage where it wasn't designed to
be used in the first place and where it's not a good fit.
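
For concreteness, a minimal userspace sketch of the two models
(illustrative only, not code from this patchset; it assumes a kernel
that supports MAP_HUGETLB plus the MADV_HUGEPAGE hint this patchset
introduces):

#include <stddef.h>
#include <sys/mman.h>

/* Explicit model: hugetlbfs-backed mapping. This fails outright
 * unless the admin reserved huge pages up front (nr_hugepages). */
void *explicit_map(size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

/* Transparent model: an ordinary anonymous mapping plus a hint; the
 * kernel uses huge pages opportunistically and can still split or
 * swap them later under pressure. */
void *transparent_map(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != MAP_FAILED)
		madvise(p, len, MADV_HUGEPAGE);
	return p;
}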


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  8:07                                                 ` Ingo Molnar
@ 2010-04-12  8:21                                                   ` Andrea Arcangeli
  2010-04-12 10:27                                                   ` Mel Gorman
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  8:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Nick Piggin, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 10:07:48AM +0200, Ingo Molnar wrote:
> configuration and usage limitations. Only a kernel hacker detached from 
> everyday application development and packaging constraints can believe that 
> it's a high-quality technical solution.

That made my day ;)


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-12  8:14                                                               ` Nick Piggin
@ 2010-04-12  8:22                                                                 ` Ingo Molnar
  2010-04-12  8:34                                                                   ` Nick Piggin
  2010-04-12  8:45                                                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-12  8:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Linus Torvalds, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven


* Nick Piggin <npiggin@suse.de> wrote:

> >  2) or we accept the fact that the application space is shifting to the
> >     meta-kernels - and then we should aggressively optimize Linux for those
> >     meta-kernels and not pretend that they are 'specialized'. They literally
> >     represent tens of thousands of applications apiece.
> 
> And if meta-kernels (or whatever you want to call a common or important 
> workload) see some speedup that is deemed to be worth the cost of the patch, 
> then it will probably get merged. Same as anything else.

I call a 'meta kernel' something that people code thousands of apps for, 
instead of coding on the native kernel. JVM/DBs/Firefox are such frameworks. 
(you can call it middleware i guess)

By all means they are not a 'single special-purpose workload' but represent 
literally tens of thousands of apps.

	Ingo


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  7:45                                               ` Avi Kivity
@ 2010-04-12  8:28                                                 ` Nick Piggin
  2010-04-12  9:01                                                   ` Andrea Arcangeli
  2010-04-12  9:03                                                   ` Avi Kivity
  0 siblings, 2 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  8:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 10:45:16AM +0300, Avi Kivity wrote:
> On 04/12/2010 10:15 AM, Nick Piggin wrote:
> >
> >>>Another thing is that the problem
> >>>of fragmentation breakdown is not just a one-shot event that fills
> >>>memory with pinned objects. It is a slow degradation.
> >>>
> >>>Especially when you use something like SLUB as the memory allocator
> >>>which requires higher order allocations for objects which are pinned
> >>>in kernel memory.
> >>Won't the usual antifrag tactics apply?  Try to allocate those
> >>objects from the same block.
> >"try" is the key point.
> 
> We use the "try" tactic extensively.  So long as there's a
> reasonable chance of success, and a reasonable fallback on failure,
> it's fine.
> 
> Do you think we won't have reasonable success rates?  Why?

After the memory is fragmented? It's more or less irreversible. So
success rates (to fill a specific number of huge pages) will be fine
up to a point. Then it will be a continual failure.

Sure, some workloads simply won't trigger fragmentation problems.
Others will.


> >>>Just running a few minutes of testing with a kernel compile in the
> >>>background does not show the full picture. You really need a box that
> >>>has been up for days running a proper workload before you are likely
> >>>to see any breakdown.
> >>I'm sure we'll be able to generate worst-case scenarios.  I'm also
> >>reasonably sure we'll be able to deal with them.  I hope we won't
> >>need to, but it's even possible to move dentries around.
> >Pinned dentries? (which are the problem) That would be insane.
> 
> Why?  If you can isolate all the pointers into the dentry, allocate
> the new dentry, make the old one point into the new one, hash it,
> move the pointers, drop the old dentry.
> 
> Difficult, yes, but insane?

Yes.

 
> >>>I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> >>>get slower after X days of uptime. It's better to have consistent
> >>>performance really, for anything except pure benchmark setups.
> >>If that were the case we'd disable caches everywhere.  General
> >No we wouldn't. You can have consistent, predictable performance with
> >caches.
> 
> Caches have statistical performance.  In the long run they average
> out.  In the short run they can behave badly.  Same thing with large
> pages, except the runs are longer and the wins are smaller.

You don't understand. Caches don't suddenly or slowly stop working.
For a particular pattern of workload, they statistically pretty much
work the same all the time.

 
> >>purpose computing is a best effort thing, we try to be fast on the
> >>common case but we'll be slow on the uncommon case.  Access to a bit
> >Sure. And the common case for production systems like VM or database
> >servers that are up for hundreds of days is when they are running with
> >a lot of uptime. Common case is not a fresh reboot into a 3 hour
> >benchmark setup.
> 
> Databases are the easiest case, they allocate memory up front and
> don't give it up.  We'll coalesce their memory immediately and
> they'll run happily ever after.

Again, you're thinking about a benchmark setup. If you've got various
admin things, backups, scripts running, probably web servers,
application servers etc., then it's not all that simple.

And yes, Linux works pretty well for a multi-workload platform. You
might be thinking too much about virtualization where you put things
in sterile little boxes and take the performance hit.

 
> Virtualization will fragment on overcommit, but the load is all
> anonymous memory, so it's easy to defragment.  Very little dcache on
> the host.

If virtualization is the main worry (which it seems it is, seeing as
your TLB misses cost something like 6 times more cachelines),
then complexity should be pushed into the hypervisor, not the
core kernel.


> >>Non-linear kernel mapping moves the small page problem from
> >>userspace back to the kernel, a really unhappy solution.
> >Not unhappy for userspace intensive workloads. And user working sets
> >I'm sure are growing faster than kernel working set. Also there would
> >be nothing against compacting and merging kernel memory into larger
> >pages.
> 
> Well, I'm not against it, but that would be a much more intrusive
> change than what this thread is about.  Also, you'd need 4K dentries
> etc, no?

No. You'd just be defragmenting 4K worth of dentries at a time.
Dentries (and anything that doesn't care about untranslated KVA)
are trivial. Zero change for users of the code.

This is going off-topic though, I don't want to hijack the thread
with talk of nonlinear kernel.

 
> >>Very large (object count, not object size) kernel caches can be
> >>addressed by compacting them, but I hope we won't need to do that.
> >You can't say that fragmentation is not a fundamental problem.  And
> >adding things like indirect pointers or weird crap adding complexity
> >to code that deals with KVA IMO is not acceptable. So you can't
> >just assert that you can "address" the problem.
> 
> Mostly we need a way of identifying pointers into a data structure,
> like rmap (after all that's what makes transparent hugepages work).

And that involves auditing and rewriting anything that allocates
and pins kernel memory. It's not only dentries.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-12  8:22                                                                 ` Ingo Molnar
@ 2010-04-12  8:34                                                                   ` Nick Piggin
  2010-04-12  8:47                                                                     ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  8:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Linus Torvalds, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On Mon, Apr 12, 2010 at 10:22:18AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > >  2) or we accept the fact that the application space is shifting to the
> > >     meta-kernels - and then we should aggressively optimize Linux for those
> > >     meta-kernels and not pretend that they are 'specialized'. They literally
> > >     represent tens of thousands of applications apiece.
> > 
> > And if meta-kernels (or whatever you want to call a common or important 
> > workload) see some speedup that is deemed to be worth the cost of the patch, 
> > then it will probably get merged. Same as anything else.
> 
> I call a 'meta kernel' something that people code thousands of apps for, 
> instead of coding on the native kernel. JVM/DBs/Firefox are such frameworks. 
> (you can call it middleware i guess)
> 
> By all means they are not a 'single special-purpose workload' but represent 
> literally tens of thousands of apps.

I don't think I said anything like 'single special-purpose workload'. I
said 'common or important workload'. And they are not fundamentally
different (in the context of evaluating and accepting a performance
improvement) from any other workload.

I'm not saying they don't matter.

The interesting fact is also that this type of thing is much more
suitable for optimisation tricks. JVMs and RDBMS typically can
make use of hugepages already, for example.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-12  8:14                                                               ` Nick Piggin
  2010-04-12  8:22                                                                 ` Ingo Molnar
@ 2010-04-12  8:45                                                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  8:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Avi Kivity, Linus Torvalds, Jason Garrett-Glaser,
	Mike Galbraith, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On Mon, Apr 12, 2010 at 06:14:31PM +1000, Nick Piggin wrote:
> I don't see how these are the logical choices. I don't really see how
> they are even logical in some ways. Let's say that Andrea's patches
> offer a 5% improvement in the best cases (that are not stupid
> microbenchmarks), 0% in the worst cases, and X% "on average" (whatever
> that means). Then it is simply a set of things to weigh against the
> added complexity (both in terms of code and in the performance
> characteristics of the system) that it introduces.

The gcc 8% boost with translate.o has been ruled out as a useless
benchmark, but note that it definitely isn't. Yeah, maybe one can
write that .c file so it won't take 22 seconds to build, but that's
not the point. I wanted to demonstrate that there will be lots of
other apps taking advantage of this. Linux isn't used only to run gcc;
people run simulations that grow to unknown amounts of memory on a
daily basis on Linux. I used gcc as the example to show that even a
gcc file we build maybe 2 times a day gets an 8% boost, and because
gcc is the most commonly run, purely CPU bound, compute intensive
program that we're familiar with. If I was building chips instead of
writing kernel code, I would have run one of those simulations instead
of gcc building qemu-kvm translate.

And once I no longer have to run khugepaged to move all gcc memory
into hugepages, maybe even the kernel build will get a boost ("maybe"
because I'm not convinced, it sounds too good to be true, but I will
try it out later out of curiosity ;).

So I think what we can quite safely claim so far is that in real life
the "_best_ case" improvement on a host without virt is really ~8%
(a much bigger boost, >15%, has already been measured with virt; the
best case of virt I don't know yet).

> I don't really see how it is fundamentally different to any other patch
> that speeds things up.

This is exactly true, the speedup has to be balanced against the
complexity introduced.

I'll add a few more points that can help the evaluation.

You can be 100% sure this can't destabilize *anything* if you echo
never >enabled or boot with transparent_hugepage=0. Furthermore, if
you build for embedded and set CONFIG_TRANSPARENT_HUGEPAGE=n, 99% of
the new code won't even be built.
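
(A minimal sketch of that runtime kill switch from userspace; the full
sysfs path below is an assumption based on the knob this patchset
adds, and the echo above is the simpler way to do the same thing:)

#include <fcntl.h>
#include <unistd.h>

/* Sketch: the programmatic equivalent of
 * "echo never > /sys/kernel/mm/transparent_hugepage/enabled". */
static int thp_disable(void)
{
	int fd = open("/sys/kernel/mm/transparent_hugepage/enabled",
		      O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "never", 5);
	close(fd);
	return n == 5 ? 0 : -1;
}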

The 8% best case speedup should be reproducible on all hardware, from
my $150 workstation (maybe even on UP x86 32bit, and even Atom UP) to
a 4096-cpu numa system (where hopefully it'll be more than 8% because
of the much bigger skew between the l2 cache in the core and remote
numa memory).

The 8% boost will surely be possible to reproduce with really
optimally written apps, and it's not only AIM.

It's not like the anon-vma design change, which micro-slows-down the
fast paths, makes heads hurt, cannot be disabled at runtime, and only
shows a boost in badly designed apps (Avi once told me fork is
useless; well, I don't entirely agree, but surely it's not something
good apps should be heavy users of, it's more about going simpler for
something not really enterprise or performance critical; the fact
certain DBs use fork is I think caused by proprietary source designs
and not technical issues).

It's not like speculative pagecache, which boosts only certain
workloads, and only if you have that many CPUs on a large SMP, and
which cannot be opted out of or disabled if it's unstable.

So maybe it's more complex, but it's zero risk if disabled at runtime
or compile time, and it provides a constant speedup to optimally
written apps (a huge speedup in the case of EPT/NPT). And yeah, it'd
be cool if there was a better CPU than the ones with EPT/NPT; surely
if somebody could invent something better than that, tons of people
would be interested, considering how little stuff (Google being one of
the exceptions) runs on bare metal these days.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: hugepages will matter more in the future
  2010-04-12  8:34                                                                   ` Nick Piggin
@ 2010-04-12  8:47                                                                     ` Avi Kivity
  0 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12  8:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Linus Torvalds, Jason Garrett-Glaser,
	Mike Galbraith, Andrea Arcangeli, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On 04/12/2010 11:34 AM, Nick Piggin wrote:
> The interesting fact is also that this type of thing is much more
> suitable for optimisation tricks. JVMs and RDBMS typically can
> make use of hugepages already, for example.
>    

That just shows they're important enough for people to care.  What 
transparent hugepages does is remove the tradeoff between performance 
and flexibility that people have to make now, and also allow 
opportunistic speedup on apps that don't have a userbase large enough to 
care.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  8:28                                                 ` Nick Piggin
@ 2010-04-12  9:01                                                   ` Andrea Arcangeli
  2010-04-12  9:03                                                   ` Avi Kivity
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  9:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 06:28:44PM +1000, Nick Piggin wrote:
> If virtualization is the main worry (which it seems that it is
> seeing as your TLB misses cost like 6 times more cachelines),
> then complexity should be pushed into the hypervisor, not the
> core kernel.

It's not just about virtualization on the host, or I could have done
a much smaller patch without bothering so much about making something
as universal as possible, with COW and the rest.

Also, about virtualization, you forget that the CPU can establish 2M
tlb entries in the guest only if the guest and the host shadow
pagetables are both pmd_huge; if one of the two pmds isn't huge then
the guest virtual to host physical translation won't be the same for
all 512 4k pages (well, it might be if you're extremely lucky, but I
strongly doubt the CPU bothers to check whether the host pfns are
contiguous unless both the guest pmd and the shadow pmd are huge).

In other words we have to do something that is totally disconnected
from virtualization in order to take advantage of it to the maximum
extent with virt ;).

This allows us to leverage the KVM design compared to vmware and the
other inferior virtualization designs. We make gcc run 8% faster on a
cheap single socket workstation without virt, and we get an even
bigger cumulative boost in virtualized gcc without changing anything
at all in KVM. If this isn't the obvious best way to go, I don't know
what is! ;)

> And that involves auditing and rewriting anything that allocates
> and pins kernel memory. It's not only dentries.

All gup pins that aren't short lived have to use mmu notifiers; no
piece of the kernel is allowed to keep movable pages pinned for longer
than the time it takes to complete the DMA. That has to be fixed
anyway to provide all the other benefits to GRU and XPMEM now that VM
locks are switching to mutexes (and as usual to KVM too).
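
(A rough sketch of that pattern for a driver, assuming the current
mmu notifier callbacks; the callback body is a placeholder:)

#include <linux/mmu_notifier.h>

/* Sketch: pin pages only while DMA is in flight, and drop them when
 * the VM invalidates the range, instead of holding long-lived gup
 * pins that would block splitting or migrating the pages. */
static void drv_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	/* quiesce DMA and unpin any pages in [start, end) */
}

static const struct mmu_notifier_ops drv_mmu_ops = {
	.invalidate_range_start	= drv_invalidate_range_start,
};

/* registered against the target mm, e.g. at device open time:
 *
 *	static struct mmu_notifier drv_mn = { .ops = &drv_mmu_ops };
 *	mmu_notifier_register(&drv_mn, current->mm);
 */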


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  8:28                                                 ` Nick Piggin
  2010-04-12  9:01                                                   ` Andrea Arcangeli
@ 2010-04-12  9:03                                                   ` Avi Kivity
  2010-04-12  9:26                                                     ` Nick Piggin
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-12  9:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 11:28 AM, Nick Piggin wrote:
>
>> We use the "try" tactic extensively.  So long as there's a
>> reasonable chance of success, and a reasonable fallback on failure,
>> it's fine.
>>
>> Do you think we won't have reasonable success rates?  Why?
>>      
> After the memory is fragmented? It's more or less irreversible. So
> success rates (to fill a specific number of huge pages) will be fine
> up to a point. Then it will be a continual failure.
>    

So we get just a part of the win, not all of it.

> Sure, some workloads simply won't trigger fragmentation problems.
> Others will.
>    

Some workloads benefit from readahead.  Some don't.  In fact, readahead 
has a higher potential to reduce performance.

Same as with many other optimizations.

>> Why?  If you can isolate all the pointers into the dentry, allocate
>> the new dentry, make the old one point into the new one, hash it,
>> move the pointers, drop the old dentry.
>>
>> Difficult, yes, but insane?
>>      
> Yes.
>    

Well, I'll accept what you say since I'm nowhere near as familiar with 
the code.  But maybe someone insane will come along and do it.

>> Caches have statistical performance.  In the long run they average
>> out.  In the short run they can behave badly.  Same thing with large
>> pages, except the runs are longer and the wins are smaller.
>>      
> You don't understand. Caches don't suddenly or slowly stop working.
> For a particular pattern of workload, they statistically pretty much
> work the same all the time.
>    

Yet your effective cache size can be reduced by unhappy aliasing of 
physical pages in your working set.  It's unlikely but it can happen.

For a statistical mix of workloads, huge pages will also work just 
fine.  Perhaps not all of them, but most (those that don't fill _all_ of 
memory with dentries).

>> Databases are the easiest case, they allocate memory up front and
>> don't give it up.  We'll coalesce their memory immediately and
>> they'll run happily ever after.
>>      
> Again, you're thinking about a benchmark setup. If you've got various
> admin things, backups, scripts running, probably web servers,
> application servers etc., then it's not all that simple.
>    

These are all anonymous/pagecache loads, which we deal with well.

> And yes, Linux works pretty well for a multi-workload platform. You
> might be thinking too much about virtualization where you put things
> in sterile little boxes and take the performance hit.
>
>    

People do it for a reason.

>> Virtualization will fragment on overcommit, but the load is all
>> anonymous memory, so it's easy to defragment.  Very little dcache on
>> the host.
>>      
> If virtualization is the main worry (which it seems it is, seeing
> as your TLB misses cost something like 6 times more cachelines),
>    

(just 2x)

> then complexity should be pushed into the hypervisor, not the
> core kernel.
>    

The whole point behind kvm is to reuse the Linux core.  If we have to 
reimplement Linux memory management and scheduling, then it's a failure.

>> Well, I'm not against it, but that would be a much more intrusive
>> change than what this thread is about.  Also, you'd need 4K dentries
>> etc, no?
>>      
> No. You'd just be defragmenting 4K worth of dentries at a time.
> Dentries (and anything that doesn't care about untranslated KVA)
> are trivial. Zero change for users of the code.
>    

I see.

> This is going off-topic though, I don't want to hijack the thread
> with talk of nonlinear kernel.
>    

Too bad, it's interesting.

>> Mostly we need a way of identifying pointers into a data structure,
>> like rmap (after all that's what makes transparent hugepages work).
>>      
> And that involves auditing and rewriting anything that allocates
> and pins kernel memory. It's not only dentries.
>    

Not everything, just the major users that can scale with the amount of 
memory in the machine.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  9:03                                                   ` Avi Kivity
@ 2010-04-12  9:26                                                     ` Nick Piggin
  2010-04-12  9:39                                                       ` Andrea Arcangeli
  2010-04-12 10:02                                                       ` Avi Kivity
  0 siblings, 2 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12  9:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
> On 04/12/2010 11:28 AM, Nick Piggin wrote:
> >
> >>We use the "try" tactic extensively.  So long as there's a
> >>reasonable chance of success, and a reasonable fallback on failure,
> >>it's fine.
> >>
> >>Do you think we won't have reasonable success rates?  Why?
> >After the memory is fragmented? It's more or less irreversible. So
> >success rates (to fill a specific number of huge pages) will be fine
> >up to a point. Then it will be a continual failure.
> 
> So we get just a part of the win, not all of it.

It can degrade over time. This is the difference. Two identical workloads
may have performance X and Y depending on whether uptime is 1 day or 20
days.

 
> >Sure, some workloads simply won't trigger fragmentation problems.
> >Others will.
> 
> Some workloads benefit from readahead.  Some don't.  In fact,
> readahead has a higher potential to reduce performance.
> 
> Same as with many other optimizations.

Do you see any difference between your examples and this issue?

 
> >>Why?  If you can isolate all the pointers into the dentry, allocate
> >>the new dentry, make the old one point into the new one, hash it,
> >>move the pointers, drop the old dentry.
> >>
> >>Difficult, yes, but insane?
> >Yes.
> 
> Well, I'll accept what you say since I'm nowhere near as familiar
> with the code.  But maybe someone insane will come along and do it.

And it'll get nacked :) And it's not only dcache that can cause a
problem. This is part of the whole reason it is insane. It is insane
to only fix the dcache, because if you accept the dcache is a problem
that needs such complexity to fix, then you must accept the same for
the inode caches, the buffer head caches, vmas, radix tree nodes, files
etc., no?

 
> >>Caches have statistical performance.  In the long run they average
> >>out.  In the short run they can behave badly.  Same thing with large
> >>pages, except the runs are longer and the wins are smaller.
> >You don't understand. Caches don't suddenly or slowly stop working.
> >For a particular pattern of workload, they statistically pretty much
> >work the same all the time.
> 
> Yet your effective cache size can be reduced by unhappy aliasing of
> physical pages in your working set.  It's unlikely but it can
> happen.
> 
> For a statistical mix of workloads, huge pages will also work just
> fine.  Perhaps not all of them, but most (those that don't fill
> _all_ of memory with dentries).

Like I said, you don't need to fill all memory with dentries, you
just need to be allocating higher order kernel memory and end up
fragmenting your reclaimable pools.

And it's not a statistical mix that is the problem. The problem is
that the workloads that do cause fragmentation problems will run well
for 1 day or 5 days and then degrade. And it is impossible to know
what will degrade and what won't and by how much.

I'm not saying this is a showstopper, but it does really suck.


> >>Databases are the easiest case, they allocate memory up front and
> >>don't give it up.  We'll coalesce their memory immediately and
> >>they'll run happily ever after.
> >Again, you're thinking about a benchmark setup. If you've got various
> >admin things, backups, scripts running, probably web servers,
> >application servers etc., then it's not all that simple.
> 
> These are all anonymous/pagecache loads, which we deal with well.

Huh? They also involve sockets and files, and all of the above data
structures I listed and many more.

 
> >And yes, Linux works pretty well for a multi-workload platform. You
> >might be thinking too much about virtualization where you put things
> >in sterile little boxes and take the performance hit.
> >
> 
> People do it for a reason.

The reasoning is not always sound though. And people also do other
things, including increasingly better containers and workload
management in a single kernel.

 
> >>Virtualization will fragment on overcommit, but the load is all
> >>anonymous memory, so it's easy to defragment.  Very little dcache on
> >>the host.
> >If virtualization is the main worry (which it seems it is, seeing
> >as your TLB misses cost something like 6 times more cachelines),
> 
> (just 2x)
> 
> >then complexity should be pushed into the hypervisor, not the
> >core kernel.
> 
> The whole point behind kvm is to reuse the Linux core.  If we have
> to reimplement Linux memory management and scheduling, then it's a
> failure.

And if you need to add complexity to the Linux core for it, it's
also a failure.

I'm not saying to reimplement things, but if you had a little bit
more support perhaps. Anyway it's just ideas, I'm not saying that
transparent hugepages is wrong simply because KVM is a big user and it
could be implemented in another way.

But if it is possible for KVM to use libhugetlb with just a bit of
support from the kernel, then it goes some way to reducing the
need for transparent hugepages.

 
> >>Well, I'm not against it, but that would be a much more intrusive
> >>change than what this thread is about.  Also, you'd need 4K dentries
> >>etc, no?
> >No. You'd just be defragmenting 4K worth of dentries at a time.
> >Dentries (and anything that doesn't care about untranslated KVA)
> >are trivial. Zero change for users of the code.
> 
> I see.
> 
> >This is going off-topic though, I don't want to hijack the thread
> >with talk of nonlinear kernel.
> 
> Too bad, it's interesting.

It sure is, we can start another thread.

 
> >>Mostly we need a way of identifying pointers into a data structure,
> >>like rmap (after all that's what makes transparent hugepages work).
> >And that involves auditing and rewriting anything that allocates
> >and pins kernel memory. It's not only dentries.
> 
> Not everything, just the major users that can scale with the amount
> of memory in the machine.

Well you need to audit, to determine if it is going to be a problem or
not, and it is more than only dentries. (but even dentries would be a
nightmare considering how widely they're used and how much they're
passed around the vfs and filesystems).


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  9:26                                                     ` Nick Piggin
@ 2010-04-12  9:39                                                       ` Andrea Arcangeli
  2010-04-12 10:02                                                       ` Avi Kivity
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12  9:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Avi Kivity, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 07:26:15PM +1000, Nick Piggin wrote:
> But if it is possible for KVM to use libhugetlb with just a bit of
> support from the kernel, then it goes some way to reducing the
> need for transparent hugepages.

KVM has had full hugetlbfs support for a long time. There are some
people using it, and it remains a must-have for 1G pages, but it's not
manageable that way in the cloud. It's ok only for a special instance.
Right now all my VMs are running on hugepages by default without
changing a single bit (with a few-liner patch to qemu to add an
alignment, because the gfn bits in the range
HPAGE_PMD_SHIFT..PAGE_SHIFT have to match the host pfn bits for NPT
shadows to go pmd_huge). For qemu itself to run on hugepages not even
the alignment is needed (but it's better to align there too, to be
sure the guest kernel lives in hugepages, as it's usually mapped in
the first mbytes).

This is the single change I had to apply to KVM for it to take
advantage of transparent hugepages because it was already working fine
with hugetlbfs:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=d249c189870896b3f275987b70702d2b8c7705d4
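
(Roughly, the shape of that alignment change, as an illustrative
sketch and not the actual commit: over-align the guest RAM block to
the 2M huge page size so guest-physical and host-virtual addresses
agree in the bits below HPAGE_PMD_SHIFT:)

#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)

/* Sketch: 2M-align guest RAM so the low gfn bits match the host
 * address bits, letting NPT/EPT shadow pmds go huge; the madvise()
 * hint is the one added by this patchset. */
static void *alloc_guest_ram(size_t size)
{
	void *p;

	if (posix_memalign(&p, HPAGE_SIZE, size))
		return NULL;
	madvise(p, size, MADV_HUGEPAGE);
	return p;
}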


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  9:26                                                     ` Nick Piggin
  2010-04-12  9:39                                                       ` Andrea Arcangeli
@ 2010-04-12 10:02                                                       ` Avi Kivity
  2010-04-12 10:08                                                         ` Andrea Arcangeli
                                                                           ` (2 more replies)
  1 sibling, 3 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 10:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 12:26 PM, Nick Piggin wrote:
> On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
>    
>> On 04/12/2010 11:28 AM, Nick Piggin wrote:
>>      
>>>        
>>>> We use the "try" tactic extensively.  So long as there's a
>>>> reasonable chance of success, and a reasonable fallback on failure,
>>>> it's fine.
>>>>
>>>> Do you think we won't have reasonable success rates?  Why?
>>>>          
>>> After the memory is fragmented? It's more or less irreversible. So
>>> success rates (to fill a specific number of huge pages) will be fine
>>> up to a point. Then it will be a continual failure.
>>>        
>> So we get just a part of the win, not all of it.
>>      
> It can degrade over time. This is the difference. Two identical workloads
> may have performance X and Y depending on whether uptime is 1 day or 20
> days.
>    

I don't see why it will degrade.  Antifrag will prefer to allocate 
dcache near existing dcache.

The only scenario I can see where it degrades is that you have a dcache 
load that spills over to all of memory, then falls back leaving a pinned 
page in every huge frame.  It can happen, but I don't see it as a likely 
scenario.  But maybe I'm missing something.

>>> Sure, some workloads simply won't trigger fragmentation problems.
>>> Others will.
>>>        
>> Some workloads benefit from readahead.  Some don't.  In fact,
>> readahead has a higher potential to reduce performance.
>>
>> Same as with many other optimizations.
>>      
> Do you see any difference between your examples and this issue?
>    

Memory layout is more persistent.  Well, disk layout is even more 
persistent.  Still we do extents, and if our disk is fragmented, we take 
the hit.

>> Well, I'll accept what you say since I'm nowhere near as familiar
>> with the code.  But maybe someone insane will come along and do it.
>>      
> And it'll get nacked :) And it's not only dcache that can cause a
> problem. This is part of the whole reason it is insane. It is insane
> to only fix the dcache, because if you accept the dcache is a problem
> that needs such complexity to fix, then you must accept the same for
> the inode caches, the buffer head caches, vmas, radix tree nodes, files
> etc., no?
>    

inodes come with dcache, yes.  I thought buffer heads are now a much 
smaller load.  vmas usually don't scale up with memory.  If you have a 
lot of radix tree nodes, then you also have a lot of pagecache, so the 
radix tree nodes can be contained.  Open files also don't scale with memory.

>> Yet your effective cache size can be reduced by unhappy aliasing of
>> physical pages in your working set.  It's unlikely but it can
>> happen.
>>
>> For a statistical mix of workloads, huge pages will also work just
>> fine.  Perhaps not all of them, but most (those that don't fill
>> _all_ of memory with dentries).
>>      
> Like I said, you don't need to fill all memory with dentries, you
> just need to be allocating higher order kernel memory and end up
> fragmenting your reclaimable pools.
>    

Allocate those higher order pages from the same huge frame.

> And it's not a statistical mix that is the problem. The problem is
> that the workloads that do cause fragmentation problems will run well
> for 1 day or 5 days and then degrade. And it is impossible to know
> what will degrade and what won't and by how much.
>
> I'm not saying this is a showstopper, but it does really suck.
>
>    

Can you suggest a real life test workload so we can investigate it?

>> These are all anonymous/pagecache loads, which we deal with well.
>>      
> Huh? They also involve sockets, files, and involve all of the above
> data structures I listed and many more.
>    

A few thousand sockets and open files is chickenfeed for a server.  
They'll kill a few huge frames but won't significantly affect the rest 
of memory.

>
>    
>>> And yes, Linux works pretty well for a multi-workload platform. You
>>> might be thinking too much about virtualization where you put things
>>> in sterile little boxes and take the performance hit.
>>>
>>>        
>> People do it for a reason.
>>      
> The reasoning is not always sound though. And people also do other
> things, including increasingly better containers and workload
> management in a single kernel.
>    

Containers are wonderful but still a future thing, and even when fully 
implemented they still don't offer the same isolation as 
virtualization.  For example, the owner of workload A might want to 
upgrade the kernel to fix a bug he's hitting, while the owner of 
workload B needs three months to test it.

>> The whole point behind kvm is to reuse the Linux core.  If we have
>> to reimplement Linux memory management and scheduling, then it's a
>> failure.
>>      
> And if you need to add complexity to the Linux core for it, it's
> also a failure.
>    

Well, we need to add complexity, and we already have.  If the
acceptance criterion for a feature were 'no new complexity', then the
kernel would be a lot smaller than it is now.

Everything has to be evaluated on the basis of its generality, the 
benefit, the importance of the subsystem that needs it, and impact on 
the code.  Huge pages are already used in server loads so they're not 
specific to kvm.  The benefit, 5-15%, is significant.  You and Linus 
might not be interested in virtualization, but a significant and growing 
fraction of hosts are virtualized, it's up to us if they run Linux or 
something else.  And I trust Andrea and the reviewers here to keep the 
code impact sane.


> I'm not saying to reimplement things, but if you had a little bit
> more support perhaps. Anyway it's just ideas, I'm not saying that
> transparent hugepages is wrong simply because KVM is a big user and it
> could be implemented in another way.
>    

What do you mean by 'more support'?

> But if it is possible for KVM to use libhugetlb with just a bit of
> support from the kernel, then it goes some way to reducing the
> need for transparent hugepages.
>    

kvm already works with hugetlbfs.  But it's brittle; it means we have to 
choose between performance and overcommit.

>> Not everything, just the major users that can scale with the amount
>> of memory in the machine.
>>      
> Well you need to audit, to determine if it is going to be a problem or
> not, and it is more than only dentries. (but even dentries would be a
> nightmare considering how widely they're used and how much they're
> passed around the vfs and filesystems).
>    

pages are passed around everywhere as well.  When something is locked or 
its reference count doesn't match the reachable pointer count, you give 
up.  Only a small number of objects are in active use at any one time.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:02                                                       ` Avi Kivity
@ 2010-04-12 10:08                                                         ` Andrea Arcangeli
  2010-04-12 10:10                                                           ` Avi Kivity
  2010-04-12 10:37                                                         ` Nick Piggin
  2010-04-13  0:38                                                         ` Andrew Morton
  2 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 10:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote:
> The only scenario I can see where it degrades is that you have a dcache 
> load that spills over to all of memory, then falls back leaving a pinned 
> page in every huge frame.  It can happen, but I don't see it as a likely 
> scenario.  But maybe I'm missing something.

And in my understanding this is exactly the scenario that kernelcore=
(which caps how much memory unmovable kernel allocations can take)
should prevent from ever materializing. Providing math guarantees
without kernelcore= is probably futile.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:08                                                         ` Andrea Arcangeli
@ 2010-04-12 10:10                                                           ` Avi Kivity
  2010-04-12 10:23                                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 10:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 01:08 PM, Andrea Arcangeli wrote:
> On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote:
>    
>> The only scenario I can see where it degrades is that you have a dcache
>> load that spills over to all of memory, then falls back leaving a pinned
>> page in every huge frame.  It can happen, but I don't see it as a likely
>> scenario.  But maybe I'm missing something.
>>      
> And in my understanding this is exactly the scenario that kernelcore=
> should prevent from ever materializing. Providing math guarantees
> without kernelcore= is probably futile.
>    

Well, that forces the user to make a different boot-time tradeoff.  It's 
unsatisfying.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:10                                                           ` Avi Kivity
@ 2010-04-12 10:23                                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 10:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 01:10:46PM +0300, Avi Kivity wrote:
> On 04/12/2010 01:08 PM, Andrea Arcangeli wrote:
> > On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote:
> >    
> >> The only scenario I can see where it degrades is that you have a dcache
> >> load that spills over to all of memory, then falls back leaving a pinned
> >> page in every huge frame.  It can happen, but I don't see it as a likely
> >> scenario.  But maybe I'm missing something.
> >>      
> > And in my understanding this is exactly the scenario that kernelcore=
> > should prevent from ever materializing. Providing math guarantees
> > without kernelcore= is probably futile.
> >    
> 
> Well, that forces the user to make a different boot-time tradeoff.  It's 
> unsatisfying.

Well, this is just about the math guarantee, like disabling memory
overcommit to get a better guarantee of not running into the oom
killer... most people won't need this, but it can address the math
concerns. I think it's enough if people want a guarantee, and it won't
require using a nonlinear mapping for the kernel.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  8:07                                                 ` Ingo Molnar
  2010-04-12  8:21                                                   ` Andrea Arcangeli
@ 2010-04-12 10:27                                                   ` Mel Gorman
  1 sibling, 0 replies; 205+ messages in thread
From: Mel Gorman @ 2010-04-12 10:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Nick Piggin, Andrea Arcangeli, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 10:07:48AM +0200, Ingo Molnar wrote:
> 
> <SNIP>
> 
> [*] Note, it would be even better if the kernel provided the C library [a.k.a. 
>     klibc] and if hugetlbs could be utilized via malloc() et al more 

hugectl --heap 

does this. It uses the __morecore hook in glibc to back malloc with
files on hugetlbfs. There is also a programming API with some basic
usage at http://www.csn.ul.ie/~mel/docs/stream-api/
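
As a rough illustration of the trick (this is not the libhugetlbfs
source; the file setup, alignment and error handling are simplified
assumptions):

#include <malloc.h>
#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)

static int huge_fd;		/* an open file on a hugetlbfs mount */
static char *heap_base;		/* free virtual range picked at startup */
static long heap_used;		/* bytes already handed out to malloc */

/* glibc malloc calls this instead of sbrk() to grow its heap */
static void *hugetlbfs_morecore(ptrdiff_t increment)
{
	long delta = (increment + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	void *p;

	if (increment <= 0)		/* probe/shrink: keep it trivial */
		return heap_base + heap_used;

	/* extend the heap contiguously, backed by the hugetlbfs file */
	p = mmap(heap_base + heap_used, delta, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, huge_fd, heap_used);
	if (p == MAP_FAILED)
		return NULL;		/* malloc treats this as OOM */

	p = heap_base + heap_used;
	heap_used += delta;
	return p;
}

/* in a constructor: __morecore = hugetlbfs_morecore; */

The heap then lives on huge pages without the binary changing; the
preload library also has to keep malloc from bypassing the heap via
plain mmap() for very large allocations.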

The differences between distributions will hopefully be ironed out by
replacing custom scripts with calls to hugeadm to do the bulk of the
configuration work - e.g. creating mount points and setting permissions. 

There is no need to be creating a new user-space library in the kernel
repo.

>     transparently by us changing the user-space library in the kernel repo and 
>     deploying it to apps via a new kernel that provides an updated C library. 
>     We don't do that so we are stuck with crappier solutions and slower 
>     propagation of changes.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:02                                                       ` Avi Kivity
  2010-04-12 10:08                                                         ` Andrea Arcangeli
@ 2010-04-12 10:37                                                         ` Nick Piggin
  2010-04-12 10:59                                                           ` Avi Kivity
  2010-04-13  0:38                                                         ` Andrew Morton
  2 siblings, 1 reply; 205+ messages in thread
From: Nick Piggin @ 2010-04-12 10:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 01:02:34PM +0300, Avi Kivity wrote:
> On 04/12/2010 12:26 PM, Nick Piggin wrote:
> >On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
> >>On 04/12/2010 11:28 AM, Nick Piggin wrote:
> >>>>We use the "try" tactic extensively.  So long as there's a
> >>>>reasonable chance of success, and a reasonable fallback on failure,
> >>>>it's fine.
> >>>>
> >>>>Do you think we won't have reasonable success rates?  Why?
> >>>After the memory is fragmented? It's more or less irreversible. So
> >>>success rates (to fill a specific number of huge pages) will be fine
> >>>up to a point. Then it will be a continual failure.
> >>So we get just a part of the win, not all of it.
> >It can degrade over time. This is the difference. Two identical workloads
> >may have performance X and Y depending on whether uptime is 1 day or 20
> >days.
> 
> I don't see why it will degrade.  Antifrag will prefer to allocate
> dcache near existing dcache.
> 
> The only scenario I can see where it degrades is that you have a
> dcache load that spills over to all of memory, then falls back
> leaving a pinned page in every huge frame.  It can happen, but I
> don't see it as a likely scenario.  But maybe I'm missing something.

No, it doesn't need to make all hugepages unavailable in order to
start degrading. The moment that fewer huge pages are available than
can be used, due to fragmentation, is when you could start seeing
degradation.

If you're using higher order allocations in the kernel, like SLUB
will especially (and SLAB will for some things), then the requirement
for fragmentation basically gets smaller by, I think, about the same
factor as the page size. So order-2 slabs only need to fill 1/4 of
memory in order to be able to fragment entire memory. But fragmenting
entire memory is not the start of the degradation, it is the end.
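
To put a rough number on the worst case (with 4k base pages and 2M
huge pages): one pinned 4k page is enough to block a whole 2M frame,
so pinned allocations covering just 4k/2M = 1/512th of memory can, if
badly placed, fragment all of it.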

 
> >>>Sure, some workloads simply won't trigger fragmentation problems.
> >>>Others will.
> >>Some workloads benefit from readahead.  Some don't.  In fact,
> >>readahead has a higher potential to reduce performance.
> >>
> >>Same as with many other optimizations.
> >Do you see any difference with your examples and this issue?
> 
> Memory layout is more persistent.  Well, disk layout is even more
> persistent.  Still we do extents, and if our disk is fragmented, we
> take the hit.

Sure, and that's not a good thing either.

 
> >>Well, I'll accept what you say since I'm nowhere near as familiar
> >>with the code.  But maybe someone insane will come along and do it.
> >And it'll get nacked :) And it's not only dcache that can cause a
> >problem. This is part of the whole reason it is insane. It is insane
> >to only fix the dcache, because if you accept the dcache is a problem
> >that needs such complexity to fix, then you must accept the same for
> >the inode caches, the buffer head caches, vmas, radix tree nodes, files
> >etc. no?
> 
> inodes come with dcache, yes.  I thought buffer heads are now a much
> smaller load.  vmas usually don't scale up with memory.  If you have
> a lot of radix tree nodes, then you also have a lot of pagecache, so
> the radix tree nodes can be contained.  Open files also don't scale
> with memory.

See above; we don't need to fill all memory, especially with higher
order allocations.

Definitely some workloads that never use much kernel memory will
probably not see fragmentation problems.

 
> >>Yet your effective cache size can be reduced by unhappy aliasing of
> >>physical pages in your working set.  It's unlikely but it can
> >>happen.
> >>
> >>For a statistical mix of workloads, huge pages will also work just
> >>fine.  Perhaps not all of them, but most (those that don't fill
> >>_all_ of memory with dentries).
> >Like I said, you don't need to fill all memory with dentries, you
> >just need to be allocating higher order kernel memory and end up
> >fragmenting your reclaimable pools.
> 
> Allocate those higher order pages from the same huge frame.

We don't keep different pools of different frame sizes around
to allocate different object sizes in. That would get even weirder
than the existing anti-frag stuff with overflow and fallback rules.

 
> >And it's not a statistical mix that is the problem. The problem is
> >that the workloads that do cause fragmentation problems will run well
> >for 1 day or 5 days and then degrade. And it is impossible to know
> >what will degrade and what won't and by how much.
> >
> >I'm not saying this is a showstopper, but it does really suck.
> >
> 
> Can you suggest a real life test workload so we can investigate it?
> 
> >>These are all anonymous/pagecache loads, which we deal with well.
> >Huh? They also involve sockets, files, and involve all of the above
> >data structures I listed and many more.
> 
> A few thousand sockets and open files is chickenfeed for a server.
> They'll kill a few huge frames but won't significantly affect the
> rest of memory.

Lots of small files is very common for a web server for example.


> >>>And yes, Linux works pretty well for a multi-workload platform. You
> >>>might be thinking too much about virtualization where you put things
> >>>in sterile little boxes and take the performance hit.
> >>>
> >>People do it for a reason.
> >The reasoning is not always sound though. And also people do other
> >things. Including increasingly better containers and workload
> >management in the single kernel.
> 
> Containers are wonderful but still a future thing, and even when
> fully implemented they still don't offer the same isolation as
> virtualization.  For example, the owner of workload A might want to
> upgrade the kernel to fix a bug he's hitting, while the owner of
> workload B needs three months to test it.

But better for performance in general.

 
> >>The whole point behind kvm is to reuse the Linux core.  If we have
> >>to reimplement Linux memory management and scheduling, then it's a
> >>failure.
> >And if you need to add complexity to the Linux core for it, it's
> >also a failure.
> 
> Well, we need to add complexity, and we already have.  If the
> acceptance criteria for a feature would be 'no new complexity', then
> the kernel would be a lot smaller than it is now.
> 
> Everything has to be evaluated on the basis of its generality, the
> benefit, the importance of the subsystem that needs it, and impact
> on the code.  Huge pages are already used in server loads so they're
> not specific to kvm.  The benefit, 5-15%, is significant.  You and
> Linus might not be interested in virtualization, but a significant
> and growing fraction of hosts are virtualized, it's up to us if they
> run Linux or something else.  And I trust Andrea and the reviewers
> here to keep the code impact sane.

I'm being realistic. Sure, I know it is just to be evaluated based
on gains, complexity, alternatives, etc.

When I hear arguments like we must do this because the memory-to-cache
ratio has gotten 100 times worse and ergo we're on the brink of
catastrophe, that's when things get silly.


> >I'm not saying to reimplement things, but if you had a little bit
> >more support perhaps. Anyway it's just ideas, I'm not saying that
> >transparent hugepages is wrong simply because KVM is a big user and it
> >could be implemented in another way.
> 
> What do you mean by 'more support'?
> 
> >But if it is possible for KVM to use libhugetlbfs with just a bit of
> >support from the kernel, then it goes some way to reducing the
> >need for transparent hugepages.
> 
> kvm already works with hugetlbfs.  But it's brittle, it means we
> have to choose between performance and overcommit.

Overcommit because it doesn't work with swapping? Or something more?


> >>Not everything, just the major users that can scale with the amount
> >>of memory in the machine.
> >Well you need to audit, to determine if it is going to be a problem or
> >not, and it is more than only dentries. (but even dentries would be a
> >nightmare considering how widely they're used and how much they're
> >passed around the vfs and filesystems).
> 
> pages are passed around everywhere as well.  When something is
> locked or its reference count doesn't match the reachable pointer
> count, you give up.  Only a small number of objects are in active
> use at any one time.

Easier said than done, I suspect.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  8:06                                               ` Andrea Arcangeli
@ 2010-04-12 10:44                                                 ` Mel Gorman
  2010-04-12 11:12                                                   ` Avi Kivity
  2010-04-12 13:17                                                   ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Mel Gorman @ 2010-04-12 10:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Ingo Molnar, Avi Kivity, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 10:06:26AM +0200, Andrea Arcangeli wrote:
> On Mon, Apr 12, 2010 at 05:21:44PM +1000, Nick Piggin wrote:
> > On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote:
> > > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote:
> > > > One problem is that you need to keep a lot more memory free in order
> > > > for it to be reasonably effective. Another thing is that the problem
> > > > of fragmentation breakdown is not just a one-shot event that fills
> > > > memory with pinned objects. It is a slow degredation.
> > > 
> > > set_recommended_min_free_kbytes seems to not be in function of ram
> > > size, 60MB aren't such a big deal.
> > > 
> > > > Especially when you use something like SLUB as the memory allocator
> > > > which requires higher order allocations for objects which are pinned
> > > > in kernel memory.
> > > > 
> > > > Just running a few minutes of testing with a kernel compile in the
> > > > background does not show the full picture. You really need a box that
> > > > has been up for days running a proper workload before you are likely
> > > > to see any breakdown.
> > > > 
> > > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> > > > get slower after X days of uptime. It's better to have consistent
> > > > performance really, for anything except pure benchmark setups.
> > > 
> > > All data I provided is very real, in addition to building a ton of
> > > packages and running emerge on /usr/portage I've been running all my
> > real loads. The only problem is I ran it for just a day and a half,
> > but the load I kept it under was significant (surely a much bigger
> > inode/dentry load than any hypervisor usage would ever generate).
> > 
> > OK, but as a solution for the kind of very specific and highly
> > optimized applications like RDBMS, HPC, hypervisors or the JVM,
> > they could just be using hugepages themselves, couldn't they?
> >
> > It seems more interesting as a more general speedup for applications
> > that can't afford such optimizations? (eg. the common case for
> > most people)
> 
> The reality is that very few are using hugetlbfs. I guess maybe 0.1%
> of KVM instances on Phenom/Nehalem chips are running on hugetlbfs for
> example (hugetlbfs boot reservation doesn't fit the cloud where you
> need all ram available in hugetlbfs and you still need 100% of unused
> ram as host pagecache for VDI),

As a side-note, this is what dynamic hugepage pool resizing was for.

hugeadm --pool-pages-max <size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>

The hugepage pool grows and shrinks as required if the system is able to
allocate the huge pages. If the huge pages are not available, mmap()
fails and userspace is expected to recover by retrying the allocation with
small pages (something libhugetlbfs does automatically).
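
(A concrete invocation would be something like "hugeadm
--pool-pages-max 2MB:512".) In code, the fallback pattern looks
roughly like this; the mount point is an assumption and len must be a
multiple of the huge page size for the hugetlbfs case:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_region(size_t len)
{
	void *p = MAP_FAILED;
	int fd = open("/mnt/huge/region", O_CREAT | O_RDWR, 0600);

	if (fd >= 0) {
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
		close(fd);	/* the mapping keeps the file alive */
	}
	if (p == MAP_FAILED)	/* pool exhausted: retry with small pages */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}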

In the virtualisation context, the greater problem with such an approach
is that no overcommit is possible. I am given to understand that this is a
major problem because hosts of virtual machines are often overcommitted
on the assumption they don't all peak at the same time.

> even though it would provide a >=6% boost
> to all VMs no matter what's running in the guest. Same goes for the
> JVM, maybe 0.1% of those run on hugetlbfs. The commercial DBMS are
> the exception and they're probably closer to 99% running on hugetlbfs
> (and they have to keep using hugetlbfs until we move transparent
> hugepages into tmpfs). But as
> 

The DBMS documentation often appears to put a greater emphasis on huge
page tuning than the applications that depend on the JVM. 

> So there's a ton of wasted energy in my view. Like Ingo said, the
> faster they make the chips and the cheaper the RAM becomes, the more
> wasted energy as a result of not using hugetlbfs. There's always more
> difference between cache sizes and ram sizes and also more difference
> between cache speeds and ram speeds. I don't see this trend ending and
> I'm not sure what is the better CPU that will make hugetlbfs worthless
> and unselectable at kernel configure time on x86 arch (if you build
> without generic).
> 
> And I don't think it's feasible to ship a distro where 99% of apps
> that can benefit from hugepages are running with
> LD_PRELOAD=libhugetlbfs.so. It has to be transparent if we want to
> stop the waste.
> 

I don't see such a thing happening. Huge pages on hugetlbfs do not swap,
so that would be like calling mlock aggressively.

> The main reason I've always been skeptical about transparent hugepages
> before I started working on this is the mess they generate across the
> whole kernel. So my priority of course has been to keep it as self
> contained as possible. It kept spilling over and over until I
> managed to confine it to anonymous pages and fix whole mm/*.c files
> with just a one liner (even the hugepage aware implementation that
> Johannes did still takes advantage of split_huge_page_pmd if the
> mprotect start/end isn't 2M naturally aligned, just to show how
> complex it would be to do it all at once). This will allow us to reach
> a solid base, and then later move to tmpfs and maybe later to
> pagecache and swapcache too. Expecting the whole kernel to become
> hugepage aware at once is a total mess; gup would need to return only
> head pages for example, breaking hundreds of drivers in just that
> change. The compound_lock can be removed after you fix all those
> hundreds of drivers and subsystems using gup... No big deal to remove
> it later, kind of like removing the big kernel lock these days, 14
> years after it was introduced.
> 
> Plus I did all I could to try to keep it as black and white as
> possible. I think other OSes are more gray in their approaches; my
> priority has been to pay for RAM anywhere I could if you set
> enabled=always, and to decrease as much as I could any risk of
> performance regressions in any workload. These days we can afford to
> lose 1G without much worry if it speeds up the workload 8%, so I think
> the other designs are better suited to old, RAM-constrained hardware
> and not very relevant today. On embedded with my patchset one should set
> enabled=madvise. Ingo suggested a per-process tweak to enable it
> selectively on certain apps, that is feasible too in the future (so
> people won't be forced to modify binaries to add madvise if they can't
> leave enabled=always).
> 
> > Yes we do have the option to reserve pages and as far as I know it
> > should work, although I can't remember whether it deals with mlock.
> 
> I think that is the right route to take for whoever needs the
> math guarantees, and for many products it won't even be noticeable to
> enforce the math guarantee. It's kind of like overcommit: somebody
> prefers the overcommit_memory=2 version and maybe they don't even
> notice it allows them to allocate less memory. Others prefer to be
> able to allocate ram without accounting for the unused virtual
> regions despite the bigger chance of running into the oom killer
> (and I'm in the latter camp for both the overcommit sysctl and
> kernelcore= ;).
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:37                                                         ` Nick Piggin
@ 2010-04-12 10:59                                                           ` Avi Kivity
  2010-04-12 12:23                                                             ` Avi Kivity
  2010-04-12 13:25                                                             ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 10:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 01:37 PM, Nick Piggin wrote:
>
>> I don't see why it will degrade.  Antifrag will prefer to allocate
>> dcache near existing dcache.
>>
>> The only scenario I can see where it degrades is that you have a
>> dcache load that spills over to all of memory, then falls back
>> leaving a pinned page in every huge frame.  It can happen, but I
>> don't see it as a likely scenario.  But maybe I'm missing something.
>>      
> No, it doesn't need to make all hugepages unavailable in order to
> start degrading. The moment that fewer huge pages are available than
> can be used, due to fragmentation, is when you could start seeing
> degradation.
>    

Graceful degradation is fine.  We're degrading to the current situation 
here, not something worse.

> If you're using higher order allocations in the kernel, like SLUB
> will especially (and SLAB will for some things), then the requirement
> for fragmentation basically gets smaller by, I think, about the same
> factor as the page size. So order-2 slabs only need to fill 1/4 of
> memory in order to be able to fragment entire memory. But fragmenting
> entire memory is not the start of the degradation, it is the end.
>    

Those order-2 slabs should be allocated in the same huge page frame.  If 
they're allocated randomly, sure, you need just 1 allocation per huge page 
frame to pin it.  If you're filling up huge page frames, things look a lot 
better.

>
>    
>>>>> Sure, some workloads simply won't trigger fragmentation problems.
>>>>> Others will.
>>>>>            
>>>> Some workloads benefit from readahead.  Some don't.  In fact,
>>>> readahead has a higher potential to reduce performance.
>>>>
>>>> Same as with many other optimizations.
>>>>          
>>> Do you see any difference with your examples and this issue?
>>>        
>> Memory layout is more persistent.  Well, disk layout is even more
>> persistent.  Still we do extents, and if our disk is fragmented, we
>> take the hit.
>>      
> Sure, and that's not a good thing either.
>    

And yet we have lived with it for decades, and we use more or less the 
same techniques to avoid it.


>> inodes come with dcache, yes.  I thought buffer heads are now a much
>> smaller load.  vmas usually don't scale up with memory.  If you have
>> a lot of radix tree nodes, then you also have a lot of pagecache, so
>> the radix tree nodes can be contained.  Open files also don't scale
>> with memory.
>>      
> See above; we don't need to fill all memory, especially with higher
> order allocations.
>    

Not if you allocate carefully.

> Definitely some workloads that never use much kernel memory will
> probably not see fragmentation problems.
>
>    

Right; and on a 16-64GB machine you'll have a hard time filling kernel 
memory with objects.

>>> Like I said, you don't need to fill all memory with dentries, you
>>> just need to be allocating higher order kernel memory and end up
>>> fragmenting your reclaimable pools.
>>>        
>> Allocate those higher order pages from the same huge frame.
>>      
> We don't keep different pools of different frame sizes around
> to allocate different object sizes in. That would get even weirder
> than the existing anti-frag stuff with overflow and fallback rules.
>    

Maybe we should, once we start to use a lot of such objects.

Once you have 10MB worth of inodes, you don't lose anything by 
allocating their slabs from 2MB units.

>> A few thousand sockets and open files is chickenfeed for a server.
>> They'll kill a few huge frames but won't significantly affect the
>> rest of memory.
>>      
> Lots of small files is very common for a web server for example.
>    

10k files? 100k files?  how many open at once?

Even 1M files is ~1GB, not touching our 64GB server.

Most content is dynamic these days anyway.

>> Containers are wonderful but still a future thing, and even when
>> fully implemented they still don't offer the same isolation as
>> virtualization.  For example, the owner of workload A might want to
>> upgrade the kernel to fix a bug he's hitting, while the owner of
>> workload B needs three months to test it.
>>      
> But better for performance in general.
>
>    

True.  But virtualization has the advantage of actually being there.

Note that kvm is also benefiting from containers to improve resource 
isolation.

>> Everything has to be evaluated on the basis of its generality, the
>> benefit, the importance of the subsystem that needs it, and impact
>> on the code.  Huge pages are already used in server loads so they're
>> not specific to kvm.  The benefit, 5-15%, is significant.  You and
>> Linus might not be interested in virtualization, but a significant
>> and growing fraction of hosts are virtualized, it's up to us if they
>> run Linux or something else.  And I trust Andrea and the reviewers
>> here to keep the code impact sane.
>>      
> I'm being realistic. Sure, I know it is just to be evaluated based
> on gains, complexity, alternatives, etc.
>
> When I hear arguments like we must do this because the memory-to-cache
> ratio has gotten 100 times worse and ergo we're on the brink of
> catastrophe, that's when things get silly.
>    

That wasn't me.  It's 5-15%, not earth shattering, but significant.  
Especially when we hear things like 1% performance regression per kernel 
release on average.

And it's true that the gain will grow as machines grow.

>>> But if it is possible for KVM to use libhugetlbfs with just a bit of
>>> support from the kernel, then it goes some way to reducing the
>>> need for transparent hugepages.
>>>        
>> kvm already works with hugetlbfs.  But it's brittle, it means we
>> have to choose between performance and overcommit.
>>      
> Overcommit because it doesn't work with swapping? Or something more?
>    

kvm overcommit uses ballooning, page merging, and swapping.  None of 
these work well with large pages (well, ballooning might).

>> pages are passed around everywhere as well.  When something is
>> locked or its reference count doesn't match the reachable pointer
>> count, you give up.  Only a small number of objects are in active
>> use at any one time.
>>      
> Easier said than done, I suspect.
>    

No doubt it's very tricky code.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:44                                                 ` Mel Gorman
@ 2010-04-12 11:12                                                   ` Avi Kivity
  2010-04-12 13:17                                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 11:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Nick Piggin, Ingo Molnar, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 01:44 PM, Mel Gorman wrote:
> I don't see such a thing happening. Huge pages on hugetlbfs do not swap and
> would be like calling mlock aggressively.
>    

Yes, we keep talking about defragmentation, but the nice thing about 
transparent huge pages is the ability to fragment when needed.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: hugepages will matter more in the future
  2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
  2010-04-11 12:01                                                     ` Avi Kivity
  2010-04-11 15:22                                                     ` Linus Torvalds
@ 2010-04-12 11:22                                                     ` Arjan van de Ven
  2010-04-12 11:29                                                       ` Avi Kivity
  2010-04-12 13:30                                                       ` Andrea Arcangeli
  2 siblings, 2 replies; 205+ messages in thread
From: Arjan van de Ven @ 2010-04-12 11:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sun, 11 Apr 2010 13:52:29 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> Also, the proportion of 4K:2MB is a fixed constant, and CPUs don't
> grow their TLB caches as much as typical RAM size grows: they'll grow
> it according to the _mean_ working set size - while the 'max' working
> set gets larger and larger due to the increasing [proportional] gap
> to RAM size.

> This is why i think we should think about hugetlb support today and
> this is why i think we should consider elevating hugetlbs to the next
> level of built-in Linux VM support.


I respectfully disagree with your analysis.
While it is true that the number of "level 1" tlb entries has not kept
up with ram or application size, the CPU designers have made it so that
there effectively is a "level 2" (or technically, level 3) in the cache.

A tlb miss from cache is so cheap that in almost all cases (you can
cheat it by using only 1 byte per page, walking randomly through memory
and having a strict ordering between those 1 byte accesses) it is
hidden in the out of order engine.

So in practice, for many apps, as long as the CPU cache scales with
application size the TLB more or less scales too.

Now hugepages have some interesting other advantages, namely they save
pagetable memory... which for something like TPC-C on a fork-based
database can be a measurable win.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: hugepages will matter more in the future
  2010-04-12 11:22                                                     ` Arjan van de Ven
@ 2010-04-12 11:29                                                       ` Avi Kivity
  2010-04-17 15:12                                                         ` Arjan van de Ven
  2010-04-12 13:30                                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 11:29 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/12/2010 02:22 PM, Arjan van de Ven wrote:
>
>> This is why i think we should think about hugetlb support today and
>> this is why i think we should consider elevating hugetlbs to the next
>> level of built-in Linux VM support.
>>      
>
> I respectfully disagree with your analysis.
> While it is true that the number of "level 1" tlb entries has not kept
> up with ram or application size, the CPU designers have made it so that
> there effectively is a "level 2" (or technically, level 3) in the cache.
>
> A tlb miss from cache is so cheap that in almost all cases (you can
> cheat it by using only 1 byte per page, walking randomly through memory
> and having a strict ordering between those 1 byte accesses) it is
> hidden in the out of order engine.
>    

Pointer chasing defeats OoO.  The cpu is limited in the amount of 
speculation it can do.

Since you will likely miss on the data access, you have two memory 
accesses to hide (3 for virt).

> So in practice, for many apps, as long as the CPU cache scales with
> application size the TLB more or less scales too.
>    

A 16MB cache maps 8GB of memory (4GB with virtualization), leaving 
nothing for data.
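
(For concreteness: one 8-byte pte maps a 4k page, so a 16MB cache full
of ptes covers 2M entries * 4k = 8GB; under virtualization each guest
page needs both a guest and a host translation, roughly doubling the
footprint and halving the coverage to ~4GB.)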

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:59                                                           ` Avi Kivity
@ 2010-04-12 12:23                                                             ` Avi Kivity
  2010-04-12 13:25                                                             ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 12:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On 04/12/2010 01:59 PM, Avi Kivity wrote:
>>> Containers are wonderful but still a future thing, and even when
>>> fully implemented they still don't offer the same isolation as
>>> virtualization.  For example, the owner of workload A might want to
>>> upgrade the kernel to fix a bug he's hitting, while the owner of
>>> workload B needs three months to test it.
>> But better for performance in general.
>>
>
> True.  But virtualization has the advantage of actually being there.

btw, containers are way more intrusive than all the kvm related changes 
put together, and still not done.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:44                                                 ` Mel Gorman
  2010-04-12 11:12                                                   ` Avi Kivity
@ 2010-04-12 13:17                                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 13:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Ingo Molnar, Avi Kivity, Mike Galbraith,
	Jason Garrett-Glaser, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 11:44:51AM +0100, Mel Gorman wrote:
> As a side-note, this is what dynamic hugepage pool resizing was for.
> 
> hugeadm --pool-pages-max <size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>
> 
> The hugepage pool grows and shrinks as required if the system is able to
> allocate the huge pages. If the huge pages are not available, mmap() returns
> NULL and userspace is expected to recover by retrying the allocation with
> small pages (something libhugetlbfs does automatically).

If 99% of the virtual space is backed by hugepages and just the last
2M has to be backed by regular pages, that's fine with us; we want to
use hugepages for the 99% of the memory.

> In the virtualisation context, the greater problem with such an approach
> is no-overcommit is possible. I am given to understand that this is a
> major problem because hosts of virtual machines are often overcommitted
> on the assumption they don't all peak at the same time.

Yep. Another thing that comes to mind is that we need KSM to split and
merge hugepages when they're found equal. That's not working right now,
but it's more natural to do it in the core VM, as KSM pages then have
to be swapped too and mixed in the same vma with regular pages and
hugepages.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:59                                                           ` Avi Kivity
  2010-04-12 12:23                                                             ` Avi Kivity
@ 2010-04-12 13:25                                                             ` Andrea Arcangeli
  1 sibling, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 13:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, Apr 12, 2010 at 01:59:06PM +0300, Avi Kivity wrote:
> Right; and on a 16-64GB machine you'll have a hard time filling kernel 
> memory with objects.

Yep, this is worth mentioning: the more RAM there is, the higher the
percentage of freeable memory that won't be fragmented, even without
kernelcore=. Which is probably why we won't ever need to worry about
kernelcore=.

> kvm overcommit uses ballooning, page merging, and swapping.  None of 
> these work well with large pages (well, ballooning might).

KSM is the only one that will need some further modification to be
able to merge the equal contents inside hugepages. It already can
co-exist (I tested it) but right now it will skip over hugepages and
it's only able to merge regular pages if there are any. We need to make
it hugepage aware and to split the hugepages when it finds stuff to
merge.


* Re: hugepages will matter more in the future
  2010-04-12 11:22                                                     ` Arjan van de Ven
  2010-04-12 11:29                                                       ` Avi Kivity
@ 2010-04-12 13:30                                                       ` Andrea Arcangeli
  2010-04-12 13:33                                                         ` Avi Kivity
  2010-04-13 11:38                                                         ` Ingo Molnar
  1 sibling, 2 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 13:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Avi Kivity, Jason Garrett-Glaser, Mike Galbraith,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote:
> Now hugepages have some interesting other advantages, namely they save
> pagetable memory... which for something like TPC-C on a fork-based
> database can be a measurable win.

It doesn't save pagetable memory (as in `grep MemFree
/proc/meminfo`). To achieve that we'd need to return -ENOMEM from
split_huge_page_pmd and split_huge_page, which would complicate things
significantly. I'd prefer if we could gradually get rid of
split_huge_page_pmd calls instead of having to handle a retval in
several inner nested functions that don't contemplate returning an
error, nor do any of their callers.

I think the saving in pagetables isn't really interesting... it's a
couple of gigabytes but it doesn't move the needle as much as being
able to boost CPU performance.


* Re: hugepages will matter more in the future
  2010-04-12 13:30                                                       ` Andrea Arcangeli
@ 2010-04-12 13:33                                                         ` Avi Kivity
  2010-04-12 13:39                                                           ` Andrea Arcangeli
  2010-04-13 11:38                                                         ` Ingo Molnar
  1 sibling, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Arjan van de Ven, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/12/2010 04:30 PM, Andrea Arcangeli wrote:
> On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote:
>    
>> Now hugepages have some interesting other advantages, namely they save
>> pagetable memory... which for something like TPC-C on a fork-based
>> database can be a measurable win.
>>      
> It doesn't save pagetable memory (as in `grep MemFree
> /proc/meminfo`).

So where does the pagetable go?

> To achieve that we'd need to return -ENOMEM from
> split_huge_page_pmd and split_huge_page, which would complicate things
> significantly. I'd prefer if we could gradually get rid of
> split_huge_page_pmd calls instead of having to handle a retval in
> several inner nested functions that don't contemplate returning an
> error, nor do any of their callers.
>
> I think the saving in pagetables isn't really interesting... it's a
> couple of gigabytes but it doesn't move the needle as much as being
> able to boost CPU performance.
>    

Fork-based databases (or process+shm based, like Oracle) replicate the 
page tables per process, so it's N * 0.2%, which would be quite large.  
We could share pmds for large shared memory areas, but it wouldn't be easy.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: hugepages will matter more in the future
  2010-04-12 13:33                                                         ` Avi Kivity
@ 2010-04-12 13:39                                                           ` Andrea Arcangeli
  2010-04-12 13:53                                                             ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 13:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Arjan van de Ven, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 12, 2010 at 04:33:50PM +0300, Avi Kivity wrote:
> So where does the pagetable go?

They're preallocated together with the hugepage and queued into the mm
to retain locality. This way a huge pmd can be converted to a regular
pmd pointing to the preallocated pte on the fly without GFP_KERNEL
allocations.
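
In rough pseudo-kernel-C, the scheme looks like this (the
deposit/withdraw helper names are illustrative, not necessarily the
patchset's actual ones):

	pgtable_t pgtable;

	/* huge fault: allocate the pte page while sleeping is still
	 * allowed, and park it on a per-mm list instead of wiring it
	 * into the page tables */
	pgtable = pte_alloc_one(mm, haddr);
	if (unlikely(!pgtable))
		return VM_FAULT_OOM;	/* fall back to 4k pages */
	set_pmd_at(mm, haddr, pmd, pmd_mkhuge(mk_pmd(page, vma->vm_page_prot)));
	huge_pte_deposit(mm, pgtable);

	/* split: runs in contexts that must not fail, so no allocation */
	pgtable = huge_pte_withdraw(mm);	/* guaranteed non-NULL */
	/* ... fill the 512 ptes from the content of the huge pmd ... */
	pmd_populate(mm, pmd, pgtable);

which is why split_huge_page_pmd() never has to pass an -ENOMEM back
up through callers that can't handle one.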


* Re: hugepages will matter more in the future
  2010-04-12 13:39                                                           ` Andrea Arcangeli
@ 2010-04-12 13:53                                                             ` Avi Kivity
  0 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 13:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Arjan van de Ven, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/12/2010 04:39 PM, Andrea Arcangeli wrote:
> On Mon, Apr 12, 2010 at 04:33:50PM +0300, Avi Kivity wrote:
>    
>> So where does the pagetable go?
>>      
> They're preallocated together with the hugepage and queued into the mm
> to retain locality. This way a huge pmd can be converted to a regular
> pmd pointing to the preallocated pte on the fly without GFP_KERNEL
> allocations.
>    

Oh.  Well I hope this can be eliminated in the future somehow.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-10 18:47                         ` Andrea Arcangeli
  2010-04-10 19:02                           ` Ingo Molnar
@ 2010-04-12 14:24                           ` Christoph Lameter
  2010-04-12 14:49                             ` Avi Kivity
  1 sibling, 1 reply; 205+ messages in thread
From: Christoph Lameter @ 2010-04-12 14:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sat, 10 Apr 2010, Andrea Arcangeli wrote:

> Full agreement! I think everyone wants transparent hugepages; the only
> complaint I ever heard so far is from Christoph, who has some slight
> preference on not introducing split_huge_page and going full hugepage
> everywhere, with native gup immediately, where GUP only returns head
> pages and every caller has to check PageTransHuge on them to see if
> it's huge or not. Changing several hundred drivers in one go and
> with native swapping with hugepage backed swapcache immediately, which
> means also pagecache has to deal with hugepages immediately, is
> possible too, but I think this more gradual approach is easier to keep
> under control, Rome wasn't built in a day. Surely at a later stage I
> want tmpfs backed by hugepages too at least, and maybe pagecache, but
> it doesn't need to happen immediately. Also we have to keep in mind
> that for huge systems PAGE_SIZE should eventually become 2M, and those
> will be able to take advantage of transparent hugepages for the 1G
> pud_trans_huge; that will make HPC even faster. Anyway nothing
> prevents to take Christoph's long term direction also by starting self
> contained.

I want hugepages but not the way you have done it here. Follow conventions
and do not introduce on-the-fly conversion of page sizes, and do not treat
a huge page as a 2M page while also handling the 4k components as separate
pages. Those create additional synchronization issues (like the compound
lock and the refcounting of tail pages). There are existing ways to
convert from 2M to 4k without these issues (see reclaim logic and page
migration). This would be much cleaner.

I am not sure where your imagination ran wild to make the claim that
hundreds of drivers would have to be changed only because of the use of
proper synchronization methods. I have never said that everything has to
be converted in one go but that it would have to be an incremental
process.

Would you please stop building strawmen and telling wild stories?

> To me what is relevant is that everyone in the VM camp seems to want
> transparent hugepages in some shape or form, because of the roughly
> linear speedup they provide to everything running on them on bare
> metal (and a more than linear cumulative speedup in case of nested
> pagetables for obvious reasons), no matter what the design is.

We want huge pages, yes. But transparent? If you can define transparent
then we may agree at some point. Certainly not transparent in the sense of
volatile objects that suddenly convert from 2M to 4K sizes causing
breakage.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12  6:18                                           ` Pekka Enberg
  2010-04-12  6:48                                             ` Nick Piggin
@ 2010-04-12 14:29                                             ` Christoph Lameter
  2010-04-12 16:06                                               ` Nick Piggin
  1 sibling, 1 reply; 205+ messages in thread
From: Christoph Lameter @ 2010-04-12 14:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Ingo Molnar, Avi Kivity, Mike Galbraith,
	Jason Garrett-Glaser, Andrea Arcangeli, Linus Torvalds,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, 12 Apr 2010, Pekka Enberg wrote:

> > Especially when you use something like SLUB as the memory allocator
> > which requires higher order allocations for objects which are pinned
> > in kernel memory.
>
> I guess we'd need to merge the SLUB defragmentation patches to fix that?

1. SLUB does not require higher order allocations.

2. SLUB defrag patches would allow reclaim / moving of slab memory but
would require callbacks to be provided by slab users to remove references
to objects.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 14:24                           ` Christoph Lameter
@ 2010-04-12 14:49                             ` Avi Kivity
  0 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-12 14:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Ingo Molnar, Linus Torvalds, Pekka Enberg,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/12/2010 05:24 PM, Christoph Lameter wrote:
>
>> To me what is relevant is that everyone in the VM camp seems to want
>> transparent hugepages in some shape or form, because of the roughly
>> linear speedup they provide to everything running on them on bare
>> metal (and a more than linear cumulative speedup in case of nested
>> pagetables for obvious reasons), no matter what the design is.
>>      
> We want huge pages yes. But transparent? If you can define transparent
> then we may agree at some point. Certainly not transparent in the sense of
> volatile objects that suddenly convert from 2M to 4K sizes causing
> breakage.
>    

Suddenly converting from 2M to 4k is a requirement, otherwise we could 
just use hugetlbfs.

It's simple, we want huge pages when we have the memory and small pages 
when we don't.  Only the kernel knows about memory pressure, so it's up 
to the kernel to break apart and put together those huge pages.

If you have other requirements, they have to come on top, not replace 
our requirements.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: hugepages will matter more in the future
  2010-04-11 19:40                                                       ` Andrea Arcangeli
@ 2010-04-12 15:41                                                         ` Linus Torvalds
  0 siblings, 0 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-12 15:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Avi Kivity, Jason Garrett-Glaser, Mike Galbraith,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Arjan van de Ven



On Sun, 11 Apr 2010, Andrea Arcangeli wrote:

> On Sun, Apr 11, 2010 at 08:22:04AM -0700, Linus Torvalds wrote:
> >  - magic libc malloc flags that are totally and utterly unrealistic in 
> >    anything but a benchmark
> > 
> >  - by basically keeping one CPU totally busy doing defragmentation.
> 
> This is a red herring. This is the last thing we want, and we'd run
> even faster if we could make current glibc binaries cooperate. But
> this is a new feature and it'll require changing glibc slightly.

So if it is a red herring, why the hell did you do your numbers with it?

Also, talking about "changing glibc slightly" is just another sign of 
denial of reality. You realize that a lot of apps (especially the ones 
with large VM footprints) do not use glibc malloc at all, exactly because 
it has some bad properties, particularly with threading?

I saw people quote firefox mappings in this thread. You realize that 
firefox is one such application?

> Future glibc will be optimal and it won't require khugepaged, don't
> worry.

Sure. "All problems are imaginary".

> I got crashes (page_mapcount != number of huge_pmds mapping the page)
> in split_huge_page because of the anon-vma bug, so I had to back it
> out; this is why it's stable now.

Ok. My deeper point really was that all the VM people seem to be in this 
circlejerk to improve performance, and it looks like nobody is even trying 
to fix the _existing_ problem (caused by an earlier attempt to improve 
performance).

I'm totally unimpressed with the whole circus partly exactly due to that.

			Linus


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 14:29                                             ` Christoph Lameter
@ 2010-04-12 16:06                                               ` Nick Piggin
  0 siblings, 0 replies; 205+ messages in thread
From: Nick Piggin @ 2010-04-12 16:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Ingo Molnar, Avi Kivity, Mike Galbraith,
	Jason Garrett-Glaser, Andrea Arcangeli, Linus Torvalds,
	Andrew Morton, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, Apr 12, 2010 at 09:29:03AM -0500, Christoph Lameter wrote:
> On Mon, 12 Apr 2010, Pekka Enberg wrote:
> 
> > > Especially when you use something like SLUB as the memory allocator
> > > which requires higher order allocations for objects which are pinned
> > > in kernel memory.
> >
> > I guess we'd need to merge the SLUB defragmentation patches to fix that?
> 
> 1. SLUB does not require higher order allocations.

The problem is not that it requires higher order allocations. The
problem is that it uses them. It is not a failing higher order
allocation attempt in SLUB that we're worried about here.


* Re: hugepages will matter more in the future
  2010-04-11 15:52                                                         ` Linus Torvalds
  2010-04-11 16:04                                                           ` Avi Kivity
  2010-04-11 19:35                                                           ` Andrea Arcangeli
@ 2010-04-12 16:20                                                           ` Rik van Riel
  2010-04-12 16:40                                                             ` Linus Torvalds
  2 siblings, 1 reply; 205+ messages in thread
From: Rik van Riel @ 2010-04-12 16:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On 04/11/2010 11:52 AM, Linus Torvalds wrote:

> So here's the deal: make the code cleaner, and it's fine. And stop trying
> to sell it with _crap_.

Since none of the hugepages proponents in this thread seem to have
asked this question:

What would you like the code to look like, in order for hugepages
code to be acceptable to you?


* Re: hugepages will matter more in the future
  2010-04-12 16:20                                                           ` Rik van Riel
@ 2010-04-12 16:40                                                             ` Linus Torvalds
  2010-04-12 16:56                                                               ` Linus Torvalds
  2010-04-12 17:36                                                               ` Andrea Arcangeli
  0 siblings, 2 replies; 205+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven



On Mon, 12 Apr 2010, Rik van Riel wrote:

> On 04/11/2010 11:52 AM, Linus Torvalds wrote:
> 
> > So here's the deal: make the code cleaner, and it's fine. And stop trying
> > to sell it with _crap_.
> 
> Since none of the hugepages proponents in this thread seem to have
> asked this question:
> 
> What would you like the code to look like, in order for hugepages
> code to be acceptable to you?

So as I already commented to Andrew, the code has no comments about the 
"big picture", and the largest comment I found was about a totally 
_trivial_ issue about replacing the hugepage by first clearing the entry, 
then flushing the tlb, and then filling it.

That needs hardly any comment at all, since that's what we do for _normal_ 
page table entries too when we change anything non-trivial about them. 
That's the antithesis of rocket science. Yet that was apparently 
considered the most important thing in the whole core patch to talk about!

And quite frankly, I've been irritated by the "timings" used to sell this 
thing from the start. The changelog for the entry makes a big deal out of 
the fact that there's just a single page fault per 2MB, and that clearing 
a huge region is faster the first time because you don't take a lot of 
page faults.

That's a "Duh!" moment too, but it never even talks about the issue of 
"oh, well, we did allocate all those 2M chunks, not knowing whether they 
were going to be used or not".

Sure, it's going to help programs that actually use all of it. Nobody is 
surprised. What I still care about, and what makes _all_ the timings I've 
seen in this whole insane thread pretty much totally useless, is the fact 
that we used to know that what _really_ speeds up a machine is caching. 
Keeping _relevant_ data around so that you don't do IO. And the mantra 
from pretty much day one has been "free memory is wasted memory".

Yet now, the possibility of _truly_ wasting memory isn't apparently even a 
blip on anybody's radar. People blithely talk about changing glibc default 
behavior as if there are absolutely no issues, and 2MB chunks are pocket 
change.

I can pretty much guarantee that every single developer on this list has a 
machine with excessive amounts of memory compared to what the machine is 
actually required to do. And I just do not think that is true in general.

				Linus


* Re: hugepages will matter more in the future
  2010-04-12 16:40                                                             ` Linus Torvalds
@ 2010-04-12 16:56                                                               ` Linus Torvalds
  2010-04-12 17:06                                                                 ` Randy Dunlap
  2010-04-12 17:36                                                               ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Avi Kivity, Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven



On Mon, 12 Apr 2010, Linus Torvalds wrote:
> 
> So as I already commented to Andrew, the code has no comments about the 
> "big picture", and the largest comment I found was about a totally 
> _trivial_ issue about replacing the hugepage by first clearing the entry, 
> then flushing the tlb, and then filling it.

Btw, this is the same complaint I had about the anon_vma code. There were 
no overview comments, and some of my fixes to that came directly from 
writing a big-picture "what should happen" flow chart, and either noticing 
that the code didn't do what it should have done, or that even the big 
picture was not clear.

And yes, I do realize that historically we (I) haven't been good at those 
things. It's just that the VM has gotten _so_ complicated that we damn 
well need them, at least when we add new features that the rest of the VM 
team doesn't know by rote.

			Linus


* Re: hugepages will matter more in the future
  2010-04-12 16:56                                                               ` Linus Torvalds
@ 2010-04-12 17:06                                                                 ` Randy Dunlap
  0 siblings, 0 replies; 205+ messages in thread
From: Randy Dunlap @ 2010-04-12 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Avi Kivity, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Andrea Arcangeli, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On 04/12/10 09:56, Linus Torvalds wrote:
> 
> 
> On Mon, 12 Apr 2010, Linus Torvalds wrote:
>>
>> So as I already commented to Andrew, the code has no comments about the 
>> "big picture", and the largest comment I found was about a totally 
>> _trivial_ issue about replacing the hugepage by first clearing the entry, 
>> then flushing the tlb, and then filling it.
> 
> Btw, this is the same complaint I had about the anon_vma code. There was 
> no overview comments, and some of my fixes to that came directly from 
> writing a big-picture "what should happen" flow chart, and either noticing 
> that the code didn't do what it should have done, or that even the big 
> picture was not clear.
> 
> And yes, I do realize that historically we (I) haven't been good at those 
> things. It's just that the VM has gotten _so_ complicated that we damn 
> well need them, at least when we add new features that the rest of the VM 
> team doesn't know by rote.

and we can't expect Mel (or anyone) to write MM/VM books continuously,
which is what it would take since it's always changing,
so useful comments are the way to go.

-- 
~Randy


* Re: hugepages will matter more in the future
  2010-04-12 16:40                                                             ` Linus Torvalds
  2010-04-12 16:56                                                               ` Linus Torvalds
@ 2010-04-12 17:36                                                               ` Andrea Arcangeli
  2010-04-12 17:46                                                                 ` Rik van Riel
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-12 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Avi Kivity, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On Mon, Apr 12, 2010 at 09:40:54AM -0700, Linus Torvalds wrote:
> Yet now, the possibility of _truly_ wasting memory isn't apparently even a 
> blip on anybody's radar. People blithely talk about changing glibc default 
> behavior as if there are absolutely no issues, and 2MB chunks are pocket 
> change.

This is about enabled=always: in some cases we'll waste memory in the
hope of running faster, correct.

> I can pretty much guarantee that every single developer on this list has a 
> machine with excessive amounts of memory compared to what the machine is 
> actually required to do. And I just do not think that is true in general.

If this is the concern about general use, it's enough to make the
default:

    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled

and then only madvise(MADV_HUGEPAGE) regions (like qemu guest physical
memory) will use it, and khugepaged will _only_ scan madvise regions. That
guarantees zero RAM waste, and even a 128M embedded system should
definitely enable it, to squeeze a few cycles out of a slow CPU. It's a
one-liner change.
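
For example, an application that wants hugepages for a big anonymous
region (like the qemu guest physical memory case above) could do the
following. This is a minimal userland sketch; MADV_HUGEPAGE is the hint
defined in patch 01 of this series, the rest is standard mmap usage:

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* from patch 01, in case headers lag */
#endif

int main(void)
{
	size_t len = 256UL << 20;	/* e.g. guest physical memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* hint: back this range with transparent hugepages if possible */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise");	/* non-fatal: kernel may lack THP */
	return 0;
}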

I should make the default selectable at kernel config time, so
developers can keep it =always and distros can set it =madvise (trivial
to switch to "always" during boot or on the kernel command line). Right
now it's =always also to give it more testing btw.

Also, a note about glibc: our target is to replace libhugetlbfs,
practically making its behaviour the default. Applications that call
mmap without passing through malloc, or that use libs which can't be
overridden, can't take advantage of libhugetlbfs today either, so
that's ok. If somebody scatters 4k mappings all over the virtual
address space of a task, I don't want to allocate 2M pages for those 4k
virtual mappings (even if it'd be possible to reclaim them pretty fast
without I/O), though even that is theoretically possible. I just prefer
to have a glibc that cooperates, just like libhugetlbfs cooperates with
hugetlbfs.


* Re: hugepages will matter more in the future
  2010-04-12 17:36                                                               ` Andrea Arcangeli
@ 2010-04-12 17:46                                                                 ` Rik van Riel
  0 siblings, 0 replies; 205+ messages in thread
From: Rik van Riel @ 2010-04-12 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Avi Kivity, Ingo Molnar, Jason Garrett-Glaser,
	Mike Galbraith, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Arjan van de Ven

On 04/12/2010 01:36 PM, Andrea Arcangeli wrote:

> I should make the default selectable at kernel config time, so
> developers can keep it =always and distro can set it =madvise (trivial
> to switch to "always" during boot or with kernel command line). Right
> now it's =always also to give it more testing btw.

That still means the code will not benefit most applications.

Surely a more benign default behaviour is possible?  For
example, instantiating hugepages on pagefault only in VMAs
that are significantly larger than a hugepage (say, 16MB or
larger?) and not VM_GROWSDOWN (stack starts small).

We can still collapse the small pages into a large page if
the process starts actually using the memory in the VMA.
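
Sketching that heuristic as code (hypothetical helper and threshold,
purely to make the suggestion concrete):

/*
 * Hypothetical check at hugepage fault time: only back large,
 * non-stack VMAs with hugepages by default; the 16MB threshold is
 * the one suggested above, not a measured value.
 */
static inline int vma_wants_hugepage_by_default(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_GROWSDOWN)	/* stacks start small */
		return 0;
	return (vma->vm_end - vma->vm_start) >= (16UL << 20);
}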

Memory use is a serious concern for some people, even people
who could really benefit from the hugepages.  For example,
my home desktop system has 12GB RAM, but also runs 3 production
virtual machines (kernelnewbies, PSBL, etc) and often has a
test virtual machine as well.

Not wasting memory is important, since the system is constantly
doing disk IO.  Any memory that is taken away from the page
cache could hurt things.  On the other hand, speeding up the
virtual machines by 6% could be a big help too...

I'd like to think we can find a way to get the best of both
worlds.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-12 10:02                                                       ` Avi Kivity
  2010-04-12 10:08                                                         ` Andrea Arcangeli
  2010-04-12 10:37                                                         ` Nick Piggin
@ 2010-04-13  0:38                                                         ` Andrew Morton
  2010-04-13  6:18                                                           ` Neil Brown
  2 siblings, 1 reply; 205+ messages in thread
From: Andrew Morton @ 2010-04-13  0:38 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nick Piggin, Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Mon, 12 Apr 2010 13:02:34 +0300 Avi Kivity <avi@redhat.com> wrote:

> The only scenario I can see where it degrades is that you have a dcache 
> load that spills over to all of memory, then falls back leaving a pinned 
> page in every huge frame.  It can happen, but I don't see it as a likely 
> scenario.  But maybe I'm missing something.

<prehistoric memory>

This used to happen fairly easily.  You have a directory tree and some
app which walks down and across it, stat()ing regular files therein. 
So you end up with dentries and inodes which are laid out in memory as
dir-file-file-file-file-...-file-dir-file-...  Then the file
dentries/inodes get reclaimed and you're left with a sparse collection
of directory dcache/icache entries - massively fragmented.

I forget _why_ it happened.  Perhaps because S_ISREG cache items aren't
pinned by anything, but S_ISDIR cache items are pinned by their children
so it takes many more expiry rounds to get rid of them.

There was talk about fixing this, perhaps by using different slab
caches for dirs vs files.  Hard, because the type of the file/inode
isn't known at allocation time.  Nothing happened about it.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13  0:38                                                         ` Andrew Morton
@ 2010-04-13  6:18                                                           ` Neil Brown
  2010-04-13 13:31                                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 205+ messages in thread
From: Neil Brown @ 2010-04-13  6:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Avi Kivity, Nick Piggin, Ingo Molnar, Mike Galbraith,
	Jason Garrett-Glaser, Andrea Arcangeli, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael  S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, 12 Apr 2010 20:38:29 -0400
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 12 Apr 2010 13:02:34 +0300 Avi Kivity <avi@redhat.com> wrote:
> 
> > The only scenario I can see where it degrades is that you have a dcache 
> > load that spills over to all of memory, then falls back leaving a pinned 
> > page in every huge frame.  It can happen, but I don't see it as a likely 
> > scenario.  But maybe I'm missing something.
> 
> <prehistoric memory>
> 
> This used to happen fairly easily.  You have a directory tree and some
> app which walks down and across it, stat()ing regular files therein. 
> So you end up with dentries and inodes which are laid out in memory as
> dir-file-file-file-file-...-file-dir-file-...  Then the file
> dentries/inodes get reclaimed and you're left with a sparse collection
> of directory dcache/icache entries - massively fragmented.
> 
> I forget _why_ it happened.  Perhaps because S_ISREG cache items aren't
> pinned by anything, but S_ISDIR cache items are pinned by their children
> so it takes many more expiry rounds to get rid of them.
> 
> There was talk about fixing this, perhaps by using different slab
> caches for dirs vs files.  Hard, because the type of the file/inode
> isn't known at allocation time.  Nothing happened about it.

Actually I don't think that would be hard at all.
->lookup can return a different dentry than the one passed in, usually using
d_splice_alias to find it.
So when you create an inode for a directory, create an anonymous dentry,
attach it via i_dentry, and it should "just work".
That is assuming this is still a "problem" that needs to be "fixed".
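
Something along these lines (a sketch only; the foo_* names are
placeholders for a real filesystem):

/*
 * If an anonymous dentry was attached to the directory inode via
 * i_dentry when the inode was created, d_splice_alias() reuses it here
 * instead of the freshly allocated dentry passed to ->lookup.
 */
static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 struct nameidata *nd)
{
	struct inode *inode = foo_lookup_inode(dir, &dentry->d_name);

	if (IS_ERR(inode))
		return ERR_CAST(inode);
	return d_splice_alias(inode, dentry);	/* NULL inode is fine too */
}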

NeilBrown


* Re: hugepages will matter more in the future
  2010-04-12 13:30                                                       ` Andrea Arcangeli
  2010-04-12 13:33                                                         ` Avi Kivity
@ 2010-04-13 11:38                                                         ` Ingo Molnar
  2010-04-13 13:17                                                           ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-13 11:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Arjan van de Ven, Avi Kivity, Jason Garrett-Glaser,
	Mike Galbraith, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote:
> >
> > Now hugepages have some interesting other advantages, namely they save 
> > pagetable memory..which for something like TPC-C on a fork based database 
> > can be a measureable win.
> 
> It doesn't save pagetable memory (as in `grep MemFree /proc/meminfo`). [...]

It does save in terms of CPU cache footprint. (which the argument was about) 
The RAM is wasted, but are always cache cold.

> [...] I think the saving in pagetables isn't really interesting... [...]

I think it's very much interesting for 'pure' hugetlb mappings, as a 
next-step thing. It amounts to 8 bytes wasted per 4K page [0.2% of RAM 
wasted] - much more with the kind of aliasing that DBs frequently do. For 
hugetlb workloads it is roughly equivalent to an 8-byte increase in 
struct page size - few MM hackers would accept that.

So it will have to be fixed down the line.
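
Spelling out the arithmetic behind those figures: the preallocated pte
page costs 4096 bytes per 2M region, i.e. the 8 bytes per 4K page
above, i.e. 8/4096 ~= 0.2% of RAM; once the preallocation is removed, a
2M mapping only needs its 8-byte pmd entry, i.e. 8/2M ~= 0.0004%.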

	Ingo


* Re: hugepages will matter more in the future
  2010-04-13 11:38                                                         ` Ingo Molnar
@ 2010-04-13 13:17                                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-13 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Avi Kivity, Jason Garrett-Glaser,
	Mike Galbraith, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Tue, Apr 13, 2010 at 01:38:25PM +0200, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > On Mon, Apr 12, 2010 at 04:22:30AM -0700, Arjan van de Ven wrote:
> > >
> > > Now hugepages have some interesting other advantages, namely they save 
> > > pagetable memory..which for something like TPC-C on a fork based database 
> > > can be a measureable win.
> > 
> > It doesn't save pagetable memory (as in `grep MemFree /proc/meminfo`). [...]
> 
> It does save in terms of CPU cache footprint. (which the argument was about) 
> The RAM is wasted, but are always cache cold.

Definitely, thanks for further clarifying this, and this is why I've
been careful to specify "as in `grep MemFree..".

> i think it's very much interesting for 'pure' hugetlb mappings, as a next-step 
> thing. It amounts to 8 bytes wasted per 4K page [0.2% of RAM wasted] - much 
> more with the kind of aliasing that DBs frequently do - for hugetlb workloads 
> it is basically roughly equivalent to a +8 bytes increase in struct page size 
> - few MM hackers would accept that.
> 
> So it will have to be fixed down the line.

It's exactly 4k wasted for each pmd set as pmd_trans_huge. Removing
the pagetable preallocation would be absolutely trivial as far as
huge_memory.c is concerned (it takes about a minute of hacking) and in
fact it simplifies the code a bit; what will not be trivial is handling
the -ENOMEM retval in every place that calls split_huge_page_pmd, which
we can definitely address down the line (ideally by removing
split_huge_page_pmd). The other benefit the current preallocation
provides is that it doesn't increase the requirements on the
PF_MEMALLOC pool: until we can swap hugepages natively with a huge
swapcache, in order to swap we need to allocate the pte.
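
To make the scheme concrete, roughly this happens today (a sketch; the
helper names approximate the ones in the patchset):

/*
 * The pte page is allocated up front at hugepage fault time and parked
 * on the mm, so a later split can always succeed without allocating,
 * i.e. split_huge_page_pmd() never has to report -ENOMEM.
 */
static int huge_fault_sketch(struct mm_struct *mm, unsigned long haddr)
{
	pgtable_t pgtable = pte_alloc_one(mm, haddr);

	if (unlikely(!pgtable))
		return VM_FAULT_OOM;	/* fail early, while we still can */
	prepare_pmd_huge_pte(pgtable, mm); /* parked for a future split */
	return 0;
}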

Whoever tried this before (Dave, IIRC) answered a few emails ago that
he also had to preallocate the pte to avoid running into the above
issue. When he said that, it further confirmed to me that it's worth
going this way initially. Also note: we're not wasting memory compared
to a non-huge pmd, we just don't take advantage of the full potential
of hugepages, to keep things more manageable initially.

Thanks,
Andrea


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13  6:18                                                           ` Neil Brown
@ 2010-04-13 13:31                                                             ` Andrea Arcangeli
  2010-04-13 13:40                                                               ` Mel Gorman
  0 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-13 13:31 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Avi Kivity, Nick Piggin, Ingo Molnar,
	Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael  S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

Hi Neil!

On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote:
> Actually I don't think that would be hard at all.
> ->lookup can return a different dentry than the one passed in, usually using
> d_splice_alias to find it.
> So when you create an inode for a directory, create an anonymous dentry,
> attach it via i_dentry, and it should "just work".
> That is assuming this is still a "problem" that needs to be "fixed".

I'm not sure changing the slab object will make a whole lot of
difference, because antifrag will treat all unmovable stuff the
same. To make a difference, directories would have to go in a different
2M page from the inodes, and that would require changes to the slab
code, I guess.

However, while I doubt it helps with hugepage fragmentation because of
the above, it still sounds like a good idea to provide more "free
memory" to the system with less effort while preserving more cache.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13 13:31                                                             ` Andrea Arcangeli
@ 2010-04-13 13:40                                                               ` Mel Gorman
  2010-04-13 13:44                                                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 205+ messages in thread
From: Mel Gorman @ 2010-04-13 13:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Neil Brown, Andrew Morton, Avi Kivity, Nick Piggin, Ingo Molnar,
	Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael  S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote:
> Hi Neil!
> 
> On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote:
> > Actually I don't think that would be hard at all.
> > ->lookup can return a different dentry than the one passed in, usually using
> > d_splice_alias to find it.
> > So when you create an inode for a directory, create an anonymous dentry,
> > attach it via i_dentry, and it should "just work".
> > That is assuming this is still a "problem" that needs to be "fixed".
> 
> I'm not sure if changing the slab object will make a whole lot of
> difference, because antifrag will treat all unmovable stuff the
> same.

Anti-frag considers reclaimable slab caches to be different to unmovable
allocations. Slabs with SLAB_RECLAIM_ACCOUNT set use the
__GFP_RECLAIMABLE flag. The intent was to keep truly unmovable
allocations in the same 2M pages where possible.

It also means that even with large bursts of kernel allocations due to big
filesystem loads, the system will still get some of those 2M blocks back
when the slab eventually ages and shrinks.

You can use /proc/pagetypeinfo to get a count of the 2M blocks of each
type for different workloads, to see what the scenarios look like from an
anti-frag and compaction perspective. Very loosely speaking, with
compaction applied you'd expect to be able to convert all "Movable"
blocks to huge pages by either compacting or paging. You'll get some of
the "Reclaimable" blocks if slab is shrunk enough; how many blocks stay
unmovable depends on how many of the allocations are due to pagetables.

> To make a difference directories should go in a different 2M
> page of the inodes, and that would require changes to the slab code to
> achieve I guess.
> 
> However while I doubt it helps with hugepage fragmentation because of
> the above, it still sounds a good idea to provide more "free memory"
> to the system with less effort and while preserving more cache.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13 13:40                                                               ` Mel Gorman
@ 2010-04-13 13:44                                                                 ` Andrea Arcangeli
  2010-04-13 13:55                                                                   ` Mel Gorman
  0 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-13 13:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Neil Brown, Andrew Morton, Avi Kivity, Nick Piggin, Ingo Molnar,
	Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael  S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Tue, Apr 13, 2010 at 02:40:35PM +0100, Mel Gorman wrote:
> On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote:
> > Hi Neil!
> > 
> > On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote:
> > > Actually I don't think that would be hard at all.
> > > ->lookup can return a different dentry than the one passed in, usually using
> > > d_splice_alias to find it.
> > > So when you create an inode for a directory, create an anonymous dentry,
> > > attach it via i_dentry, and it should "just work".
> > > That is assuming this is still a "problem" that needs to be "fixed".
> > 
> > I'm not sure if changing the slab object will make a whole lot of
> > difference, because antifrag will treat all unmovable stuff the
> > same.
> 
> Anti-frag considers reclaimable slab caches to be different to unmovable
> allocations. Slabs with the SLAB_RECLAIM_ACCOUNT use the __GFP_RECLAIMABLE
> flag. It was to keep truly unmovable allocations in the same 2M pages where
> possible.

As long as we keep the reclaimable separated from the "movable" that's
fine.

> It also means that even with large bursts of kernel allocations due to big
> filesystem loads, the system will still get some of those 2M blocks back
> eventually when slab eventually ages and shrinks.

Only if the file isn't open... it's not really certain it's reclaimable.

> You can use /proc/pagetypeinfo to get a count of the 2M blocks of each
> type for different types of workloads to see what the scenarios look like
> from an anti-frag and compaction perspective but very loosly speaking,
> with compaction applied, you'd expect to be able to covert all "Movable"
> blocks to huge pages by either compacting or paging. You'll get some of the
> "Reclaimable" blocks if slab is shrunk enough the unmovable blocks depends
> on how many of the allocations are due to pagetables.

Awesome statistic!


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13 13:44                                                                 ` Andrea Arcangeli
@ 2010-04-13 13:55                                                                   ` Mel Gorman
  2010-04-13 14:03                                                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 205+ messages in thread
From: Mel Gorman @ 2010-04-13 13:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Neil Brown, Andrew Morton, Avi Kivity, Nick Piggin, Ingo Molnar,
	Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael  S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Tue, Apr 13, 2010 at 03:44:56PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 13, 2010 at 02:40:35PM +0100, Mel Gorman wrote:
> > On Tue, Apr 13, 2010 at 03:31:54PM +0200, Andrea Arcangeli wrote:
> > > Hi Neil!
> > > 
> > > On Tue, Apr 13, 2010 at 04:18:02PM +1000, Neil Brown wrote:
> > > > Actually I don't think that would be hard at all.
> > > > ->lookup can return a different dentry than the one passed in, usually using
> > > > d_splice_alias to find it.
> > > > So when you create an inode for a directory, create an anonymous dentry,
> > > > attach it via i_dentry, and it should "just work".
> > > > That is assuming this is still a "problem" that needs to be "fixed".
> > > 
> > > I'm not sure if changing the slab object will make a whole lot of
> > > difference, because antifrag will treat all unmovable stuff the
> > > same.
> > 
> > Anti-frag considers reclaimable slab caches to be different to unmovable
> > allocations. Slabs with the SLAB_RECLAIM_ACCOUNT use the __GFP_RECLAIMABLE
> > flag. It was to keep truly unmovable allocations in the same 2M pages where
> > possible.
> 
> As long as we keep the reclaimable separated from the "movable" that's
> fine.
> 

That already happens.

> > It also means that even with large bursts of kernel allocations due to big
> > filesystem loads, the system will still get some of those 2M blocks back
> > eventually when slab eventually ages and shrinks.
> 
> Only if the file isn't open... it's not really certain it's reclaimable.
> 

True. Christoph made a few stabs at targeted slab reclaim (called
defragmentation, but it was about reclaim) but it was never completed
and merged. Even if it were merged, the reclaimable slab objects would
still be kept in their own 2M pageblocks though.

> > You can use /proc/pagetypeinfo to get a count of the 2M blocks of each
> > type for different types of workloads to see what the scenarios look like
> > from an anti-frag and compaction perspective but very loosly speaking,
> > with compaction applied, you'd expect to be able to covert all "Movable"
> > blocks to huge pages by either compacting or paging. You'll get some of the
> > "Reclaimable" blocks if slab is shrunk enough the unmovable blocks depends
> > on how many of the allocations are due to pagetables.
> 
> Awesome statistic!
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-13 13:55                                                                   ` Mel Gorman
@ 2010-04-13 14:03                                                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-13 14:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Neil Brown, Andrew Morton, Avi Kivity, Nick Piggin, Ingo Molnar,
	Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Arnd Bergmann,
	Michael  S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura

On Tue, Apr 13, 2010 at 02:55:43PM +0100, Mel Gorman wrote:
> That already happens.

Yep as shown by /proc/pagetypeinfo.

> True. Christoph made a few stabs at being able to slab targetted reclaim
> (called defragmentation, but it was about reclaim) but it was never completed
> and merged. Even if it was merged, the slab reclaimable objects would
> still be kept in their own 2M pageblocks though.

I guess it's harder and more expensive to reclaim in-use objects; I
didn't see the targeted reclaim patches. So it sounds ok if they stay
in their own pageblocks, separated from the movable pageblocks, even if
they become fully reclaimable.


* Re: hugepages will matter more in the future
  2010-04-12 11:29                                                       ` Avi Kivity
@ 2010-04-17 15:12                                                         ` Arjan van de Ven
  2010-04-17 18:18                                                           ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Arjan van de Ven @ 2010-04-17 15:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Mon, 12 Apr 2010 14:29:58 +0300
Avi Kivity <avi@redhat.com> wrote:
> Pointer chasing defeats OoO.  The cpu is limited in the amount of 
> speculation it can do.

Pointer chasing defeats the CPU cache as well.
As long as the CPU cache mostly contains the page tables for all the
data in the cache, applications that try to work well with the cache
don't notice too much. Sure, once pointer chasing starts taking cache
misses, things suck. They do very much so.
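
To illustrate the access pattern under discussion, a toy example:

/*
 * Toy illustration: each load depends on the previous one, so the CPU
 * can neither reorder around nor usefully prefetch past the cache
 * misses; the traversal runs at memory latency, not memory bandwidth.
 */
struct node {
	struct node *next;
	long payload;
};

static long chase(const struct node *n)
{
	long sum = 0;

	while (n) {
		sum += n->payload;	/* stalls on the miss for *n */
		n = n->next;	/* next address known only after the load */
	}
	return sum;
}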


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: hugepages will matter more in the future
  2010-04-17 15:12                                                         ` Arjan van de Ven
@ 2010-04-17 18:18                                                           ` Avi Kivity
  2010-04-17 19:05                                                             ` Arjan van de Ven
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-17 18:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/17/2010 06:12 PM, Arjan van de Ven wrote:
> On Mon, 12 Apr 2010 14:29:58 +0300
> Avi Kivity<avi@redhat.com>  wrote:
>    
>> Pointer chasing defeats OoO.  The cpu is limited in the amount of
>> speculation it can do.
>>      
> Pointer chasing defeats the CPU cache as well.
>    

True.

> As long as the CPU cache mostly contains the page tables for all the
> data in the cache, applications that try to work well with the cache
> don't notice too much. Sure, once pointer chasing starts taking cache
> misses, things suck. They do very much so.
>    

Correct.  We're trying to reduce suckage from 2 cache misses per access 
(3 for virt), to 1 cache miss per access.  We're also freeing up space 
in the cache for data.

Saying the application already sucks isn't helping anything.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: hugepages will matter more in the future
  2010-04-17 18:18                                                           ` Avi Kivity
@ 2010-04-17 19:05                                                             ` Arjan van de Ven
  2010-04-17 19:05                                                               ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Arjan van de Ven @ 2010-04-17 19:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sat, 17 Apr 2010 21:18:12 +0300
> 
> Correct.  We're trying to reduce suckage from 2 cache misses per
> access (3 for virt), to 1 cache miss per access.  We're also freeing
> up space in the cache for data.
> 
> Saying the application already sucks isn't helping anything.

but the guy who's writing the application will already optimize for
this case...



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: hugepages will matter more in the future
  2010-04-17 19:05                                                             ` Arjan van de Ven
@ 2010-04-17 19:05                                                               ` Avi Kivity
  2010-04-17 19:18                                                                 ` Arjan van de Ven
  0 siblings, 1 reply; 205+ messages in thread
From: Avi Kivity @ 2010-04-17 19:05 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/17/2010 10:05 PM, Arjan van de Ven wrote:
> On Sat, 17 Apr 2010 21:18:12 +0300
>    
>> Correct.  We're trying to reduce suckage from 2 cache misses per
>> access (3 for virt), to 1 cache miss per access.  We're also freeing
>> up space in the cache for data.
>>
>> Saying the application already sucks isn't helping anything.
>>      
> but the guy who's writing the application will already optimize for
> this case...
>    

I lost you.  What is he optimizing for?  4k pages?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: hugepages will matter more in the future
  2010-04-17 19:05                                                               ` Avi Kivity
@ 2010-04-17 19:18                                                                 ` Arjan van de Ven
  2010-04-17 19:20                                                                   ` Avi Kivity
  0 siblings, 1 reply; 205+ messages in thread
From: Arjan van de Ven @ 2010-04-17 19:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On Sat, 17 Apr 2010 22:05:31 +0300
Avi Kivity <avi@redhat.com> wrote:

> On 04/17/2010 10:05 PM, Arjan van de Ven wrote:
> > On Sat, 17 Apr 2010 21:18:12 +0300
> >    
> >> Correct.  We're trying to reduce suckage from 2 cache misses per
> >> access (3 for virt), to 1 cache miss per access.  We're also
> >> freeing up space in the cache for data.
> >>
> >> Saying the application already sucks isn't helping anything.
> >>      
> > but the guy who's writing the application will already optimize for
> > this case...
> >    
> 
> I lost you.  What is he optimizing for?  4k pages?

not totally sucking on cache misses, e.g. trying to get data locality etc.
 


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: hugepages will matter more in the future
  2010-04-17 19:18                                                                 ` Arjan van de Ven
@ 2010-04-17 19:20                                                                   ` Avi Kivity
  0 siblings, 0 replies; 205+ messages in thread
From: Avi Kivity @ 2010-04-17 19:20 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Jason Garrett-Glaser, Mike Galbraith,
	Andrea Arcangeli, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-mm, Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura

On 04/17/2010 10:18 PM, Arjan van de Ven wrote:
> On Sat, 17 Apr 2010 22:05:31 +0300
> Avi Kivity<avi@redhat.com>  wrote:
>
>    
>> On 04/17/2010 10:05 PM, Arjan van de Ven wrote:
>>      
>>> On Sat, 17 Apr 2010 21:18:12 +0300
>>>
>>>        
>>>> Correct.  We're trying to reduce suckage from 2 cache misses per
>>>> access (3 for virt), to 1 cache miss per access.  We're also
>>>> freeing up space in the cache for data.
>>>>
>>>> Saying the application already sucks isn't helping anything.
>>>>
>>>>          
>>> but the guy who's writing the application will already optimize for
>>> this case...
>>>
>>>        
>> I lost you.  What is he optimizing for?  4k pages?
>>      
> not totally sucking on cache misses, e.g. trying to get data locality etc.
>    

Of course, but it's not always possible.  Hence Java and Oracle (and 
Linux itself) try to map their data with large pages.

Things like a garbage collector, an LRU, or large object trees that are 
traversed by semi-random input are hard/impossible to localize.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-11  1:05                                         ` Andrea Arcangeli
  2010-04-11 11:24                                           ` Ingo Molnar
@ 2010-04-25 19:27                                           ` Andrea Arcangeli
  2010-04-26 18:01                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-25 19:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Ulrich Drepper

On Sun, Apr 11, 2010 at 03:05:40AM +0200, Andrea Arcangeli wrote:
> With the above two params I get around 200M (around half) in
> hugepages with gcc building translate.o:
> 
> $ rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m22.900s
> user    0m22.601s
> sys     0m0.260s
> $ rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m22.405s
> user    0m22.125s
> sys     0m0.240s
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # exit
> $ rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m24.128s
> user    0m23.725s
> sys     0m0.376s
> $ rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m24.126s
> user    0m23.725s
> sys     0m0.376s
> $ uptime
>  02:36:07 up 1 day, 19:45,  5 users,  load average: 0.01, 0.12, 0.08
> 
> 1 sec in 24 means around 4% faster; hopefully once glibc fully
> cooperates we'll get better results than the above with gcc...
> 
> I tried to emulate it with khugepaged running in a loop and I get
> almost the whole gcc anon memory in hugepages this way (as expected):
> 
> # echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> # exit
> rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m21.950s
> user    0m21.481s
> sys     0m0.292s
> $ rm translate.o ; time make translate.o
>   CC    translate.o
> 
> real    0m21.992s
> user    0m21.529s
> sys     0m0.288s
> $ 
> 
> So this takes more than 2 seconds away from 24 seconds reproducibly,
> and it means gcc now runs 8% faster. This requires running khugepaged
> at 100% of one of the four cores, but with a slight change to glibc
> we'll be able to reach the exact same 8% speedup (or more, because this
> also involves copying ~200M and sending IPIs to unmap pages and stop
> userland during the memory copy, which won't be necessary anymore).
> 
> BTW, the current default for khugepaged is to scan 8 pmd every 10
> seconds, which means collapsing at most 16M every 10 seconds. Checking
> 8 pmd pointers every 10 seconds and 6 wakeups per minute for a kernel
> thread is absolutely unmeasurable, but despite the unmeasurable
> overhead it provides a very nice behavior for long-lived
> allocations that may have been swapped back in fragmented.
> 
> This is on phenom X4, I'd be interested if somebody can try on other cpus.
> 
> To get the environment of the test just:
> 
> git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
> cd qemu-kvm
> make
> cd x86_64-softmmu
> 
> export MALLOC_MMAP_THRESHOLD_=$[1024*1024*1024]
> export MALLOC_TOP_PAD_=$[1024*1024*1024]
> rm translate.o; time make translate.o
> 
> Then you need to flip the above sysfs controls as I did.

I patched gcc with the few-liner change, without tweaking glibc and
with khugepaged killed the whole time. The system had already been
under heavy load for about 12 hours, building glibc a couple of times
plus my usual kernel build load. Shutting down khugepaged isn't really
necessary considering how slow the scan is, but I did it anyway.

$ cat /sys/kernel/mm/transparent_hugepage/enabled 
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/enabled 
always madvise [never]
$ pgrep khugepaged
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing  -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64  -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H   -MMD -MP -MT translate.o -O2 -g  -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    55365925618  cycles                     ( +-   0.038% )  (scaled from 66.67%)
    36558135065  instructions             #      0.660 IPC     ( +-   0.061% )  (scaled from 66.66%)
    16103841974  dTLB-loads                 ( +-   0.109% )  (scaled from 66.68%)
            823  dTLB-load-misses           ( +-   0.081% )  (scaled from 66.70%)
    16080393958  L1-dcache-loads            ( +-   0.030% )  (scaled from 66.69%)
      357523292  L1-dcache-load-misses      ( +-   0.099% )  (scaled from 66.68%)

   23.129143516  seconds time elapsed   ( +-   0.035% )

If I tweak glibc:

$ export MALLOC_TOP_PAD_=100000000
$ export MALLOC_MMAP_THRESHOLD_=1000000000
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing  -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64  -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H   -MMD -MP -MT translate.o -O2 -g  -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    52684457919  cycles                     ( +-   0.059% )  (scaled from 66.67%)
    36392861901  instructions             #      0.691 IPC     ( +-   0.130% )  (scaled from 66.68%)
    16014094544  dTLB-loads                 ( +-   0.152% )  (scaled from 66.67%)
            784  dTLB-load-misses           ( +-   0.450% )  (scaled from 66.69%)
    16030576638  L1-dcache-loads            ( +-   0.161% )  (scaled from 66.70%)
      353904925  L1-dcache-load-misses      ( +-   0.510% )  (scaled from 66.68%)

   22.048837226  seconds time elapsed   ( +-   0.224% )

Then I disabled transparent hugepage (I left the glibc tweak in place;
just in case anyone wonders, with the environment vars set fewer brk
syscalls run, but it makes no difference without transparent hugepage
regardless of those environment settings).

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
$ set|grep MALLOC
MALLOC_MMAP_THRESHOLD_=1000000000
MALLOC_TOP_PAD_=100000000
_=MALLOC_TOP_PAD_
$ ~/bin/x86_64/perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses -e l1-dcache-loads -e l1-dcache-load-misses --repeat 3 gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing  -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64  -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H   -MMD -MP -MT translate.o -O2 -g  -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c

 Performance counter stats for 'gcc -I/crypto/home/andrea/kernel/qemu-kvm/slirp -Werror -m64 -fstack-protector-all -Wold-style-definition -Wold-style-declaration -I. -I/crypto/home/andrea/kernel/qemu-kvm -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wendif-labels -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -DHAS_AUDIO -DHAS_AUDIO_CHOICE -I/crypto/home/andrea/kernel/qemu-kvm/fpu -I/crypto/home/andrea/kernel/qemu-kvm/tcg -I/crypto/home/andrea/kernel/qemu-kvm/tcg/x86_64 -DTARGET_PHYS_ADDR_BITS=64 -I.. -I/crypto/home/andrea/kernel/qemu-kvm/target-i386 -DNEED_CPU_H -MMD -MP -MT translate.o -O2 -g -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include -include /crypto/home/andrea/kernel/qemu-kvm/kvm/include/linux/config.h -I/crypto/home/andrea/kernel/qemu-kvm/kvm/include/x86 -idirafter /crypto/home/andrea/kernel/qemu-kvm/compat -c -o translate.o /crypto/home/andrea/kernel/qemu-kvm/target-i386/translate.c' (3 runs):

    58193692408  cycles                     ( +-   0.129% )  (scaled from 66.66%)
    36565168786  instructions             #      0.628 IPC     ( +-   0.052% )  (scaled from 66.68%)
    16098510972  dTLB-loads                 ( +-   0.223% )  (scaled from 66.69%)
            867  dTLB-load-misses           ( +-   0.168% )  (scaled from 66.69%)
    16186049665  L1-dcache-loads            ( +-   0.112% )  (scaled from 66.69%)
      364792323  L1-dcache-load-misses      ( +-   0.145% )  (scaled from 66.66%)

   24.313032086  seconds time elapsed   ( +-   0.154% )

(24.31-22.04)/22.04 = 10.3% boost (or 9.3% faster if you divide by
24.31 ;).

Ulrich also sent me a snippet to align the region in glibc. I tried
it, but it doesn't get any faster than with the environment vars
above, so the env vars are simpler than having to rebuild glibc for
benchmarking (plus I was unsure whether this snippet really works as
well as the two env variables, so I used an unmodified stock glibc
for this test).

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 722b1d4..b067b65 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3168,6 +3168,10 @@ static Void_t* sYSMALLOc(nb, av) INTERNAL_SIZE_T nb; mstate av;
 
   size = nb + mp_.top_pad + MINSIZE;
 
+#define TWOM (2*1024*1024)
+  char *cur = (char*)MORECORE(0);
+  size = (char*)((size_t)(cur + size + TWOM - 1)&~(TWOM-1))-cur;
+
   /*
     If contiguous, we can subtract out existing space that we hope to
     combine with new space. We add it back later only if
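
For clarity, here is the rounding arithmetic of the snippet above
pulled out into a standalone sketch; it grows the request so that the
new heap top (cur + size) lands exactly on a 2M boundary.
round_request and the example break value are illustrative names of
mine, not glibc code:

#include <stdio.h>
#include <stdint.h>

#define TWOM (2UL * 1024 * 1024)

/* cur stands in for MORECORE(0), the current break */
static size_t round_request(uintptr_t cur, size_t size)
{
	/* round cur + size up to the next 2M multiple, then turn it
	 * back into a request length relative to cur */
	return (((cur + size) + TWOM - 1) & ~(TWOM - 1)) - cur;
}

int main(void)
{
	uintptr_t cur = 0x601000;	/* hypothetical current break */
	size_t size = round_request(cur, 300000);

	/* (cur + size) % TWOM == 0, so from the new top onward the
	 * heap stays 2M aligned and eligible for hugepage faults */
	printf("request %zu bytes, new top 0x%lx\n",
	       size, (unsigned long)(cur + size));
	return 0;
}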


Now that the gcc on my workstation is hugepage-friendly I can test a
kernel compile and see if I get any boost with that too; before, it
was just impossible.

Also note: if you read ggc-page.c or glibc malloc.c you'll notice
things like GGC_QUIRE_SIZE and all sorts of other alignment and
multipage heuristics there. So it's absolutely guaranteed that the
moment the kernel gets transparent hugepages they will add the
few-liner change to get the guaranteed boost at least for the
2M-sized allocations, just like they already rate-limit the number of
syscalls and do all the other alignment tricks for the cache etc.
Talking about gcc and glibc changes in this context is very realistic
IMHO, and I think it's a much superior solution to having mmap(4k)
backed by 2M pages, with all the complexity and additional branches
that would introduce in all page faults (not just in a single large
mmap, which is a slow path).

What we can add to the kernel, an idea that Ulrich proposed, is a
MAP_ALIGN parameter to mmap, so that the first argument of mmap
becomes the alignment. That creates more vmas, but the munmap below
does too. It's simply mandatory that allocations 2M in size start 2M
aligned from now on (the rest is handled by khugepaged already,
including the user stack itself). To avoid fragmenting the virtual
address space and in turn creating more vmas (and potentially
micro-slowing-down the page faults), these allocations that are
multiples of 2M in size and 2M aligned could probably go in their own
address range, something a MAP_ALIGN param can achieve inside the
kernel transparently. Of course if userland munmaps 4k at a time it
will fragment, but it's up to userland to also munmap in aligned
chunks that are multiples of 2M, if it wants to be optimal and avoid
vma creation.

The kernel used is aa.git fb6122f722c9e07da384c1309a5036a5f1c80a77 on
a single-socket 4-core phenom X4 with 4G of 800mhz ddr2 as before
(and no virt).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

--- /var/tmp/portage/sys-devel/gcc-4.4.2/work/gcc-4.4.2/gcc/ggc-page.c	2008-07-28 16:33:56.000000000 +0200
+++ /tmp/gcc-4.4.2/gcc/ggc-page.c	2010-04-25 06:01:32.829753566 +0200
@@ -450,6 +450,11 @@
 #define BITMAP_SIZE(Num_objects) \
   (CEIL ((Num_objects), HOST_BITS_PER_LONG) * sizeof(long))
 
+#ifdef __x86_64__
+#define HPAGE_SIZE (2*1024*1024)
+#define GGC_QUIRE_SIZE 512
+#endif
+
 /* Allocate pages in chunks of this size, to throttle calls to memory
    allocation routines.  The first page is used, the rest go onto the
    free list.  This cannot be larger than HOST_BITS_PER_INT for the
@@ -654,6 +659,23 @@
 #ifdef HAVE_MMAP_ANON
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,
 			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+#ifdef HPAGE_SIZE
+  if (!(size & (HPAGE_SIZE-1)) &&
+      page != (char *) MAP_FAILED && (size_t) page & (HPAGE_SIZE-1)) {
+	  char *old_page;
+	  munmap(page, size);
+	  page = (char *) mmap (pref, size + HPAGE_SIZE-1,
+				PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	  old_page = page;
+	  page = (char *) (((size_t)page + HPAGE_SIZE-1)
+			   & ~(HPAGE_SIZE-1));
+	  if (old_page != page)
+		  munmap(old_page, page-old_page);
+	  if (page != old_page + HPAGE_SIZE-1)
+		  munmap(page+size, old_page+HPAGE_SIZE-1-page);
+  }
+#endif
 #endif
 #ifdef HAVE_MMAP_DEV_ZERO
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,
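
For reference, here is the same over-allocate-and-trim trick as a
standalone sketch; hpage_mmap is a hypothetical helper of mine, not
gcc code, and unlike the inline hunk above it also checks the second
mmap for failure:

#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* size must be a multiple of HPAGE_SIZE */
static void *hpage_mmap(size_t size)
{
	char *raw, *page;

	/* over-allocate by HPAGE_SIZE-1 so a 2M-aligned region of the
	 * requested size is guaranteed to fit inside */
	raw = mmap(NULL, size + HPAGE_SIZE - 1, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* align up, then trim the unused head and tail */
	page = (char *)(((size_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
	if (page != raw)
		munmap(raw, page - raw);
	if (page - raw != HPAGE_SIZE - 1)
		munmap(page + size, raw + HPAGE_SIZE - 1 - page);
	return page;
}

int main(void)
{
	char *p = hpage_mmap(4 * HPAGE_SIZE);

	if (!p)
		return 1;
	printf("2M-aligned region at %p\n", p);
	/* munmap in the same aligned 2M chunks, to avoid fragmenting
	 * the address space and creating extra vmas */
	munmap(p, 4 * HPAGE_SIZE);
	return 0;
}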


^ permalink raw reply related	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-25 19:27                                           ` Andrea Arcangeli
@ 2010-04-26 18:01                                             ` Andrea Arcangeli
  2010-04-30  9:55                                               ` Ingo Molnar
  0 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-26 18:01 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Mike Galbraith, Jason Garrett-Glaser,
	Linus Torvalds, Pekka Enberg, Andrew Morton, linux-mm,
	Marcelo Tosatti, Adam Litke, Izik Eidus, Hugh Dickins,
	Nick Piggin, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Mike Travis, KAMEZAWA Hiroyuki,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Arnd Bergmann, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Ulrich Drepper

Now I tried a kernel compile with gcc patched as in the previous email
(stock glibc and no glibc environment parameters), without rebooting
(still plenty of hugepages as usual).

always:

real    4m7.280s
real    4m7.520s

never:

real    4m13.754s
real    4m14.095s

So the kernel now builds 2.3% faster. As expected, nothing huge here
because gcc isn't using several hundred megabytes of ram (unlike for
translate.o or other more pathological files), and lots of cpu time
is spent outside gcc.

Clearly this is not done for gcc (but for the JVM and other workloads
with larger working sets), but even a kernel build running more than
2% faster is I think worth mentioning, as it confirms we're heading
in the right direction.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-26 18:01                                             ` Andrea Arcangeli
@ 2010-04-30  9:55                                               ` Ingo Molnar
  2010-04-30 15:19                                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 205+ messages in thread
From: Ingo Molnar @ 2010-04-30  9:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Ulrich Drepper


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> Now I tried a kernel compile with gcc patched as in the previous email
> (stock glibc and no glibc environment parameters), without rebooting
> (still plenty of hugepages as usual).
> 
> always:
> 
> real    4m7.280s
> real    4m7.520s
> 
> never:
> 
> real    4m13.754s
> real    4m14.095s
> 
> So the kernel now builds 2.3% faster. As expected, nothing huge here
> because gcc isn't using several hundred megabytes of ram (unlike for
> translate.o or other more pathological files), and lots of cpu time
> is spent outside gcc.
> 
> Clearly this is not done for gcc (but for the JVM and other workloads
> with larger working sets), but even a kernel build running more than
> 2% faster is I think worth mentioning, as it confirms we're heading
> in the right direction.

Was this done on a native/host kernel?

I.e. do everyday kernel hackers gain 2.3% of kbuild performance from this?

I find that a very large speedup - it's much more than I'd have expected.

Are you absolutely 100% sure it's real? If yes, it would be nice to underline 
that by gathering some sort of 'perf stat --repeat 3 --all' kind of 
always/never comparison of those kernel builds, so that we can see where the 
+2.3% comes from.

I'd expect to see roughly the same instruction count (within noise), but a ~3% 
reduced cycle count (due to fewer/faster TLB fills).

	Ingo


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-30  9:55                                               ` Ingo Molnar
@ 2010-04-30 15:19                                                 ` Andrea Arcangeli
  2010-05-02 12:17                                                   ` Ingo Molnar
  0 siblings, 1 reply; 205+ messages in thread
From: Andrea Arcangeli @ 2010-04-30 15:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Ulrich Drepper, Paolo Bonzini

On Fri, Apr 30, 2010 at 11:55:43AM +0200, Ingo Molnar wrote:
> 
> * Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Now I tried a kernel compile with gcc patched as in the previous email
> > (stock glibc and no glibc environment parameters), without rebooting
> > (still plenty of hugepages as usual).
> > 
> > always:
> > 
> > real    4m7.280s
> > real    4m7.520s
> > 
> > never:
> > 
> > real    4m13.754s
> > real    4m14.095s
> > 
> > So the kernel now builds 2.3% faster. As expected, nothing huge here
> > because gcc isn't using several hundred megabytes of ram (unlike for
> > translate.o or other more pathological files), and lots of cpu time
> > is spent outside gcc.
> > 
> > Clearly this is not done for gcc (but for the JVM and other workloads
> > with larger working sets), but even a kernel build running more than
> > 2% faster is I think worth mentioning, as it confirms we're heading
> > in the right direction.
> 
> Was this done on a native/host kernel?

Correct, no virt, just bare metal.

> I.e. do everyday kernel hackers gain 2.3% of kbuild performance from this?

Yes, I already benefit from this in my work.

> 
> I find that a very large speedup - it's much more than I'd have expected.
>
> Are you absolutely 100% sure it's real? If yes, it would be nice to underline 

200% sure, at least on the phenom X4 with 1 socket, 4 cores and 800mhz
ddr2 ram! Why don't you try it yourself? You just have to use aa.git
plus the gcc patch I posted, and nothing else. This is what I'm
already using on all my systems to actively benefit from it.

I've also seen numbers on JVM benchmarks, even on the host, much
bigger than 10% with zero userland modifications (as long as the
allocation is done in big chunks everything works automatically, and
critical regions are usually allocated in big chunks; even gcc has
GGC_QUIRE_SIZE, but it had to be tuned from 1M to 2M and aligned).

The only crash I had was the one I fixed in the last release, a race
between migrate.c and exec.c that would trigger even without THP or
memory compaction; other than that I've had zero problems so far.

> that by gathering some sort of 'perf stat --repeat 3 --all' kind of 
> always/never comparison of those kernel builds, so that we can see where the 
> +2.3% comes from.

I can do that. I wasn't sure if perf would deal well with such a
macro benchmark; I haven't tried it yet.

> I'd expect to see roughly the same instruction count (within noise), but a ~3% 
> reduced cycle count (due to fewer/faster TLB fills).

Also note, before I did the few-liner patch to gcc so it always uses
transparent hugepages in its garbage collector code, the kernel build
was a little slower with transparent hugepage = always. The likely
reason is that make or cpp or gcc itself were thrashing the cache in
hugepage cows for data accesses that didn't benefit from the hugetlb;
that's my best estimate. Faulting more than 4k at a time is not
always beneficial for cows, which is why it's pointless to try to
implement any optimistic prefault logic: it can backfire on you by
just thrashing the cache more. My design ensures that every single
time we optimistically fault 2M at once, we get more than just the
initial optimistic-fault speedup (which comes with unwanted cache
thrashing and more latency in the fault because of the larger
clear-page and copy-page): we get something _much_ bigger and
longer-standing, the hugetlb and the faster tlb miss. I never pay the
cost of the optimistic fault unless I get a _lot_ more in return than
just entering/exiting the kernel fewer times. In fact the moment gcc
uses hugepages it's not like the cow-cache-thrashing cost goes away,
but the hugepage TLB effect likely leads to more than a 2.3% gain,
part of which is spent offsetting any minor slowdown in the cows. I
also suspect that with enabled=madvise and madvise called by gcc
ggc-page.c, things may be even faster than 2.3%. But it entirely
depends on the cpu cache sizes; on xeon the gain may be bigger than
2.3% as the cache thrashing may not materialize there at all, so I'm
sticking to the always option.

Paolo was also very nice to send the gcc extreme tests; those may
achieve > 10% speedups (considering translate.o of qemu is at a 10%
speedup already). I just didn't run those yet because translate.o was
much closer to a real-life scenario (in fact it is real life, for
better or worse), but in the future I'll try those gcc tests too, as
they emulate what a real app will have to do in similar
circumstances. They're pathological for gcc, but business as usual
for everything else in HPC.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 00 of 41] Transparent Hugepage Support #17
  2010-04-30 15:19                                                 ` Andrea Arcangeli
@ 2010-05-02 12:17                                                   ` Ingo Molnar
  0 siblings, 0 replies; 205+ messages in thread
From: Ingo Molnar @ 2010-05-02 12:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Mike Galbraith, Jason Garrett-Glaser, Linus Torvalds,
	Pekka Enberg, Andrew Morton, linux-mm, Marcelo Tosatti,
	Adam Litke, Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel,
	Mel Gorman, Dave Hansen, Benjamin Herrenschmidt, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Arnd Bergmann, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura,
	Ulrich Drepper, Paolo Bonzini


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> > I find that a very large speedup - it's much more than I'd have
> > expected.
> >
> > Are you absolutely 100% sure it's real? If yes, it would be nice to 
> > underline
> 
> 200% sure, at least on the phenom X4 with 1 socket, 4 cores and 800mhz ddr2
> ram! Why don't you try it yourself? You just have to use aa.git plus the gcc
> patch I posted, and nothing else. This is what I'm already using on all my
> systems to actively benefit from it.

Well, patching GCC (and then praying for GCC to actually build & work in a
full toolchain) is not something that's done easily within a few minutes.

I might try it; I just wanted to point out ways you can make the numbers
more convincing to people who don't try out your patches first-hand. It
was just a suggestion.

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 205+ messages in thread

end of thread, other threads:[~2010-05-02 12:18 UTC | newest]

Thread overview: 205+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-02  0:41 [PATCH 00 of 41] Transparent Hugepage Support #17 Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 01 of 41] define MADV_HUGEPAGE Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 02 of 41] compound_lock Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 03 of 41] alter compound get_page/put_page Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 04 of 41] update futex compound knowledge Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 05 of 41] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 06 of 41] clear compound mapping Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 07 of 41] add native_set_pmd_at Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 08 of 41] add pmd paravirt ops Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 09 of 41] no paravirt version of pmd ops Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 10 of 41] export maybe_mkwrite Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 11 of 41] comment reminder in destroy_compound_page Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 12 of 41] config_transparent_hugepage Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 13 of 41] special pmd_trans_* functions Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 14 of 41] add pmd mangling generic functions Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 15 of 41] add pmd mangling functions to x86 Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 16 of 41] bail out gup_fast on splitting pmd Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 17 of 41] pte alloc trans splitting Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 18 of 41] add pmd mmu_notifier helpers Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 19 of 41] clear page compound Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 20 of 41] add pmd_huge_pte to mm_struct Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 21 of 41] split_huge_page_mm/vma Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 22 of 41] split_huge_page paging Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 23 of 41] clear_copy_huge_page Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 24 of 41] kvm mmu transparent hugepage support Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 25 of 41] _GFP_NO_KSWAPD Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 26 of 41] don't alloc harder for gfp nomemalloc even if nowait Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 27 of 41] transparent hugepage core Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 28 of 41] verify pmd_trans_huge isn't leaking Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 29 of 41] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 30 of 41] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 31 of 41] memcg compound Andrea Arcangeli
2010-04-02  0:41 ` [PATCH 32 of 41] memcg huge memory Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 33 of 41] transparent hugepage vmstat Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 34 of 41] khugepaged Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 35 of 41] skip transhuge pages in ksm for now Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 36 of 41] remove PG_buddy Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 37 of 41] add x86 32bit support Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 38 of 41] mincore transparent hugepage support Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 39 of 41] add pmd_modify Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 40 of 41] mprotect: pass vma down to page table walkers Andrea Arcangeli
2010-04-02  0:42 ` [PATCH 41 of 41] mprotect: transparent huge page support Andrea Arcangeli
2010-04-05 19:09 ` [PATCH 00 of 41] Transparent Hugepage Support #17 Andrew Morton
2010-04-05 19:36   ` Ingo Molnar
2010-04-05 20:26     ` Pekka Enberg
2010-04-05 20:32       ` Linus Torvalds
2010-04-05 20:46         ` Pekka Enberg
2010-04-05 20:58           ` Linus Torvalds
2010-04-05 21:54             ` Ingo Molnar
2010-04-05 23:21             ` Andrea Arcangeli
2010-04-06  0:26               ` Linus Torvalds
2010-04-06  1:08                 ` [RFD] " Linus Torvalds
2010-04-06  1:26                   ` Andrea Arcangeli
2010-04-06  1:35                   ` Linus Torvalds
2010-04-06  1:13                 ` Andrea Arcangeli
2010-04-06  1:38                   ` Linus Torvalds
2010-04-06  2:23                     ` Linus Torvalds
2010-04-06  5:25                       ` Nick Piggin
2010-04-06  9:08                       ` Ingo Molnar
2010-04-06  9:13                         ` Ingo Molnar
2010-04-10 18:47                         ` Andrea Arcangeli
2010-04-10 19:02                           ` Ingo Molnar
2010-04-10 19:22                             ` Avi Kivity
2010-04-10 19:47                               ` Ingo Molnar
2010-04-10 20:00                                 ` Andrea Arcangeli
2010-04-10 20:10                                   ` Andrea Arcangeli
2010-04-10 20:21                                   ` Jason Garrett-Glaser
2010-04-10 20:24                                 ` Avi Kivity
2010-04-10 20:42                                   ` Avi Kivity
2010-04-10 20:47                                     ` Andrea Arcangeli
2010-04-10 21:00                                       ` Avi Kivity
2010-04-10 21:47                                         ` Andrea Arcangeli
2010-04-11  1:05                                         ` Andrea Arcangeli
2010-04-11 11:24                                           ` Ingo Molnar
2010-04-11 11:33                                             ` Avi Kivity
2010-04-11 12:11                                               ` Ingo Molnar
2010-04-25 19:27                                           ` Andrea Arcangeli
2010-04-26 18:01                                             ` Andrea Arcangeli
2010-04-30  9:55                                               ` Ingo Molnar
2010-04-30 15:19                                                 ` Andrea Arcangeli
2010-05-02 12:17                                                   ` Ingo Molnar
2010-04-10 20:49                                     ` Jason Garrett-Glaser
2010-04-10 20:53                                       ` Avi Kivity
2010-04-10 20:58                                         ` Jason Garrett-Glaser
2010-04-11  9:29                                         ` Avi Kivity
2010-04-11  9:37                                           ` Jason Garrett-Glaser
2010-04-11  9:40                                             ` Avi Kivity
2010-04-11 10:22                                               ` Jason Garrett-Glaser
2010-04-11 11:00                                               ` Ingo Molnar
2010-04-11 11:19                                                 ` Avi Kivity
2010-04-11 11:30                                                   ` Jason Garrett-Glaser
2010-04-11 11:52                                                   ` hugepages will matter more in the future Ingo Molnar
2010-04-11 12:01                                                     ` Avi Kivity
2010-04-11 12:35                                                       ` Ingo Molnar
2010-04-11 15:22                                                     ` Linus Torvalds
2010-04-11 15:43                                                       ` Avi Kivity
2010-04-11 15:52                                                         ` Linus Torvalds
2010-04-11 16:04                                                           ` Avi Kivity
2010-04-12  7:45                                                             ` Ingo Molnar
2010-04-12  8:14                                                               ` Nick Piggin
2010-04-12  8:22                                                                 ` Ingo Molnar
2010-04-12  8:34                                                                   ` Nick Piggin
2010-04-12  8:47                                                                     ` Avi Kivity
2010-04-12  8:45                                                                 ` Andrea Arcangeli
2010-04-11 19:35                                                           ` Andrea Arcangeli
2010-04-12 16:20                                                           ` Rik van Riel
2010-04-12 16:40                                                             ` Linus Torvalds
2010-04-12 16:56                                                               ` Linus Torvalds
2010-04-12 17:06                                                                 ` Randy Dunlap
2010-04-12 17:36                                                               ` Andrea Arcangeli
2010-04-12 17:46                                                                 ` Rik van Riel
2010-04-11 19:40                                                       ` Andrea Arcangeli
2010-04-12 15:41                                                         ` Linus Torvalds
2010-04-12 11:22                                                     ` Arjan van de Ven
2010-04-12 11:29                                                       ` Avi Kivity
2010-04-17 15:12                                                         ` Arjan van de Ven
2010-04-17 18:18                                                           ` Avi Kivity
2010-04-17 19:05                                                             ` Arjan van de Ven
2010-04-17 19:05                                                               ` Avi Kivity
2010-04-17 19:18                                                                 ` Arjan van de Ven
2010-04-17 19:20                                                                   ` Avi Kivity
2010-04-12 13:30                                                       ` Andrea Arcangeli
2010-04-12 13:33                                                         ` Avi Kivity
2010-04-12 13:39                                                           ` Andrea Arcangeli
2010-04-12 13:53                                                             ` Avi Kivity
2010-04-13 11:38                                                         ` Ingo Molnar
2010-04-13 13:17                                                           ` Andrea Arcangeli
2010-04-11 10:46                                   ` [PATCH 00 of 41] Transparent Hugepage Support #17 Ingo Molnar
2010-04-11 10:49                                     ` Ingo Molnar
2010-04-11 11:30                                     ` Avi Kivity
2010-04-11 12:08                                       ` Ingo Molnar
2010-04-11 12:24                                         ` Avi Kivity
2010-04-11 12:46                                           ` Ingo Molnar
2010-04-12  6:09                                         ` Nick Piggin
2010-04-12  6:18                                           ` Pekka Enberg
2010-04-12  6:48                                             ` Nick Piggin
2010-04-12 14:29                                             ` Christoph Lameter
2010-04-12 16:06                                               ` Nick Piggin
2010-04-12  6:36                                           ` Avi Kivity
2010-04-12  6:55                                             ` Ingo Molnar
2010-04-12  7:15                                             ` Nick Piggin
2010-04-12  7:45                                               ` Avi Kivity
2010-04-12  8:28                                                 ` Nick Piggin
2010-04-12  9:01                                                   ` Andrea Arcangeli
2010-04-12  9:03                                                   ` Avi Kivity
2010-04-12  9:26                                                     ` Nick Piggin
2010-04-12  9:39                                                       ` Andrea Arcangeli
2010-04-12 10:02                                                       ` Avi Kivity
2010-04-12 10:08                                                         ` Andrea Arcangeli
2010-04-12 10:10                                                           ` Avi Kivity
2010-04-12 10:23                                                             ` Andrea Arcangeli
2010-04-12 10:37                                                         ` Nick Piggin
2010-04-12 10:59                                                           ` Avi Kivity
2010-04-12 12:23                                                             ` Avi Kivity
2010-04-12 13:25                                                             ` Andrea Arcangeli
2010-04-13  0:38                                                         ` Andrew Morton
2010-04-13  6:18                                                           ` Neil Brown
2010-04-13 13:31                                                             ` Andrea Arcangeli
2010-04-13 13:40                                                               ` Mel Gorman
2010-04-13 13:44                                                                 ` Andrea Arcangeli
2010-04-13 13:55                                                                   ` Mel Gorman
2010-04-13 14:03                                                                     ` Andrea Arcangeli
2010-04-12  7:51                                               ` Ingo Molnar
2010-04-12  7:18                                             ` Andrea Arcangeli
2010-04-12  6:49                                           ` Ingo Molnar
2010-04-12  7:35                                             ` Andrea Arcangeli
2010-04-12  7:08                                           ` Andrea Arcangeli
2010-04-12  7:21                                             ` Nick Piggin
2010-04-12  7:50                                               ` Avi Kivity
2010-04-12  8:07                                                 ` Ingo Molnar
2010-04-12  8:21                                                   ` Andrea Arcangeli
2010-04-12 10:27                                                   ` Mel Gorman
2010-04-12  8:18                                                 ` Andrea Arcangeli
2010-04-12  8:06                                               ` Andrea Arcangeli
2010-04-12 10:44                                                 ` Mel Gorman
2010-04-12 11:12                                                   ` Avi Kivity
2010-04-12 13:17                                                   ` Andrea Arcangeli
2010-04-12 14:24                           ` Christoph Lameter
2010-04-12 14:49                             ` Avi Kivity
2010-04-06  9:55                       ` Avi Kivity
2010-04-06  9:57                         ` Avi Kivity
2010-04-06 11:55                         ` Avi Kivity
2010-04-06 13:10                           ` Nick Piggin
2010-04-06 13:22                             ` Avi Kivity
2010-04-06 13:45                               ` Nick Piggin
2010-04-06 13:57                                 ` Avi Kivity
2010-04-06 16:50                                 ` Andrea Arcangeli
2010-04-06 17:31                                   ` Avi Kivity
2010-04-06 18:00                                     ` Christoph Lameter
2010-04-06 18:04                                       ` Avi Kivity
2010-04-06 18:47                                 ` Avi Kivity
2010-04-06 14:44                             ` Rik van Riel
2010-04-06 16:43                             ` Andrea Arcangeli
2010-04-06  9:30               ` Mel Gorman
2010-04-06 10:32                 ` Theodore Tso
2010-04-06 11:16                   ` Mel Gorman
2010-04-06 13:13                     ` Theodore Tso
2010-04-06 14:55                       ` Mel Gorman
2010-04-06 16:46                       ` Andrea Arcangeli
2010-04-05 21:01         ` Chris Mason
2010-04-05 21:18           ` Avi Kivity
2010-04-05 21:33             ` Linus Torvalds
2010-04-05 22:33               ` Chris Mason
2010-04-06  8:30             ` Mel Gorman
2010-04-06 11:35               ` Chris Mason
