* RFC: Transparent Hugepage support
@ 2009-10-26 18:51 Andrea Arcangeli
From: Andrea Arcangeli @ 2009-10-26 18:51 UTC
  To: linux-mm
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

Hello,

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated onto hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven, by waking up the
   kernel daemon when the order=HPAGE_SHIFT-PAGE_SHIFT list becomes
   non-empty)

The first (and most tedious) part of this work requires allowing the
VM to handle anonymous hugepages mixed with regular pages
transparently on regular anonymous vmas. This is what this patch tries
to achieve in the least intrusive way possible. We want hugepages and
hugetlb to be used in a way that all applications can benefit from
without changes (as usual we leverage the KVM virtualization design:
by improving the Linux VM at large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k
allocation if the hugepage allocation fails! This is the _very_
opposite of some large pagecache patches that failed with -EIO back
then if a 64k (or similar) allocation failed...
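
To make the fallback concrete, the anonymous fault path of this patch
boils down to the following (simplified from handle_mm_fault and
do_huge_anonymous_page below, with the vma boundary checks trimmed):

	struct page *page;
	unsigned long haddr = address & HPAGE_MASK;

	if (pmd_none(*pmd) && sysctl_transparent_hugepage && !vma->vm_ops) {
		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
				   __GFP_REPEAT|__GFP_NOWARN,
				   HPAGE_SHIFT-PAGE_SHIFT);
		if (page)	/* got a hugepage: map it with one huge pmd */
			return __do_huge_anonymous_page(mm, vma, address,
							pmd, page, haddr);
		/* no hugepage available in the buddy: silently use 4k */
	}
	pte = pte_alloc_map(mm, vma, pmd, address);
	if (!pte)
		return VM_FAULT_OOM;
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);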

The second important decision (to reduce the impact of the feature on
the existing pagetable handling code) is that at any time we can
split a hugepage into 512 regular pages, and that it has to be done
with an operation that can't fail. This way the reliability of the
swapping isn't decreased (no need to allocate memory when we are
short on memory to swap) and it's trivial to plug a split_huge_page*
one-liner where needed without polluting the VM. Over time we can
teach mprotect, mremap and friends to handle pmd_trans_huge natively
without calling split_huge_page*. The fact it can't fail isn't just
for swap: if split_huge_page returned -ENOMEM (instead of the current
void) we'd need to roll back the mprotect from the middle of it
(ideally including undoing the split_vma), which would be a big
change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to
handle hugepages instead of rolling them back from the middle). In
short the very value of split_huge_page is that it can't fail.
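
As an example of how the one-liner gets plugged in, this is the whole
change needed in mprotect's pagetable walker (from the mm/mprotect.c
hunk below): the hugepage is split in place and the existing 4k code
runs unmodified afterwards.

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		split_huge_page_mm(mm, addr, pmd);	/* the one-liner */
		if (pmd_none_or_clear_bad(pmd))
			continue;
		change_pte_range(mm, pmd, addr, next, newprot,
				 dirty_accountable);
	} while (pmd++, addr = next, addr != end);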

The collapsing and madvise(MADV_HUGEPAGE) part will remain separate
and incremental, and it'll just be a "harmless" addition later if
this initial part is agreed upon. It should also be noted that,
locking-wise, replacing regular pages with hugepages is going to be
very easy compared to what I'm doing below in split_huge_page, as it
will only happen when page_count(page) matches page_mapcount(page)
and we can take the PG_lock and mmap_sem in write mode.
collapse_huge_page will be a "best effort" that (unlike
split_huge_page) can fail at the first sign of trouble, and we can
try again later. collapse_huge_page will be similar to how KSM works
and madvise(MADV_HUGEPAGE) will work similarly to
madvise(MADV_MERGEABLE).

For now the transparent_hugepage sysctl is for debug only (it'll be
moved to sysfs so that the kernel daemon that collapses huge pages
can be tuned from the same directory too), and we need more stats
(notably the split_huge_page* from smaps has to be removed and the
amount of hugepages in each vma should become visible in smaps too).
Adam expressed interest in adding hugepage visibility to pagemap too.

The default I like is that transparent hugepages are used at page
fault time if they're available in O(1) in the buddy. This can be
disabled via sysctl/sysfs by setting the value to 0, and if it is
disabled they will only be used inside MADV_HUGEPAGE
regions. MADV_HUGEPAGE regions will make a much bigger effort to
shrink caches to create hugepages during the page fault too, and not
only through the collapse_huge_page kernel daemon. Then a future
sysctl/sysfs value of 2 could force all page faults to make a big
effort to defrag caches and create hugepages whenever possible, while
still leaving the collapse_huge_page daemon working strictly in
MADV_HUGEPAGE regions. Obviously KVM will call madvise(MADV_HUGEPAGE)
right after the other madvise it already runs on the guest physical
memory host virtual ranges. Ideally the daemon could run system-wide
too, but I think that would tend to waste some CPU; it remains a
possibility though, and a heuristic would be to timestamp the vma
creation and start calling collapse_huge_page from the oldest vmas.
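
From userland the interface will look like any other madvise hint; a
hypothetical example (MADV_HUGEPAGE is not wired up by this patch
yet, so the constant and its value are placeholders, and
guest_mem/guest_mem_size stand for the range KVM already madvises):

	#include <sys/mman.h>

	#ifndef MADV_HUGEPAGE
	#define MADV_HUGEPAGE 14	/* placeholder, no value assigned yet */
	#endif

	/* hint that the guest physical memory range should be backed
	   by hugepages whenever possible */
	if (madvise(guest_mem, guest_mem_size, MADV_HUGEPAGE) < 0)
		perror("madvise(MADV_HUGEPAGE)");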

The pmd_trans_frozen/pmd_trans_huge locking is very solid. The
put_page (from get_user_pages users that can't use mmu notifiers,
like O_DIRECT) that runs against a __split_huge_page_refcount instead
was a pain to serialize in a way that always results in a coherent
page count for both tail and head. I think my locking solution, with
a compound_lock taken only after the first_page is valid and is still
a PageHead, should be safe, but it surely needs review from the SMP
race point of view. In short there is no existing way to serialize
the O_DIRECT final put_page against __split_huge_page_refcount, so I
had to invent a new one (O_DIRECT loses knowledge of the mapping
status by the time gup_fast returns so...). And I didn't want to
impact all gup/gup_fast users for now; maybe if we change the gup
interface substantially we can avoid this locking. I admit I didn't
think too much about it because changing the gup unpinning interface
would be invasive.
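
To clarify the ordering I mean (this is only a sketch of the idea,
not the actual put_page/__split_huge_page_refcount code, which isn't
in the hunks quoted below), the final put_page of a gup-pinned tail
page does roughly this:

	if (unlikely(PageTail(page))) {
		/* first_page is only trusted while the page is a tail */
		struct page *head = page->first_page;

		compound_lock(head);
		if (likely(PageTail(page))) {
			/*
			 * __split_huge_page_refcount can't be running:
			 * drop the tail pin and the extra head reference
			 * taken by get_page() while both counts are
			 * guaranteed to stay coherent.
			 */
		} else {
			/*
			 * We lost the race: the hugepage was split and
			 * this is a regular 4k anon page now, carrying
			 * its own refcount.
			 */
		}
		compound_unlock(head);
	}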

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) that KVM (and any other mmu
notifier user) would call without FOLL_GET (and if FOLL_GET isn't set
we'd just BUG_ON if nobody has registered itself in the current
task's mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage, so we can't avoid
solving the race of put_page against __split_huge_page_refcount if we
want a complete hugepage feature for KVM.
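
For reference, the interface I have in mind would look something like
this (hypothetical, not implemented by this patch):

	/*
	 * Like get_user_pages_fast() but taking the foll_flags
	 * explicitly. Without FOLL_GET no page reference would be taken
	 * (we'd BUG_ON unless the caller has an mmu notifier registered
	 * on current->mm), so the compound refcounting above wouldn't
	 * be needed at all.
	 */
	int get_user_pages_fast_flags(unsigned long start, int nr_pages,
				      unsigned int foll_flags,
				      struct page **pages);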

The KVM patch that enables KVM to run on transparent hugepages will
follow later (Marcelo apparently already ran KVM with hugepages on
top of this ;).

Swap and oom work fine (well, just like with regular pages ;). The
MMU notifier is handled transparently too, with the exception of the
young bit on the pmd, which didn't have a range check; but I think
KVM will be fine because the whole point of hugepages is that EPT/NPT
will also use a huge pmd when they notice gup returns pages with
PageCompound set, so they won't care about a range and there's just
the pmd young bit to check in that case.

There are likely still many missing things, especially in the basic
accounting area (overcommit/anon-rss) that I didn't pay much
attention to, and lots of cleanups are possible (including perhaps
splitting the patch as usual to make merging simpler). This is still
an RFC after all...

NOTE: in some cases, if the L2 cache is small, this may slow things
down and waste memory during COWs because 4M of memory is accessed in
a single fault instead of 8k (the payoff is that after the COW the
program can run faster). So we might want to switch copy_huge_page
(and clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it: they add a huge amount of additional complexity to save a
little bit of memory and some cache during app startup, and they
still don't substantially improve the cache thrashing during startup
(not as good as only 4k). One reason is that those copied 4k pte
entries are still mapped on a perfectly cache-colored hugepage, so
the thrashing is the worst one can generate in those copies (cows of
4k pages aren't as well colored so they thrash less, but again this
results in software running faster after the page fault). Those
prefault patches allowed things like a pte where post-cow pages were
local 4k regular anon pages and the not-yet-cowed pte entries were
pointing into the middle of some hugepage mapped read-only. If it
doesn't pay off substantially with today's hardware it will pay off
even less in the future with larger L2 caches, and the prefault logic
would bloat the VM a lot. If one is embedded and can't afford the
sysctl being 1 by default because of cache thrashing effects during
page faults, it is simple enough to just disable transparent
hugepages globally and let hugepages be allocated only in
MADV_HUGEPAGE regions (both at page fault time, and, if enabled,
through the collapse_huge_page kernel daemon too).
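
To show what I mean by non-temporal stores, here is a userland-style
sketch of the idea only (this is not the clear_huge_page in the
patch, which goes through clear_user_highpage; an in-kernel version
would rather use movnti on longs than SSE intrinsics):

	#include <emmintrin.h>
	#include <stddef.h>

	/* clear a 16-byte aligned region bypassing the caches, so a 2M
	   clear/copy doesn't evict the whole L2 during the fault */
	static void clear_region_nontemporal(void *dst, size_t size)
	{
		__m128i zero = _mm_setzero_si128();
		char *p = dst;
		size_t i;

		for (i = 0; i < size; i += 16)
			_mm_stream_si128((__m128i *)(p + i), zero);
		_mm_sfence();	/* make the NT stores visible to later stores */
	}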

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone... maybe we can
achieve mixed page sizes for them with a small change, maybe not. I
didn't think much about it so far...

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3 
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3 
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage 
vmx andrea # ./largepages3 
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	if (!p)
		return 1;

	/* first touch: page fault + clearing cost */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* memory already mapped: measures the tlb miss cost */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* touch one byte per 4k page: tlb-bound access pattern */
	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Comments welcome, thanks!

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -449,6 +449,11 @@ static inline void pte_update(struct mm_
 {
 	PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
 }
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
 
 static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
@@ -456,6 +461,12 @@ static inline void pte_update_defer(stru
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
 static inline pte_t __pte(pteval_t val)
 {
 	pteval_t ret;
@@ -557,6 +568,16 @@ static inline void set_pte_at(struct mm_
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp, pmd_t pmd)
+{
+	if (sizeof(pmdval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+	else
+		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -266,10 +266,16 @@ struct pv_mmu_ops {
 	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+	void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+	void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp);
+	void (*pmd_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pmd_t *pmdp);
 
 	pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep);
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -31,6 +31,11 @@ static inline void native_set_pte(pte_t 
 	ptep->pte_low = pte.pte_low;
 }
 
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	pmdp->pmd = pmd.pmd;
+}
+
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -31,6 +31,7 @@ extern struct list_head pgd_list;
 #else  /* !CONFIG_PARAVIRT */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 #define set_pte_at(mm, addr, ptep, pte)	native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd)	native_set_pmd_at(mm, addr, pmdp, pmd)
 
 #define set_pte_atomic(ptep, pte)					\
 	native_set_pte_atomic(ptep, pte)
@@ -55,6 +56,8 @@ extern struct list_head pgd_list;
 
 #define pte_update(mm, addr, ptep)              do { } while (0)
 #define pte_update_defer(mm, addr, ptep)        do { } while (0)
+#define pmd_update(mm, addr, ptep)              do { } while (0)
+#define pmd_update_defer(mm, addr, ptep)        do { } while (0)
 
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)
@@ -90,11 +93,21 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
 }
 
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
 static inline int pte_file(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_FILE;
@@ -145,6 +158,13 @@ static inline pte_t pte_set_flags(pte_t 
 	return native_make_pte(v | set);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
 static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 {
 	pteval_t v = native_pte_val(pte);
@@ -152,6 +172,13 @@ static inline pte_t pte_clear_flags(pte_
 	return native_make_pte(v & ~clear);
 }
 
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
 static inline pte_t pte_mkclean(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -162,11 +189,21 @@ static inline pte_t pte_mkold(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkexec(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_NX);
@@ -177,16 +214,41 @@ static inline pte_t pte_mkdirty(pte_t pt
 	return pte_set_flags(pte, _PAGE_DIRTY);
 }
 
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
 static inline pte_t pte_mkyoung(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkfreeze(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 static inline pte_t pte_mkwrite(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_PSE);
@@ -315,6 +377,11 @@ static inline int pte_same(pte_t a, pte_
 	return a.pte == b.pte;
 }
 
+static inline int pmd_same(pmd_t a, pmd_t b)
+{
+	return a.pmd == b.pmd;
+}
+
 static inline int pte_present(pte_t a)
 {
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
@@ -330,6 +397,24 @@ static inline int pmd_present(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_PRESENT;
 }
 
+static inline int pmd_trans_frozen(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return !pmd_present(pmd);
+#else
+	return 0;
+#endif
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_val(pmd) & _PAGE_PSE;
+#else
+	return 0;
+#endif
+}
+
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
@@ -346,7 +431,7 @@ static inline unsigned long pmd_page_vad
  * Currently stuck as a macro due to indirect forward reference to
  * linux/mmzone.h's __section_mem_map_addr() definition:
  */
-#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
 
 /*
  * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
@@ -367,6 +452,7 @@ static inline unsigned long pmd_index(un
  * to linux/mm.h:page_to_nid())
  */
 #define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
 /*
  * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE]
@@ -526,6 +612,12 @@ static inline void native_set_pte_at(str
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp , pmd_t pmd)
+{
+	native_set_pmd(pmdp, pmd);
+}
+
 #ifndef CONFIG_PARAVIRT
 /*
  * Rules for using pte_update - it must be called after any PTE update which
@@ -557,14 +649,21 @@ struct vm_area_struct;
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 				     unsigned long addr, pte_t *ptep);
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
@@ -575,6 +674,14 @@ static inline pte_t ptep_get_and_clear(s
 	return pte;
 }
 
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep,
@@ -601,6 +708,16 @@ static inline void ptep_set_wrprotect(st
 	pte_update(mm, addr, ptep);
 }
 
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+	pmd_update(mm, addr, pmdp);
+}
+
+extern void pmdp_freeze_flush(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp);
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -71,6 +71,18 @@ static inline pte_t native_ptep_get_and_
 	return ret;
 #endif
 }
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+#ifdef CONFIG_SMP
+	return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+	/* native_local_pmdp_get_and_clear,
+	   but duplicated because of cyclic dependency */
+	pmd_t ret = *xp;
+	native_pmd_clear(NULL, 0, xp);
+	return ret;
+#endif
+}
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -422,8 +422,11 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
+	.set_pmd_at = native_set_pmd_at,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+	.pmd_update = paravirt_nop,
+	.pmd_update_defer = paravirt_nop,
 
 	.ptep_modify_prot_start = __ptep_modify_prot_start,
 	.ptep_modify_prot_commit = __ptep_modify_prot_commit,
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
+	split_huge_page_mm(mm, 0xA0000, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -128,6 +128,10 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page)) {
+			VM_BUG_ON(atomic_read(&page->_count) < 0);
+			atomic_inc(&page->_count);
+		}
 		(*nr)++;
 		page++;
 		refs++;
@@ -148,7 +152,7 @@ static int gup_pmd_range(pud_t pud, unsi
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -288,6 +288,23 @@ int ptep_set_access_flags(struct vm_area
 	return changed;
 }
 
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+	int changed = !pmd_same(*pmdp, entry);
+
+	VM_BUG_ON(address & ~HPAGE_MASK);
+
+	if (changed && dirty) {
+		*pmdp = entry;
+		pmd_update_defer(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	}
+
+	return changed;
+}
+
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *ptep)
 {
@@ -303,6 +320,21 @@ int ptep_test_and_clear_young(struct vm_
 	return ret;
 }
 
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp)
+{
+	int ret = 0;
+
+	if (pmd_young(*pmdp))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					 (unsigned long *) &pmdp->pmd);
+
+	if (ret)
+		pmd_update(vma->vm_mm, addr, pmdp);
+
+	return ret;
+}
+
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
@@ -315,6 +347,33 @@ int ptep_clear_flush_young(struct vm_are
 	return young;
 }
 
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_MASK);
+
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+
+	return young;
+}
+
+void pmdp_freeze_flush(struct vm_area_struct *vma,
+		       unsigned long address, pmd_t *pmdp)
+{
+	int cleared;
+	VM_BUG_ON(address & ~HPAGE_MASK);
+	cleared = test_and_clear_bit(_PAGE_BIT_PRESENT,
+				     (unsigned long *)&pmdp->pmd);
+	if (cleared) {
+		pmd_update(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	}
+}
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -23,6 +23,19 @@
 	}								  \
 	__changed;							  \
 })
+
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({								\
+		int __changed = !pmd_same(*(__pmdp), __entry);		\
+		VM_BUG_ON((__address) & ~HPAGE_MASK);			\
+		if (__changed) {					\
+			set_pmd_at((__vma)->vm_mm, __address, __pmdp,	\
+				   __entry);				\
+			flush_tlb_range(__vma, __address,		\
+					(__address) + HPAGE_SIZE);	\
+		}							\
+		__changed;						\
+	})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
@@ -37,6 +50,17 @@
 			   (__ptep), pte_mkold(__pte));			\
 	r;								\
 })
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	int r = 1;							\
+	if (!pmd_young(__pmd))						\
+		r = 0;							\
+	else								\
+		set_pmd_at((__vma)->vm_mm, (__address),			\
+			   (__pmdp), pmd_mkold(__pmd));			\
+	r;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -48,6 +72,16 @@
 		flush_tlb_page(__vma, __address);			\
 	__young;							\
 })
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	__young = pmdp_test_and_clear_young(__vma, __address, __pmdp);	\
+	if (__young)							\
+		flush_tlb_range(__vma, __address,			\
+				(__address) + HPAGE_SIZE);		\
+	__young;							\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -57,6 +91,13 @@
 	pte_clear((__mm), (__address), (__ptep));			\
 	__pte;								\
 })
+
+#define pmdp_get_and_clear(__mm, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	pmd_clear((__mm), (__address), (__pmdp));			\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
@@ -88,6 +129,15 @@ do {									\
 	flush_tlb_page(__vma, __address);				\
 	__pte;								\
 })
+
+#define pmdp_clear_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd;							\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	__pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp);	\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_SIZE);	\
+	__pmd;								\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -97,10 +147,25 @@ static inline void ptep_set_wrprotect(st
 	pte_t old_pte = *ptep;
 	set_pte_at(mm, address, ptep, pte_wrprotect(old_pte));
 }
+
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+
+#define pmdp_freeze_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = pmd_mkfreeze(*(__pmdp));				\
+	VM_BUG_ON((__address) & ~HPAGE_MASK);				\
+	set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd);		\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_SIZE);	\
+})
 #endif
 
 #ifndef __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)	(pte_val(A) == pte_val(B))
+#define pmd_same(A,B)	(pmd_val(A) == pmd_val(B))
 #endif
 
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -294,6 +294,20 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+	while (TestSetPageCompoundLock(page))
+		while (PageCompoundLock(page))
+			cpu_relax();
+	smp_mb();
+}
+
+static inline void compound_unlock(struct page *page)
+{
+	smp_mb();
+	ClearPageCompoundLock(page);
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
@@ -308,9 +322,14 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+		/* __split_huge_page_refcount can't run under get_page */
+		VM_BUG_ON(!PageTail(page));
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -364,6 +383,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
@@ -804,6 +836,64 @@ int invalidate_inode_page(struct page *p
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+
+extern int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd, pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern int sysctl_transparent_hugepage;
+extern void __split_huge_page_mm(struct mm_struct *mm, unsigned long address,
+				 pmd_t *pmd);
+extern void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd);
+extern int split_huge_page(struct page *page);
+#define split_huge_page_mm(__mm, __addr, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_mm(__mm, __addr, __pmd);	\
+	}  while (0)
+#define split_huge_page_vma(__vma, __pmd)				\
+	do {								\
+		if (unlikely(pmd_trans_huge(*(__pmd))))			\
+			__split_huge_page_vma(__vma, __pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		smp_mb();						\
+		spin_unlock_wait(&(__anon_vma)->lock);			\
+		smp_mb();						\
+		VM_BUG_ON(pmd_trans_frozen(*(__pmd)) ||			\
+			  pmd_trans_huge(*(__pmd)));			\
+	} while (0)
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define sysctl_transparent_hugepage 0
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_mm(__mm, __addr, __pmd)	\
+	do { }  while (0)
+#define split_huge_page_vma(__vma, __pmd)	\
+	do { }  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #else
 static inline int handle_mm_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
@@ -904,7 +994,8 @@ static inline int __pmd_alloc(struct mm_
 int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address);
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 /*
@@ -973,12 +1064,14 @@ static inline void pgtable_page_dtor(str
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc_map(mm, pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
-		NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address)				\
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma,	\
+							pmd, address))?	\
+	 NULL: pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, NULL,	\
+							pmd, address))?	\
 		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,6 +287,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
 	__pte;								\
 })
 
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_SIZE);	\
+	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_SIZE);	\
+	__pmd;								\
+})
+
+#define pmdp_freeze_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_SIZE);	\
+	pmdp_freeze_flush(___vma, ___address, __pmdp);			\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_SIZE);	\
+})
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
 	__young;							\
 })
 
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
 #define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
 ({									\
 	struct mm_struct *___mm = __mm;					\
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
 }
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_freeze_flush_notify pmdp_freeze_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,7 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+	PG_compound_lock,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -239,6 +240,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk)
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
 PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
 
+PAGEFLAG(CompoundLock, compound_lock) TESTSETFLAG(CompoundLock, compound_lock)
+
 #ifdef CONFIG_HIGHMEM
 /*
  * Must use a macro here due to header dependency issues. page_zone() is not
@@ -346,7 +349,7 @@ static inline void set_page_writeback(st
  * tests can be used in performance sensitive paths. PageCompound is
  * generally not used in hot code paths.
  */
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
 
 static inline int PageCompound(struct page *page)
@@ -354,6 +357,13 @@ static inline int PageCompound(struct pa
 	return page->flags & ((1L << PG_head) | (1L << PG_tail));
 
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
+}
+#endif
 #else
 /*
  * Reduce page flag use as much as possible by overlapping
@@ -391,6 +401,14 @@ static inline void __ClearPageTail(struc
 	page->flags &= ~PG_head_tail_mask;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON((page->flags & PG_head_tail_mask) != (1L << PG_compound));
+	clear_bit(PG_compound, &page->flags);
+}
+#endif
+
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -73,6 +73,7 @@ void page_remove_rmap(struct page *);
 
 static inline void page_dup_rmap(struct page *page)
 {
+	VM_BUG_ON(PageTail(page));
 	atomic_inc(&page->_mapcount);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -496,6 +496,9 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON(mm->pmd_huge_pte);
+#endif
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -636,6 +639,10 @@ struct mm_struct *dup_mm(struct task_str
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	mm->pmd_huge_pte = NULL;
+#endif
+
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1422,6 +1422,16 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "transparent_hugepage",
+		.data		= &sysctl_transparent_hugepage,
+		.maxlen		= sizeof(sysctl_transparent_hugepage),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 
 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -290,3 +290,16 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config TRANSPARENT_HUGEPAGE
+	bool "Transparent Hugepage support"
+	depends on X86_64
+	help
+	  Transparent Hugepages allows the kernel to use huge pages and
+	  huge tlb transparently to the applications whenever possible.
+	  This feature can improve computing performance to certain
+	  applications by speeding up page faults during memory
+	  allocation, by reducing the number of tlb misses and by speeding
+	  up the pagetable walking.
+
+	  If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -45,3 +45,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,376 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+int sysctl_transparent_hugepage __read_mostly = 1;
+
+static void clear_huge_page(struct page *page, unsigned long addr)
+{
+	int i;
+
+	might_sleep();
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+		cond_resched();
+		clear_user_highpage(page + i, addr + PAGE_SIZE * i);
+	}
+}
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_anonymous_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
+				    unsigned long address, pmd_t *pmd,
+				    struct page *page,
+				    unsigned long haddr)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, address);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr);
+
+	__SetPageUptodate(page);
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+	}
+	spin_unlock(&mm->page_table_lock);
+	
+	return ret;
+}
+
+int do_huge_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+				   __GFP_REPEAT|__GFP_NOWARN,
+				   HPAGE_SHIFT-PAGE_SHIFT);
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_anonymous_page(mm, vma,
+						address, pmd,
+						page, haddr);
+	}
+out:
+	pte = pte_alloc_map(mm, vma, pmd, address);
+	if (!pte)
+		return VM_FAULT_OOM;
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+static void copy_huge_page(struct page *dst_page, struct page *src_page,
+			   unsigned long addr, struct vm_area_struct *vma)
+{
+	int i;
+
+	might_sleep();
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+		copy_user_highpage(dst_page + i, src_page + i,
+				   addr + PAGE_SIZE * i, vma);
+		cond_resched();
+	}
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd)))
+		goto out_unlock;
+	if (unlikely(pmd_trans_frozen(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_pgtable(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, anon_rss, 1<<(HPAGE_SHIFT-PAGE_SHIFT));
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL; /* debug */
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0, i;
+	struct page *page, *new_page;
+	unsigned long haddr;
+	struct page **pages;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_pgtable(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
+			      __GFP_REPEAT|__GFP_NOWARN,
+			      HPAGE_SHIFT-PAGE_SHIFT);
+#ifdef CONFIG_DEBUG_VM
+	if (sysctl_transparent_hugepage == -1  && new_page) {
+		put_page(new_page);
+		new_page = NULL;
+	}
+#endif
+	if (unlikely(!new_page)) {
+		pgtable_t pgtable;
+		pmd_t _pmd;
+
+		pages = kzalloc(sizeof(struct page *) *
+				(1<<(HPAGE_SHIFT-PAGE_SHIFT)),
+				GFP_KERNEL);
+		if (unlikely(!pages)) {
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+		
+		for (i = 0; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT); i++) {
+			pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+						  vma, address);
+			if (unlikely(!pages[i])) {
+				while (--i >= 0)
+					put_page(pages[i]);
+				kfree(pages);
+				ret |= VM_FAULT_OOM;
+				goto out;
+			}
+		}
+
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd)))
+			goto out_free_pages;
+		else
+			get_page(page);
+		spin_unlock(&mm->page_table_lock);
+
+		might_sleep();
+		for (i = 0; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT); i++) {
+			copy_user_highpage(pages[i], page + i,
+					   haddr + PAGE_SIZE*i, vma);
+			__SetPageUptodate(pages[i]);
+			cond_resched();
+		}
+
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd)))
+			goto out_free_pages;
+		else
+			put_page(page);
+
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		/* leave pmd empty until pte is filled */
+
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT);
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			entry = mk_pte(pages[i], vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			page_add_new_anon_rmap(pages[i], vma, haddr);
+			pte = pte_offset_map(&_pmd, haddr);
+			VM_BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+		kfree(pages);
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pgtable);
+		spin_unlock(&mm->page_table_lock);
+
+		ret |= VM_FAULT_WRITE;
+		page_remove_rmap(page);
+		put_page(page);
+		goto out;
+	}
+
+	copy_huge_page(new_page, page, haddr, vma);
+	__SetPageUptodate(new_page);
+
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+
+out_free_pages:
+	for (i = 0; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT); i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out_unlock;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_pgtable(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * And if the dirty bit will become meaningful and
+		 * we'll only set it with FOLL_WRITE, an atomic
+		 * set_bit will be required on the pmd to set the
+		 * young bit, instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -324,9 +324,11 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
+	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -346,14 +348,18 @@ int __pte_alloc(struct mm_struct *mm, pm
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	wait_split_huge_page = 0;
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	}
+	} else if (unlikely(pmd_trans_frozen(*pmd)))
+		wait_split_huge_page = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
+	if (wait_split_huge_page)
+		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -366,10 +372,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	}
+	} else
+		VM_BUG_ON(pmd_trans_frozen(*pmd));
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -637,9 +644,9 @@ out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
@@ -699,6 +706,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -895,6 +912,35 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			spin_lock(&tlb->mm->page_table_lock);
+			if (likely(pmd_trans_huge(*pmd))) {
+				if (unlikely(pmd_trans_frozen(*pmd))) {
+					spin_unlock(&tlb->mm->page_table_lock);
+					wait_split_huge_page(vma->anon_vma,
+							     pmd);
+				} else {
+					struct page *page;
+					pgtable_t pgtable;
+					pgtable = get_pmd_huge_pte(tlb->mm);
+					page = pfn_to_page(pmd_pfn(*pmd));
+					VM_BUG_ON(!PageCompound(page));
+					pmd_clear(pmd);
+					spin_unlock(&tlb->mm->page_table_lock);
+					page_remove_rmap(page);
+					VM_BUG_ON(page_mapcount(page) < 0);
+					add_mm_counter(tlb->mm, anon_rss,
+						       -1<<(HPAGE_SHIFT-
+							    PAGE_SHIFT));
+					put_page(page);
+					pte_free(tlb->mm, pgtable);
+					(*zap_work)--;
+					continue;
+				}
+			} else
+				spin_unlock(&tlb->mm->page_table_lock);
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1160,11 +1206,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_frozen(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -1273,6 +1335,7 @@ int __get_user_pages(struct task_struct 
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
@@ -1925,19 +1988,6 @@ static inline int pte_unmap_same(struct 
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*
@@ -2926,9 +2976,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3004,7 +3054,23 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	if (pmd_none(*pmd) && sysctl_transparent_hugepage) {
+		if (!vma->vm_ops)
+			return do_huge_anonymous_page(mm, vma, address,
+						      pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_frozen(orig_pmd))
+				return do_huge_wp_page(mm, vma, address,
+						       pmd, orig_pmd);
+			return 0;
+		}
+	}
+	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
 
@@ -3144,6 +3210,7 @@ static int follow_pte(struct mm_struct *
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -450,6 +450,7 @@ static inline int check_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_vma(vma, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -102,6 +102,7 @@ static void remove_migration_pte(struct 
                 return;
 
 	pmd = pmd_offset(pud, addr);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (!pmd_present(*pmd))
 		return;
 
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -95,6 +95,7 @@ static long do_mincore(unsigned long add
 	if (pud_none_or_clear_bad(pud))
 		goto none_mapped;
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_vma(vma, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto none_mapped;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -89,6 +89,7 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,13 +42,15 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_mm(mm, addr, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 	return pmd;
 }
 
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -63,7 +65,7 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	if (!pmd)
 		return NULL;
 
-	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+	if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, addr))
 		return NULL;
 
 	return pmd;
@@ -148,7 +150,7 @@ unsigned long move_page_tables(struct vm
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -310,6 +310,7 @@ void prep_compound_page(struct page *pag
 	}
 }
 
+/* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;
@@ -587,6 +588,8 @@ static void __free_pages_ok(struct page 
 
 	kmemcheck_free_shadow(page, order);
 
+	if (PageAnon(page))
+		page->mapping = NULL;
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
 	if (bad)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -33,6 +33,7 @@ static int walk_pmd_range(pud_t *pud, un
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_mm(walk->mm, addr, pmd);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -55,8 +55,10 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #include "internal.h"
 
@@ -260,6 +262,42 @@ unsigned long page_address_in_vma(struct
 	return vma_address(page, vma);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static pmd_t *__page_check_address_pmd(struct page *page, struct mm_struct *mm,
+				       unsigned long address, int notfrozen)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	VM_BUG_ON(notfrozen == 1 && pmd_trans_frozen(*pmd));
+	if (pmd_trans_huge(*pmd) && pmd_pgtable(*pmd) == page) {
+		VM_BUG_ON(notfrozen == -1 && !pmd_trans_frozen(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+#define page_check_address_pmd(__page, __mm, __address) \
+	__page_check_address_pmd(__page, __mm, __address, 0)
+#define page_check_address_pmd_notfrozen(__page, __mm, __address) \
+	__page_check_address_pmd(__page, __mm, __address, 1)
+#define page_check_address_pmd_frozen(__page, __mm, __address) \
+	__page_check_address_pmd(__page, __mm, __address, -1)
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Check that @page is mapped at @address into @mm.
  *
@@ -344,39 +382,21 @@ static int page_referenced_one(struct pa
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -385,9 +405,42 @@ static int page_referenced_one(struct pa
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageCompound(page))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address);
+		if (pmd && !pmd_trans_frozen(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+		VM_BUG_ON(1);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 out:
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
@@ -1210,6 +1263,10 @@ int try_to_unmap(struct page *page, enum
 
 	BUG_ON(!PageLocked(page));
 
+	if (unlikely(PageCompound(page)))
+		if (unlikely(split_huge_page(page)))
+			return SWAP_AGAIN;
+
 	if (PageAnon(page))
 		ret = try_to_unmap_anon(page, flags);
 	else
@@ -1243,3 +1300,221 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int __split_huge_page_freeze(struct page *page,
+				    struct vm_area_struct *vma,
+				    unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	if (unlikely(address == -EFAULT))
+		goto out;
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd_notfrozen(page, mm, address);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to freeze it, pmd_huge must remain on at all
+		 * times.
+		 */
+		pmdp_freeze_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+
+	compound_lock(page);
+
+	for (i = 1; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT); i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+		page_tail->index = ++head_index;
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		if (page_evictable(page_tail, NULL))
+			lru_cache_add_lru(page_tail, LRU_ACTIVE_ANON);
+		else
+			add_page_to_unevictable_list(page_tail);
+		put_page(page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	if (unlikely(address == -EFAULT))
+		goto out;
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd_frozen(page, mm, address);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < 1<<(HPAGE_SHIFT-PAGE_SHIFT);
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pgtable);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* must be called with anon_vma->lock held */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct vm_area_struct *vma;
+
+	BUG_ON(!PageHead(page));
+
+	mapcount = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+		mapcount += __split_huge_page_freeze(page, vma,
+						     vma_address(page, vma));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+		mapcount2 += __split_huge_page_map(page, vma,
+						   vma_address(page, vma));
+	BUG_ON(mapcount != mapcount2);
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_vma(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct page *page;
+	struct anon_vma *anon_vma;
+	struct mm_struct *mm;
+
+	BUG_ON(vma->vm_flags & VM_HUGETLB);
+
+	mm = vma->vm_mm;
+	BUG_ON(down_write_trylock(&mm->mmap_sem));
+
+	anon_vma = vma->anon_vma;
+
+	spin_lock(&anon_vma->lock);
+	BUG_ON(pmd_trans_frozen(*pmd));
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		spin_unlock(&anon_vma->lock);
+		return;
+	}
+	page = pmd_pgtable(*pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	__split_huge_page(page, anon_vma);
+
+	spin_unlock(&anon_vma->lock);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
+
+/* must run with mmap_sem held to prevent the vma from going away */
+void __split_huge_page_mm(struct mm_struct *mm,
+			  unsigned long address,
+			  pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	BUG_ON(vma->vm_start > address);
+	BUG_ON(vma->vm_mm != mm);
+
+	__split_huge_page_vma(vma, pmd);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+ 	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -55,17 +55,80 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_page(page);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock(page_head);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock(page_head);
+			out_put_head:
+				put_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+			else
+				compound_unlock(page_head);
+			return;
+		} else
+			/* page_head is a dangling pointer */
+			goto out_put_single;
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -74,7 +137,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -152,6 +152,10 @@ int add_to_swap(struct page *page)
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageUptodate(page));
 
+	if (unlikely(PageCompound(page)))
+		if (unlikely(split_huge_page(page)))
+			return 0;
+
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -896,6 +896,8 @@ static inline int unuse_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (unlikely(pmd_trans_huge(*pmd)))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);


* Re: RFC: Transparent Hugepage support
  2009-10-26 18:51 RFC: Transparent Hugepage support Andrea Arcangeli
@ 2009-10-27 15:41 ` Rik van Riel
  2009-10-27 18:18 ` Andi Kleen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2009-10-27 15:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On 10/26/2009 02:51 PM, Andrea Arcangeli wrote:
> Hello,
>
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs.

I believe your approach is the right one.

It would be interesting to see how much of a performance gain
is seen with real applications, though from hugetlbfs experience
we already know that some applications can see significant
performance gains from using large pages.

As for the code - this patch is a little too big to comment
on all the details individually, but most of the code looks
good.

It would be nice if some of the code duplication with hugetlbfs
could be removed and the patch could be turned into a series of
more reasonably sized patches before a merge.

-- 
All rights reversed.


* Re: RFC: Transparent Hugepage support
  2009-10-26 18:51 RFC: Transparent Hugepage support Andrea Arcangeli
  2009-10-27 15:41 ` Rik van Riel
@ 2009-10-27 18:18 ` Andi Kleen
  2009-10-27 19:30   ` Andrea Arcangeli
  2009-10-27 20:42 ` Christoph Lameter
  2009-10-31 21:29 ` Benjamin Herrenschmidt
  3 siblings, 1 reply; 43+ messages in thread
From: Andi Kleen @ 2009-10-27 18:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

Andrea Arcangeli <aarcange@redhat.com> writes:

In general the best would be to just merge hugetlbfs into
the normal VM. It has been growing for far too long as a separate
"second VM" by now. This seems like a reasonable first step,
but some comments below.

Haven't looked at the actual code at this point.

> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split an
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach

The problem is that this will interact badly with 1GB pages -- once
you split them up you'll never get them back, because they 
can't be allocated at runtime.

Even for 2MB pages it can be a problem.

You'll likely need to fix the page table code.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: RFC: Transparent Hugepage support
  2009-10-27 20:42 ` Christoph Lameter
@ 2009-10-27 18:21   ` Andrea Arcangeli
  2009-10-27 20:25     ` Chris Wright
  2009-10-29 18:55     ` Christoph Lameter
  0 siblings, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-27 18:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, Oct 27, 2009 at 04:42:39PM -0400, Christoph Lameter wrote:
> > 1) hugepages have to be swappable or the guest physical memory remains
> >    locked in RAM and can't be paged out to swap
> 
> > That's not such a big issue IMHO. Paging is not necessary. Swapping is
> deadly to many performance based loads. You would abort a job anyways that

Yes, swapping is deadly to performance based loads and it should be
avoided as much as possible, but it's not nice when, in order to get a
boost in guest performance while the host isn't low on memory, you lose
the ability to swap when the host is low on memory and all VMs are
locked in memory like in inferior-design virtual machines that won't
ever support paging. When the system starts swapping, the manager can
migrate the VM to other hosts with more memory free to restore the
full RAM performance as soon as possible. Overcommit can be very
useful for maxing out RAM utilization, just as it does for regular
linux tasks (few people run with overcommit = 2 for this very
reason... besides, overcommit = 2 includes swap in its equation so you
can still max out ram by adding more free swap).

> is going to swap. On the other hand I wish we would have migration support
> (which may be contingent on swap support).

Agreed, migration is important on numa systems as much as swapping is
important on regular hosts, and this patch allows both in the very
same way with a few-line addition (that is a noop and doesn't modify
the kernel binary when CONFIG_TRANSPARENT_HUGEPAGE=N). The hugepages
in this patch should already be relocatable just fine with move_pages (I
say "should" because I didn't test move_pages yet ;).

> > 2) if a hugepage allocation fails, regular pages should be allocated
> >    instead and mixed in the same vma without any failure and without
> >    userland noticing
> 
> > Won't you be running into issues with page dirtying on that level?

Not sure I follow what the problem should be. At the moment when
pmd_trans_huge is true, the dirty bit is meaningless (hugepages at the
moment are split in place into regular pages before they can be
converted to swapcache; only after a hugepage becomes swapcache does its
dirty bit on the pte become meaningful, to handle the case of an
exclusive swapcache mapped writeable into a single pte and marked
clean to be able to swap it out at zero cost if memory pressure returns
and to avoid a COW if the page is written to before it is paged out
again), but the accessed bit is already handled just fine at the pmd
level.
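
(For reference, the pmd-level accessed bit handling is the
page_referenced_one() hunk in the patch above; roughly:

	spin_lock(&mm->page_table_lock);
	pmd = page_check_address_pmd(page, mm, address);
	if (pmd && !pmd_trans_frozen(*pmd) &&
	    pmdp_clear_flush_young_notify(vma, address, pmd))
		referenced++;
	spin_unlock(&mm->page_table_lock);

so a huge mapping ages through the same page_referenced() path as a
pte-mapped page, just with one young bit per 2M instead of per 4k.)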

> > 3) if some task quits and more hugepages become available in the
> >    buddy, guest physical memory backed by regular pages should be
> >    relocated on hugepages automatically in regions under
> >    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
> >    kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not
> >    null)
> 
> Oww. This sounds like a heuristic page promotion demotion scheme.
> http://www.cs.rice.edu/~jnavarro/superpages/
> We have discussed this a couple of times and there was a strong feeling
> that the heuristics are bad. But that may no longer be the case since we
> already have stuff like KSM in the kernel. Memory management may get very
> complex in the future.

The good thing is, all the real complexity is in the patch I posted.
That solves the locking and the handling of hugepages in regular vmas.
The complexity of the collapse_huge_page daemon, which will scan the
MADV_HUGEPAGE registered mappings and relocate regular pages into
hugepages whenever hugepages become available in the buddy, will be
_self_contained_. So it'll be additional complex code, yes, but it will
be self contained in huge_memory.c and it won't make the VM any more
complex than this patch already does.

Plus the daemon will be off by default, just like kksmd has to be off
by default at boot...

If you run linux purely as a hypervisor it's ok to spend some CPU to
make sure all 2M pages that become available immediately go to
replace fragmented pages, so that the NPT pagetables become 3 levels
instead of 4 levels and the guest immediately runs faster.

> > The most important design choice is: always fallback to 4k allocation
> > if the hugepage allocation fails! This is the _very_ opposite of some
> > large pagecache patches that failed with -EIO back then if a 64k (or
> > similar) allocation failed...
> 
> Those also had fall back logic to 4k. Does this scheme also allow I/O with

Well maybe I remember your patches wrong, or I might not have followed
later developments, but I seem to remember that when we discussed
it, the reason for the -EIO failure was that the fs had a soft blocksize
bigger than 4k... and in general a fs can't handle a blocksize bigger than
PAGE_CACHE_SIZE... In effect the core trouble wasn't the large
pagecache but the fact the fs wanted a blocksize larger than
PAGE_SIZE, despite not being able to handle it if the block was
split into multiple non-contiguous 4k areas.

> Hugepages through the VFS layer?

Hugepages right now can only be transparently mapped and
swapped/split in anon mappings, not in file mappings (not even the
MAP_PRIVATE ones that generate anonymous cache with the COW). This is
to keep it simple. Also keep in mind this is motivated by KVM needing
to run faster, like other hypervisors that support hugepages. We
can already handle hugepages to get the hardware boost, but we want
our guests to run as fast as possible _always_ (not only if hugepages
are reserved at boot to avoid memory failure at runtime, or only if the
user is ok with giving up swap, and we don't want to lose the other
features of regular mappings including migration, plus we want the regular
pages to be collapsed into hugepages when they become available). The
whole guest physical memory is mapped by anonymous vmas, so it is
natural to start from there... It's also orders of magnitude simpler
to start from there than to address pagecache ;). Nothing will prevent
us from extending this logic to pagecache later...

> > Second important decision (to reduce the impact of the feature on the
> > existing pagetable handling code) is that at any time we can split an
> > hugepage into 512 regular pages and it has to be done with an
> > operation that can't fail. This way the reliability of the swapping
> > isn't decreased (no need to allocate memory when we are short on
> > memory to swap) and it's trivial to plug a split_huge_page* one-liner
> > where needed without polluting the VM. Over time we can teach
> > mprotect, mremap and friends to handle pmd_trans_huge natively without
> > calling split_huge_page*. The fact it can't fail isn't just for swap:
> > if split_huge_page would return -ENOMEM (instead of the current void)
> > we'd need to rollback the mprotect from the middle of it (ideally
> > including undoing the split_vma) which would be a big change and in
> > the very wrong direction (it'd likely be simpler not to call
> > split_huge_page at all and to teach mprotect and friends to handle
> > hugepages instead of rolling them back from the middle). In short the
> > very value of split_huge_page is that it can't fail.
> 
> I don't get the point of this. What do you mean by "an operation that
> cannot fail"? Atomic section?

In short I mean it cannot return -ENOMEM (and an additional bonus is
that I managed to make it not require scheduling or blocking
operations). The idea is that you can plug it anywhere with a
one-liner and your code becomes hugepage compatible (sure it would run
faster if you were to teach your code to handle pmd_trans_huge
natively but we can't do it all at once :).
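
(To illustrate the one-liner idiom, this is roughly how a generic pmd
walker gets converted, modeled on the mm/mprotect.c hunk in the patch;
only the split_huge_page_mm() call is new, the rest is the usual loop:

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		/* the one-liner: demote any huge pmd to regular ptes */
		split_huge_page_mm(mm, addr, pmd);
		if (pmd_none_or_clear_bad(pmd))
			continue;
		change_pte_range(mm, pmd, addr, next, newprot,
				 dirty_accountable);
	} while (pmd++, addr = next, addr != end);

After the split the pte-level code below never sees a pmd_trans_huge
entry, so it needs no changes.)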

> > The default I like is that transparent hugepages are used at page
> > fault time if they're available in O(1) in the buddy. This can be
> > disabled via sysctl/sysfs setting the value to 0, and if it is
> 
> The consequence of this could be a vast waste of memory if you f.e. touch
> memory only in 1 megabyte increments.

Sure, this is the feature... But if somebody does mmap(2M) supposedly
he's not only going to touch 4k, or I'd blame the app and not
the kernel that tries to make that 2M mapping so much faster both at
page fault time (hugely faster ;) and later during random access too.

Now it may very well be that the default should be disabled, but I really
doubt anybody with a regular workstation wants it off by
default. Surely embedded should turn it off, and stick to madvise for
their regions (libhugetlbfs will become a bit simpler by only having
to run madvise after mmap) to be sure not to waste any precious kbyte.
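
(For the madvise case the intended userland usage is nothing more than
the sketch below; the helper name is made up and MADV_HUGEPAGE is the
advice value proposed by this RFC, not something in today's headers:

	#include <sys/mman.h>

	static void *alloc_thp_backed(size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p != MAP_FAILED)
			madvise(p, len, MADV_HUGEPAGE); /* opt the vma in */
		return p;
	}

i.e. exactly the "madvise after mmap" step libhugetlbfs would be left
with.)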

> Separate the patch into a patchset for easy review.

I'll try yes...

Thanks!
Andrea


* Re: RFC: Transparent Hugepage support
  2009-10-27 18:18 ` Andi Kleen
@ 2009-10-27 19:30   ` Andrea Arcangeli
  2009-10-28  4:28     ` Andi Kleen
  2009-10-29 12:54     ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-27 19:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, Oct 27, 2009 at 07:18:26PM +0100, Andi Kleen wrote:
> In general the best would be to just merge hugetlbfs into
> the normal VM. It has been growing for far too long as a separate
> "second VM" by now. This seems like a reasonable first step,
> but some comments below.

Problem is hugetlbfs as it stands now can't be merged... it
deliberately takes its own paths and it tries to be as far away from
the VM as possible. But as you said, as people try to make hugetlbfs
vmas "more similar to the regular vmas" hugetlbfs slowly spreads into
the VM code defeating the whole reason why hugetlbfs magic exists
(i.e. to be out of the way of the VM as much as possible). Trying to
make hugetlbfs more similar to regular vmas makes the VM more complex
while still not achieving full features for hugetlbfs (notably
overcommit and paging).

> The problem is that this will interact badly with 1GB pages -- once
> you split them up you'll never get them back, because they 
> can't be allocated at runtime.

1GB pages can't be handled by this code, and clearly it's not
practical to hope 1G pages to materialize in the buddy (even if we
were to increase the buddy so much slowing it down regular page
allocation). Let's forget 1G pages here... we're only focused on sizes
that can be allocated dynamically. Main problem are the 64k pages or
such that don't fit into a pmd...

> Even for 2MB pages it can be a problem.
> 
> You'll likely need to fix the page table code.

In terms of fragmentation split_huge_page itself won't create
it.. unless it swaps (but then CPU performance is lost on the mapping
anyway). We need to teach mprotect/mremap not to call split_huge_page
true, but not to avoid fragmentation. btw, thinking at fragmentation
generated by munmap(last 4k), I also think I found a minor bug in munmap
if only part of the 2M page is unmapped (currently I'm afraid I'm
dropping the whole 2M in that case ;), but it's trivial to
fix... clearly not many apps are truncating 4k off a 2M mapping.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-27 18:21   ` Andrea Arcangeli
@ 2009-10-27 20:25     ` Chris Wright
  2009-10-29 18:51       ` Christoph Lameter
  2009-10-29 18:55     ` Christoph Lameter
  1 sibling, 1 reply; 43+ messages in thread
From: Chris Wright @ 2009-10-27 20:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

* Andrea Arcangeli (aarcange@redhat.com) wrote:
> On Tue, Oct 27, 2009 at 04:42:39PM -0400, Christoph Lameter wrote:
> > > 1) hugepages have to be swappable or the guest physical memory remains
> > >    locked in RAM and can't be paged out to swap
> > 
> > That's not such a big issue IMHO. Paging is not necessary. Swapping is
> > deadly to many performance based loads. You would abort a job anyways that
> 
> Yes, swapping is deadly to performance based loads and it should be
> avoided as much as possible, but it's not nice when, in order to get a
> boost in guest performance while the host isn't low on memory, you lose
> the ability to swap when the host is low on memory and all VMs are
> locked in memory like in inferior-design virtual machines that won't
> ever support paging. When the system starts swapping, the manager can
> migrate the VM to other hosts with more memory free to restore the
> full RAM performance as soon as possible. Overcommit can be very
> useful for maxing out RAM utilization, just as it does for regular
> linux tasks (few people run with overcommit = 2 for this very
> reason... besides, overcommit = 2 includes swap in its equation so you
> can still max out ram by adding more free swap).

It's also needed if something like glibc were to take advantage of it in
a generic manner.

thanks,
-chris


* Re: RFC: Transparent Hugepage support
  2009-10-26 18:51 RFC: Transparent Hugepage support Andrea Arcangeli
  2009-10-27 15:41 ` Rik van Riel
  2009-10-27 18:18 ` Andi Kleen
@ 2009-10-27 20:42 ` Christoph Lameter
  2009-10-27 18:21   ` Andrea Arcangeli
  2009-10-31 21:29 ` Benjamin Herrenschmidt
  3 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2009-10-27 20:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Mon, 26 Oct 2009, Andrea Arcangeli wrote:

> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:

Transparent huge page support is something that would be useful in many
areas. The larger memories grow the more pressing the issue will become.

> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap

That's not such a big issue IMHO. Paging is not necessary. Swapping is
deadly to many performance based loads. You would abort a job anyways that
is going to swap. On the other hand I wish we would have migration support
(which may be contingent on swap support).

> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing

Won't you be running into issues with page dirtying on that level?

> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not
>    null)

Oww. This sounds like a heuristic page promotion demotion scheme.
http://www.cs.rice.edu/~jnavarro/superpages/
We have discussed this a couple of times and there was a strong feeling
that the heuristics are bad. But that may no longer be the case since we
already have stuff like KSM in the kernel. Memory management may get very
complex in the future.

> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...

Those also had fall back logic to 4k. Does this scheme also allow I/O with
Hugepages through the VFS layer?

> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split an
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.

I don't get the point of this. What do you mean by "an operation that
cannot fail"? Atomic section?

> The default I like is that transparent hugepages are used at page
> fault time if they're available in O(1) in the buddy. This can be
> disabled via sysctl/sysfs setting the value to 0, and if it is

The consequence of this could be a vast waste of memory if you f.e. touch
memory only in 1 megabyte increments.

Separate the patch into a patchset for easy review.


* Re: RFC: Transparent Hugepage support
  2009-10-27 19:30   ` Andrea Arcangeli
@ 2009-10-28  4:28     ` Andi Kleen
  2009-10-28 12:00       ` Andrea Arcangeli
  2009-10-29  9:43         ` Ingo Molnar
  2009-10-29 12:54     ` Andrea Arcangeli
  1 sibling, 2 replies; 43+ messages in thread
From: Andi Kleen @ 2009-10-28  4:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, Oct 27, 2009 at 08:30:07PM +0100, Andrea Arcangeli wrote:

Hi Andrea,

> On Tue, Oct 27, 2009 at 07:18:26PM +0100, Andi Kleen wrote:
> > In general the best would be to just merge hugetlbfs into
> > the normal VM. It has been growing for far too long as a separate
> > "second VM" by now. This seems like a reasonable first step,
> > but some comments below.
> 
> Problem is hugetlbfs as it stands now can't be merged... it
> deliberately takes its own paths and it tries to be as far away from
> the VM as possible. But as you said, as people try to make hugetlbfs

I think longer term the standard VM just needs to understand
huge pages properly. Originally when huge pages were only
considered an "Oracle hack" the separation made sense, but now
with more and more use that is really not true anymore.

Also hugetlbfs is gaining more and more functionality all the time.

Maintaining two VMs in parallel forever seems like the wrong
thing to do.

Also the fragmentation avoidance heuristics got a lot better
in the last years, so it's much more practical than it used to be
(at least for 2MB)

> > The problem is that this will interact badly with 1GB pages -- once
> > you split them up you'll never get them back, because they 
> > can't be allocated at runtime.
> 
> 1GB pages can't be handled by this code, and clearly it's not
> practical to hope 1G pages to materialize in the buddy (even if we

That seems short-sighted. You do this because 2MB pages give you
x% performance advantage, but then it's likely that 1GB pages will give 
another y% improvement and why should people stop at the smaller
improvement?

Ignoring the gigantic pages now would just mean that this
would need to be revised later again or that users still
need to use hacks like libhugetlbfs.

Given 1GB pages for a time are harder to use on the system
administrator level, but at least for applications the interfaces
should be similar at least.

> were to increase the buddy so much slowing it down regular page
> allocation). Let's forget 1G pages here... we're only focused on sizes
> that can be allocated dynamically. Main problem are the 64k pages or
> such that don't fit into a pmd...

What 64k pages? You're talking about soft pages or non x86?
> 
> > Even for 2MB pages it can be a problem.
> > 
> > You'll likely need to fix the page table code.
> 
> In terms of fragmentation split_huge_page itself won't create
> it.. unless it swaps (but then CPU performance is lost on the mapping
> anyway).

The problem is that the performance will be lost forever. So if
you ever do something that only does a little temporary 
swapping (like a backup run) you would be ready for a reboot.
Not good.

>  We need to teach mprotect/mremap not to call split_huge_page
> true, but not to avoid fragmentation. btw, thinking at fragmentation

I think they just have to be fixed properly.

My suspicion is btw that there's some more code sharing possible
in all that VMA handling code of the different system calls
(I remember thinking that when I wrote mbind() :-). Then perhaps 
variable page support would be easier anyways because less code needs
to be changed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: RFC: Transparent Hugepage support
  2009-10-28  4:28     ` Andi Kleen
@ 2009-10-28 12:00       ` Andrea Arcangeli
  2009-10-28 14:18         ` Andi Kleen
  2009-10-29  9:43         ` Ingo Molnar
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 12:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

Hi Andi,

On Wed, Oct 28, 2009 at 05:28:05AM +0100, Andi Kleen wrote:
> I think longer term the standard VM just needs to understand
> huge pages properly. Originally when huge pages were only
> considered an "Oracle hack" the separation made sense, but now
> with more and more use that is really not true anymore.
> 
> Also hugetlbfs is gaining more and more functionality all the time.

This is exactly the problem... and these days there's not just Oracle:
KVM and glibc need hugepages all the time too, so it's not surprising
even hugetlbfs is gaining more generic functionality, but it's still an
awkward model with libhugetlbfs and it still has limitations that
prevent generic use.

> Maintaining two VMs in parallel forever seems like the wrong
> thing to do.

Agreed.

> Also the fragmentation avoidance heuristics got a lot better
> in the last years, so it's much more practical than it used to be
> (at least for 2MB)

"more practical at least for 2MB" is why I suggested to ignore sizes
like 1G. 2M is already on the large side, and in this work we should
focus on those sizes that can realistically be found on the buddy even
by pure luck on a busy system sometime, and that can be reasonably
found on the buddy if defrag heuristics shrink the cache in physical
order. 1G will never be found by pure luck, at best it could be
defragged but with a huge amount of expensive defrag relocation work
which might not be justified.

> > > The problem is that this will interact badly with 1GB pages -- once
> > > you split them up you'll never get them back, because they 
> > > can't be allocated at runtime.
> > 
> > 1GB pages can't be handled by this code, and clearly it's not
> > practical to hope 1G pages to materialize in the buddy (even if we
> 
> That seems short-sighted. You do this because 2MB pages give you
> x% performance advantage, but then it's likely that 1GB pages will give 
> another y% improvement and why should people stop at the smaller
> improvement?

For the reason mentioned above, sizes like 1G will likely remain
available only through boot-reservation and splitting an huge page to
be swapped would require a very expensive split_huge_page
function. Instead of a loop for (i=0; i<512; i++) inside a not
preemptive section with all pmd frozen, you will have a loop of
262144... And swapping 1G page natively without splitting it, is even
less feasible.

> Ignoring the gigantic pages now would just mean that this
> would need to be revised later again or that users still
> need to use hacks like libhugetlbfs.

They will still need it if they want the extra y%, because 1G pages
simply can't be generated by the buddy allocator. I doubt we should
increase the MAX_ORDER from 11 to 18; it would slow down the whole
buddy without actually giving us 1G pages in a timely manner (the
relocation work over 1G would be very expensive so not suitable for
transparent behavior).

> Given 1GB pages for a time are harder to use on the system
> administrator level, but at least for applications the interfaces
> should be similar at least.

I see your point here in wanting to use the generic interface, we
could have the page fault in the madvise vmas that fits a 1G naturally
aligned region, search into a reserved region first, and if they don't
find the 1G page reserved they could search the buddy for 2M
pages. But still the problem is there's no way to swap that 1G beast
if we go low on memory. It's not transparent behavior, but we could
share the same madvise interface, true! I doubt we should map 1G pages
in tasks outside of madvised vmas, because of that ram being special
and reserved. In some ways hugetlbfs providing for permissions in
order to take advantage of the reserved regions is better than
unprivileged madvise. Not to mention the problem of clearing and copying
of a 1G page during page fault...

For the transparent 2M pages instead, if an unprivileged user ends up
using them the whole system gains, because the more people use 2M
pages the less fragmentation there is in the system.

But if we'll get to a point where 1G pages are feasible (or we want to
obsolete hugetlbfs, which I doubt will happen until we move
transparent hugepages to tmpfs too), we can always add a
pud_trans_huge later... Frankly the 1G pages don't worry me at all for
the long term. Especially if we'll just manage them with a generic
madvise(MADV_HUGEPAGE). I don't plan to nuke hugetlbfs in the very
short term. If we get to a point where hugetlbfs has no reason to
exist anymore we just have to add pud_trans_huge before nuking it and
have do_huge_anonymous_page search the reserved 1G regions if
VM_HUGEPAGE is set.

> > were to increase the buddy so much slowing it down regular page
> > allocation). Let's forget 1G pages here... we're only focused on sizes
> > that can be allocated dynamically. Main problem are the 64k pages or
> > such that don't fit into a pmd...
> 
> What 64k pages? You're talking about soft pages or non x86?

I wasn't talking about soft pages. The whole patch here is tuned for
hugetlb. I tried to do prefault and to allocate hugepages and map them
partially with ptes (kind of softpages of size a power of 2 between 8k
and 1M both included) to avoid zeroing and copying the whole 2M during
page faults. But it's not worth it. Whenever we deal with hugepages a
huge tlb always has to be involved for this to be worth it. Otherwise
it adds even more complexity and there is not enough gain (with the
exception of speeding up the initial page fault, which is not so
important). I think those designs that preallocate hugepages and map
them partially with ptes are inefficient, overcomplex and bloated.

My worry are the archs like powerpc where a hugepage doesn't fit in a
pmd_trans_huge. I think x86 will fit the pmd/pud_trans_huge approach
in my patch even of 1G pages in the long run, so there is no actual
long term limitation with regard to x86. The fact is that the generic
pagetable code is tuned for x86 so no problem there.

What I am unsure about and worries me more are those archs that don't
use a pmd to map hugepages and to create hugetlb. I am unsure if those
archs will be able to take advantage of my patch with minor changes to
it given it is wired to pmd_trans_huge availability.

> > > Even for 2MB pages it can be a problem.
> > > 
> > > You'll likely need to fix the page table code.
> > 
> > In terms of fragmentation split_huge_page itself won't create
> > it.. unless it swaps (but then CPU performance is lost on the mapping
> > anyway).
> 
> The problem is that the performance will be lost forever. So if
> you ever do something that only does a little temporary 
> swapping (like a backup run) you would be ready for a reboot.
> Not good.

Well until the background daemon calls collapse_huge_page. Also before
splitting the page the pmd young bit is checked and it gets huge more
priority than the young bit of the pte because the pmd young bit has
512 higher probability of being set than the pte young bit.

Also note, the swapping right now generates fragmentation but later we
can add swap entries at the pmd level and we can stop calling
split_huge_page even in the swap path, to avoid swap introducing
fragmentation. But we can't do everything at once...

> >  We need to teach mprotect/mremap not to call split_huge_page
> > true, but not to avoid fragmentation. btw, thinking at fragmentation
> 
> I think they just have to be fixed properly.

Sure they do in the mid term, but just to speed up those syscalls and
to avoid them breaking the hugetlb speedup; fragmentation is not an
issue there.

> My suspicion is btw that there's some more code sharing possible
> in all that VMA handling code of the different system calls
> (I remember thinking that when I wrote mbind() :-). Then perhaps 
> variable page support would be easier anyways because less code needs
> to be changed.

Somebody worked in that direction with pagewalk.c, but one thing is to
do a readonly pagetable walk, another thing is to mangle vmas and
pmd/pte etc... So it looks hard to share there: they split vmas as a
start and then mangle ptes (and in future pmds) around.


* Re: RFC: Transparent Hugepage support
  2009-10-28 12:00       ` Andrea Arcangeli
@ 2009-10-28 14:18         ` Andi Kleen
  2009-10-28 14:54           ` Adam Litke
  2009-10-28 15:48           ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Andi Kleen @ 2009-10-28 14:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 01:00:50PM +0100, Andrea Arcangeli wrote:
> Hi Andi,
> 
> On Wed, Oct 28, 2009 at 05:28:05AM +0100, Andi Kleen wrote:
> > I think longer term the standard VM just needs to understand
> > huge pages properly. Originally when huge pages were only
> > considered an "Oracle hack" the separation made sense, but now
> > with more and more use that is really not true anymore.
> > 
> > Also hugetlbfs is gaining more and more functionality all the time.
> 
> This is exactly the problem... and these days there's not just Oracle:
> KVM and glibc need hugepages all the time too, so it's not surprising

Why glibc? 

Yes, there are quite some workloads who benefit.

> > Maintaining two VMs in parallel forever seems like the wrong
> > thing to do.
> 
> Agreed.
> 
> > Also the fragmentation avoidance heuristics got a lot better
> > in the last years, so it's much more practical than it used to be
> > (at least for 2MB)
> 
> "more practical at least for 2MB" is why I suggested to ignore sizes
> like 1G. 2M is already on the large side, and in this work we should

Even without automatic allocation and the need to prereserve,
having the same application interface for 1GB pages is still useful.
Otherwise people who want to use the 1GB pages have to do the
special hacks again.

> > x% performance advantage, but then it's likely that 1GB pages will give 
> > another y% improvement and why should people stop at the smaller
> > improvement?
> 
> For the reason mentioned above, sizes like 1G will likely remain
> available only through boot-reservation and splitting an huge page to
> be swapped would require a very expensive split_huge_page
> function. Instead of a loop for (i=0; i<512; i++) inside a not
> preemptive section with all pmd frozen, you will have a loop of
> 262144... And swapping 1G page natively without splitting it, is even
> less feasible.

What I was thinking of was to have a relatively easy to use
flag that allows an application to use prereserved GB pages
transparently. e.g. could be done with a special command

hugepagehint 1GB app

Yes I realize that this is possible to some extent with libhugetlbfs
LD_PRELOAD, but integrating it in the kernel is much saner.

So even if there are some restrictions it would be good to not
ignore the 1GB pages completely.

> 
> > Ignoring the gigantic pages now would just mean that this
> > would need to be revised later again or that users still
> > need to use hacks like libhugetlbfs.
> 
> They will still need it if they want the extra y%, because 1G pages
> simply can't be generated by the buddy allocator. I doubt we should
> increase the MAX_ORDER from 11 to 18; it would slow down the whole

Agreed, prereservation is still the way to go for 1GB.

(although in theory a special allocation could get them without
relying on zone alignment or buddy lists by being not O(1))

> > Given 1GB pages for a time are harder to use on the system
> > administrator level, but at least for applications the interfaces
> > should be similar at least.
> 
> I see your point here in wanting to use the generic interface, we
> could have the page fault in the madvise vmas that fits a 1G naturally
> aligned region, search into a reserved region first, and if they don't
> find the 1G page reserved they could search the buddy for 2M
> pages. But still the problem is there's no way to swap that 1G beast
> if we go low on memory. It's not transparent behavior, but we could

It would need an administrator hint, agreed, but if it's just a single
hint per program it would still be "mostly transparent".

> share the same madvise interface, true! I doubt we should map 1G pages
> in tasks outside of madvised vmas, because of that ram being special
> and reserved. In some ways hugetlbfs providing for permissions in

Agreed on not doing it unconditionally, but the advice could be per
process or per cgroup.

> For the transparent 2M pages instead if an unprivileged user end up
> using them the whole system gains because the more people uses 2M
> pages the less fragmentation there is in the system.

Even with 2MB pages this problem exists to some degree: if you explicitly
preallocate 2MB pages to make sure some application can use them
with hugetlbfs, you don't want random applications to steal the
"guaranteed" huge pages.

So some policy here would be likely needed anyways and the same
could be used for the 1GB pages.

> > > such that don't fit into a pmd...
> > 
> > What 64k pages? You're talking about soft pages or non x86?
> 
> I wasn't talking about soft pages. The whole patch here is tuned for

I was just confused by the 64k number.

> My worry are the archs like powerpc where a hugepage doesn't fit in a
> pmd_trans_huge. I think x86 will fit the pmd/pud_trans_huge approach
> in my patch even of 1G pages in the long run, so there is no actual
> long term limitation with regard to x86. The fact is that the generic
> pagetable code is tuned for x86 so no problem there.
> 
> What I am unsure about and worries me more are those archs that don't
> use a pmd to map hugepages and to create hugetlb. I am unsure if those
> archs will be able to take advantage of my patch with minor changes to
> it given it is wired to pmd_trans_huge availability.

I see. Some archs (like IA64 or POWER?) require special VA address
 ranges for huge pages, for those doing it fully transparent without 
a mmap time flag is likely hard.

> 
> > > > Even for 2MB pages it can be a problem.
> > > > 
> > > > You'll likely need to fix the page table code.
> > > 
> > > In terms of fragmentation split_huge_page itself won't create
> > > it.. unless it swaps (but then CPU performance is lost on the mapping
> > > anyway).
> > 
> > The problem is that the performance will be lost forever. So if
> > you ever do something that only does a little temporary 
> > swapping (like a backup run) you would be ready for a reboot.
> > Not good.
> 
> Well until the background daemon calls collapse_huge_page. Also before
> splitting the page the pmd young bit is checked, and it gets much
> higher priority than the young bit of the pte because the pmd young bit
> has a 512 times higher probability of being set than the pte young bit.
> 
> Also note, the swapping right now generates fragmentation but later we
> can add swap entries at the pmd level and we can stop calling
> split_huge_page even in the swap path, to avoid swap introducing
> fragmentation. But we can't do everything at once...

I'm still uneasy about this, it's a very clear "glass jaw"
that might well cause serious problems in practice. Anything that requires
regular reboots is bad.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 14:18         ` Andi Kleen
@ 2009-10-28 14:54           ` Adam Litke
  2009-10-28 15:13             ` Andi Kleen
                               ` (2 more replies)
  2009-10-28 15:48           ` Andrea Arcangeli
  1 sibling, 3 replies; 43+ messages in thread
From: Adam Litke @ 2009-10-28 14:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, 2009-10-28 at 15:18 +0100, Andi Kleen wrote:
> > My worry are the archs like powerpc where a hugepage doesn't fit in a
> > pmd_trans_huge. I think x86 will fit the pmd/pud_trans_huge approach
> > in my patch even of 1G pages in the long run, so there is no actual
> > long term limitation with regard to x86. The fact is that the generic
> > pagetable code is tuned for x86 so no problem there.
> > 
> > What I am unsure about and worries me more are those archs that don't
> > use a pmd to map hugepages and to create hugetlb. I am unsure if those
> > archs will be able to take advantage of my patch with minor changes to
> > it given it is wired to pmd_trans_huge availability.
> 
> I see. Some archs (like IA64 or POWER?) require special VA address
>  ranges for huge pages, for those doing it fully transparent without 
> a mmap time flag is likely hard.

PowerPC does not require specific virtual addresses for huge pages, but
does require that a consistent page size be used for each slice of the
virtual address space.  Slices are 256M in size from 0 to 4G and 1TB in
size above 1TB while huge pages are 64k, 16M, or 16G.  Unless the PPC
guys can work some more magic with their mmu, split_huge_page() in its
current form just plain won't work on PowerPC.  That doesn't even take
into account the (already discussed) page table layout differences
between x86 and ppc: http://linux-mm.org/PageTableStructure .
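
A rough sketch of the slice constraint described above, purely for
illustration (this is a simplification with assumed boundary values, not
the real powerpc slice code):

/* Illustration only: every slice must use a single page size, so a 16M
 * mapping can only live in a slice whose other mappings are 16M too. */
#include <stdio.h>

static unsigned long slice_size(unsigned long addr)
{
        return addr < (4UL << 30) ? (256UL << 20) : (1UL << 40);
}

int main(void)
{
        printf("addr 0x%lx -> slice size 0x%lx\n",
               0x10000000UL, slice_size(0x10000000UL));
        printf("addr 0x%lx -> slice size 0x%lx\n",
               0x8000000000UL, slice_size(0x8000000000UL));
        return 0;
}

Splitting a single 16M mapping back to 4k pages inside such a slice would
violate that invariant, which is why split_huge_page() as posted cannot
work there.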

-- 
Thanks,
Adam


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 14:54           ` Adam Litke
@ 2009-10-28 15:13             ` Andi Kleen
  2009-10-28 15:30               ` Andrea Arcangeli
  2009-10-29 15:59             ` Dave Hansen
  2009-10-31 21:32             ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 43+ messages in thread
From: Andi Kleen @ 2009-10-28 15:13 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andi Kleen, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

> PowerPC does not require specific virtual addresses for huge pages, but
> does require that a consistent page size be used for each slice of the
> virtual address space.  Slices are 256M in size from 0 to 4G and 1TB in
> size above 1TB while huge pages are 64k, 16M, or 16G.  Unless the PPC
> guys can work some more magic with their mmu, split_huge_page() in its
> current form just plain won't work on PowerPC.  That doesn't even take
> into account the (already discussed) page table layout differences
> between x86 and ppc: http://linux-mm.org/PageTableStructure .

it simply won't be able to use Andrea's transparent code until
someone fixes the MMU. Doesn't seem a disaster

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 15:13             ` Andi Kleen
@ 2009-10-28 15:30               ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 15:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Adam Litke, linux-mm, Marcelo Tosatti, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 04:13:02PM +0100, Andi Kleen wrote:
> it simply won't be able to use Andrea's transparent code until
> someone fixes the MMU. Doesn't seem a disaster

Well, at least we found a good reason for hugetlbfs (which forces
hugepages on the whole vma) to still exist... Even without
split_huge_page (assuming all code were hugepage aware) the
requirement is that if a hugepage allocation fails we _gracefully_
fall back to 4k allocations and mix those in the same vma with other
hugepages (so the daemon can collapse the 4k pages into a hugepage
later when hugepages become available).
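
A minimal sketch of that graceful-fallback requirement in the anonymous
fault path; the function name and return convention are assumptions for
illustration, not code from the posted patch:

/* Sketch only: try the 2M allocation first and, on failure, let the
 * regular 4k anonymous fault path handle the same vma. */
static int huge_anonymous_fault_sketch(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long address, pmd_t *pmd)
{
        struct page *page;

        page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP,
                           HPAGE_SHIFT - PAGE_SHIFT);
        if (!page)
                return -ENOMEM; /* caller falls back to 4k ptes; the
                                 * failure never reaches userland */
        /* ... clear the 2M page and set a single pmd_trans_huge pmd; the
         * collapse daemon can upgrade any 4k fallbacks later ... */
        return 0;
}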


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 14:18         ` Andi Kleen
  2009-10-28 14:54           ` Adam Litke
@ 2009-10-28 15:48           ` Andrea Arcangeli
  2009-10-28 16:03             ` Andi Kleen
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 15:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 03:18:03PM +0100, Andi Kleen wrote:
> Why glibc? 
> Yes, there are quite some workloads that benefit.

That's what I meant, I said glibc to mean not just KVM (like Chris
pointed out before ;)

> Even without automatic allocation and the need to prereserve,
> having the same application interface for 1GB pages is still useful.
> Otherwise people who want to use the 1GB pages have to do the
> special hacks again.

They will have to do the special hacks for reservation... Not many
other hacks after that, if they accept that what they reserve becomes not
swappable. Then it depends how you want to give permissions to use the
reserved areas. It's all reservation logic that you need in order to
use 1G pages with this.

> What I was thinking of was to have a relatively easy to use
> flag that allows an application to use prereserved GB pages
> transparently. e.g. could be done with a special command
> 
> hugepagehint 1GB app
> 
> Yes I realize that this is possible to some extent with libhugetlbfs
> LD_PRELOAD, but integrating it in the kernel is much saner.
> 
> So even if there are some restrictions it would be good to not
> ignore the 1GB pages completely.

I think we should ignore them in the first round of patches, knowing
this model can fit them later if we just add a reservation logic and
all pud_trans_huge. I don't think we need to provide this immediately
as it'd grow the size of the patch, but we can do it soon after. I'm
frightened by growing the patch even more, I'd rather try to get
optimal on 2M pages and only later worry about 1G pages. I think it's
higher priority to remove a couple of split_huge_page than to support
transparent gigapages given they won't be really transparent anyway.

> Agreed, prereservation is still the way to go for 1GB.

Supporting gigapages would require deciding a reservation API
now. After that, the kernel could map a 1G page if it is available and
we add pud_trans_huge all over the place. There are more urgent things,
like the collapse daemon and removing a couple of split_huge_page calls,
before I can worry about reservation APIs and about bloating the code
further with pud_trans_huge all over the place.

> Agreed on not doing it unconditionally, but the advice could be per
> process or per cgroup.

It gets more and more complicated, and this "hint" is all about
reservation, not something we want to deal with for 2M pages.

> Even with 2MB pages this problem exists to some degree: if you explicitly
> preallocate 2MB pages to make sure some application can use them
> with hugetlbfs, you don't want random applications to steal the
> "guaranteed" huge pages.

This is what the sysctl is about. You can turn off the
transparency, and then the kernel will keep mapping hugepages only
inside madvise(MADV_HUGEPAGE) regions. There is no need to reserve
anything here.

> So some policy here would be likely needed anyways and the same
> could be used for the 1GB pages.

1GB pages can't use the same logic, but again I don't think we will be
doing any additional work if we address 2M pages transparently now and
we leave the reservation required for 1G pages for later.

What I mean with ignore is not to add a requirement for merging that
1G pages are also supported, or to have to add even more logic that is
absolutely useless for 2M pages.

> I'm still uneasy about this, it's a very clear "glass jaw"
> that might well cause serious problems in practice. Anything that requires
> regular reboots is bad.

Here nothing requires reboot. If you get 2M pages good, otherwise
stick to 4k pages transparently; userland can't know. When some task
quits and a 2M page becomes available we'll just collapse the 4k pages
into the newly generated 2M pages with a background daemon. Over time we
can add more logic to try to minimize fragmentation (obviously slab needs
a front-allocator that always tries a 2M page allocation first; there
are many other things we have to do on the defrag front before we can
worry about the effect of swap calling split_huge_page). The other
syscalls that call split_huge_page, as said, won't fragment anything
physically (with the exception of munmap and madvise_dontneed if used
to truncate a hugepage).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 15:48           ` Andrea Arcangeli
@ 2009-10-28 16:03             ` Andi Kleen
  2009-10-28 16:22               ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Andi Kleen @ 2009-10-28 16:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 04:48:27PM +0100, Andrea Arcangeli wrote:
> > Even without automatic allocation and the need to prereserve,
> > having the same application interface for 1GB pages is still useful.
> > Otherwise people who want to use the 1GB pages have to do the
> > special hacks again.
> 
> They will have to do the special hacks for reservation... Not many
> other hacks after that, if they accept that what they reserve becomes not
> swappable. Then it depends how you want to give permissions to use the
> reserved areas. It's all reservation logic that you need in order to
> use 1G pages with this.

It's still a big step between just needing reservation and also
hacking the application to use new interfaces.

> > LD_PRELOAD, but integrating it in the kernel is much saner.
> > 
> > So even if there are some restrictions it would be good to not
> > ignore the 1GB pages completely.
> 
> I think we should ignore them in the first round of patches, knowing
> this model can fit them later if we just add a reservation logic and
> all pud_trans_huge. I don't think we need to provide this immediately

The design at least should not preclude them, even if the code
doesn't fully support them initially. That is why I objected earlier --
the design doesn't seem to support them.

> This is what the sysctl is about. You can turn off the
> transparency, and then the kernel will keep mapping hugepages only
> inside madvise(MADV_HUGEPAGE) regions. There is no need to reserve
> anything here.

A global sysctl seems like a quite clumsy way to do that. I hope
it would be possible to do better even with relatively simple code.

e.g. a per process flag + prctl wouldn't seem to be particularly complicated.
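
For illustration, a minimal userspace sketch of such a per-process flag;
the prctl name and value below are assumptions, nothing in this thread
defines them:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41   /* assumed constant, for the sketch only */
#endif

int main(void)
{
        /* hypothetical: opt this process out of transparent hugepages */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                perror("prctl");
        return 0;
}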

> 
> > So some policy here would be likely needed anyways and the same
> > could be used for the 1GB pages.
> 
> 1GB pages can't use the same logic, but again I don't think we will be
> doing any additional work if we address 2M pages transparently now and
> we leave the reservation required for 1G pages for later.

If there's a per process "use pre-reservation" policy that logic
could well be shared for 2MB and 1GB.

> What I mean with ignore is not to add a requirement for merging that
> 1G pages are also supported, or to have to add even more logic that is
> absolutely useless for 2M pages.

I don't think there's much (anything?) in 1GB support that's absolutely
useless for 2M. e.g. a flexible reservation policy is certainly not.

> 
> > I'm still uneasy about this, it's a very clear "glass jaw"
> > that might well cause serious problems in practice. Anything that requires
> > regular reboots is bad.
> 
> Here nothing requires reboot. If you get 2M pages good, otherwise

When the performance improvement is visible enough people will
feel the need to reboot and the practical effect will be that
Linux requires reboots for full performance.

We already have this to some extent with the kernel direct mapping
breakup over time, but this would make it much worse.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 16:03             ` Andi Kleen
@ 2009-10-28 16:22               ` Andrea Arcangeli
  2009-10-28 16:34                 ` Andi Kleen
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 16:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 05:03:52PM +0100, Andi Kleen wrote:
> It's still a big step between just needing reservation and also
> hacking the application to use new interfaces.

The word "transparent" is all about "no need of hacking the
application" because "there is no new interface".

I want to keep it as transparent as possible and to defer adding user
visible interfaces (with the exception of MADV_HUGEPAGE equivalent to
MADV_MERGEABLE for the scan daemon) initially. Even MADV_HUGEPAGE
might not be necessary, even the disable/enable global flag may not be
necessary but that is the absolute minimum tuning that seems
useful and so there's not much risk to obsolete it.

> doesn't fully support them initially. That is why I objected earlier --
> the design doesn't seem to support them.

I think it supports them once you solve the reservation, your hinting
and you add pud_trans_huge.

> A global sysctl seems like a quite clumsy way to do that. I hope
> it would be possible to do better even with relatively simple code.

btw, the sysctl has to be moved to sysfs. The same sysfs directory
will also control the background collapse_huge_page daemon.

> e.g. a per process flag + prctl wouldn't seem to be particularly complicated.

You realize we can add those _interfaces_ later _after_ adding
pud_trans_huge. I don't even want to add pud_trans_huge right
now. Adding them now would force us to be sure to get the interface
right. I don't even want to think about it.

Let's defer any not strictly necessary visible user interface for
_later_. Anything 1G pages need can be deferred later.

> If there's a per process "use pre-reservation" policy that logic
> could well be shared for 2MB and 1GB.

We don't want to have to reserve. Yes, we could reserve, but we don't
want to. We want to tell the kernel which regions have to be scanned
to recreate 2M pages with the madvise, but that's about it.

Nothing prevents us from adding an interface to reserve later, which
obviously will be mandatory for 1G pages to ever be allocated. It's
not something we need to solve now, I think.

> I don't think there's much (anything?) in 1GB support that's absolutely
> useless for 2M. e.g. a flexible reservation policy is certainly not.

I don't see KVM ever using this reservation hint, glibc neither. So
yes, you may have a corner case, but for the actual users of
transparent hugepages it seems entirely useless to me for the long
run. I may be wrong but because this is a new interface, and
transparent hugepages is all about _not_ having to modify the app at
all, we should better focus on ensuring the MADV_HUGEPAGE fits 1G
collapse_huge_page collapsing later (yeah, assuming 1G pages becomes
available and that you can hang all apps using that data for as long
as copy_page(1g)).

The whole point of ignoring 1G pages is that we know adding
pud_trans_huge later is no problem, and that it'll require userland
changes that we want to defer as an orthogonal problem, even if
it might remotely help some corner case using transparent hugepages.

> When the performance improvement is visible enough people will
> feel the need to reboot and the practical effect will be that
> Linux requires reboots for full performance.

So you think the collapse_huge_page daemon will not be enough? How
can't it be enough? If it's not enough it means the defrag logic isn't
smart enough, simply. So there's no way anything we do in this patch
can make a difference in avoiding reboots. In short, your
worry about "needing to reboot" has nothing to do with the code we're
discussing but with the ability of the VM to generate hugepages. The
collapse_huge_page daemon will do the necessary things if hugepages are
made available, without any need of a reboot. Yes, defrag is another
thing to solve, but it can be addressed separately and in parallel with
this.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 16:22               ` Andrea Arcangeli
@ 2009-10-28 16:34                 ` Andi Kleen
  2009-10-28 16:56                   ` Adam Litke
  2009-10-28 19:04                   ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Andi Kleen @ 2009-10-28 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 05:22:06PM +0100, Andrea Arcangeli wrote:
> I want to keep it as transparent as possible and to defer adding user
> visible interfaces (with the exception of MADV_HUGEPAGE equivalent to
> MADV_MERGEABLE for the scan daemon) initially. Even MADV_HUGEPAGE
> might not be necessary, even the disable/enable global flag may not be
> necessary but that is the absolute minimum tuning that seems
> useful and so there's not much risk to obsolete it.

I think you need some user visible interfaces to cleanly handle existing
reservations on a process base at least, otherwise you'll completely break 
their semantics.

sysctls that change existing semantics greatly are usually a bad idea
because what should the user do if they have existing applications
that rely on old semantics, but still want the new functionality?

> > e.g. a per process flag + prctl wouldn't seem to be particularly complicated.
> 
> You realize we can add those _interfaces_ later _after_ adding
> pud_trans_huge. I don't even want to add pud_trans_huge right

If you rely on splitting then it all won't work
for 1GB anyways and might need to be redone on the design level.
Code that's not complete is ok, but code that is known to need a 
redesign from the start is not that great.

Also completely ignoring sane reservation semantics in advance also
doesn't seem to be a particularly good way. Some way to control
this fine grained should be there at least.

> > I don't think there's much (anything?) in 1GB support that's absolutely
> > useless for 2M. e.g. a flexible reservation policy is certainly not.
> 
> I don't see KVM ever using this reservation hint, glibc neither. So

It would be set by the administrator.

> all, we should better focus on ensuring the MADV_HUGEPAGE fits 1G
> collapse_huge_page collapsing later (yeah, assuming 1G pages becomes
> available and that you can hang all apps using that data for as long
> as copy_page(1g)).

Can always schedule and check for signals during the copy.

> > When the performance improvement is visible enough people will
> > feel the need to reboot and the practical effect will be that
> > Linux requires reboots for full performance.
> 
> So you think the collapse_huge_page daemon will not be enough? How
> can't it be enough? If it's not enough it means the defrag logic isn't

I don't know how well it will hold up in practice. Only data can tell.

The problem I have is that the current "split on demand" approach 
can fragment even prereserved pages.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 16:34                 ` Andi Kleen
@ 2009-10-28 16:56                   ` Adam Litke
  2009-10-28 17:18                     ` Andi Kleen
  2009-10-28 19:04                   ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Adam Litke @ 2009-10-28 16:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, 2009-10-28 at 17:34 +0100, Andi Kleen wrote:
> On Wed, Oct 28, 2009 at 05:22:06PM +0100, Andrea Arcangeli wrote:
> > I want to keep it as transparent as possible and to defer adding user
> > visible interfaces (with the exception of MADV_HUGEPAGE equivalent to
> > MADV_MERGEABLE for the scan daemon) initially. Even MADV_HUGEPAGE
> > might not be necessary, even the disable/enable global flag may not be
> > necessary but that is the absolute minimum tuning that seems
> > useful and so there's not much risk to obsolete it.
> 
> I think you need some user visible interfaces to cleanly handle existing
> reservations on a process base at least, otherwise you'll completely break 
> their semantics.

But we already handle explicit hugepages (with page pools and strict
reservations) via hugetlbfs and libhugetlbfs.  It seems you're just
making an argument for keeping these around (which I certainly agree
with).

> sysctls that change existing semantics greatly are usually a bad idea
> because what should the user do if they have existing applications
> that rely on old semantics, but still want the new functionality?

If you want to reserve some huge pages for a specific, corner-case
application, allocate huge pages the way we do today and use
libhugetlbfs.  Meanwhile, the rest of the system can benefit from this
new interface.

-- 
Thanks,
Adam


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 16:56                   ` Adam Litke
@ 2009-10-28 17:18                     ` Andi Kleen
  0 siblings, 0 replies; 43+ messages in thread
From: Andi Kleen @ 2009-10-28 17:18 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andi Kleen, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 11:56:55AM -0500, Adam Litke wrote:
> > I think you need some user visible interfaces to cleanly handle existing
> > reservations on a process base at least, otherwise you'll completely break 
> > their semantics.
> 
> But we already handle explicit hugepages (with page pools and strict
> reservations) via hugetlbfs and libhugetlbfs.  It seems you're just
> making an argument for keeping these around (which I certainly agree
> with).

That would require not supporting reservations through the transparent
mechanism. That wouldn't be very nice semantics, because you always end
up with "glass jaw" performance in the transparent case.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 16:34                 ` Andi Kleen
  2009-10-28 16:56                   ` Adam Litke
@ 2009-10-28 19:04                   ` Andrea Arcangeli
  2009-10-28 19:22                     ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 19:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, Oct 28, 2009 at 05:34:58PM +0100, Andi Kleen wrote:
> I think you need some user visible interfaces to cleanly handle existing
> reservations on a process base at least, otherwise you'll completely break 
> their semantics.
> 
> sysctls that change existing semantics greatly are usually a bad idea
> because what should the user do if they have existing applications
> that rely on old semantics, but still want the new functionality?

What is not clear about the word "transparent"? This whole effort is
about not having to add visible interfaces, and userland won't be able
to notice (except that it runs faster). We don't want new interfaces. We
need an madvise to give a hint to the daemon about which regions are
critical to have hugepages; it's not so easy for the kernel to find
that out by itself.

The reason for the sysfs enable/disable of the "transparency" is that
embedded systems may want to disable it. Not all hardware out
there will have enough memory or enough L2 CPU cache and useful
workloads to take advantage of this, so those might (though it's not
guaranteed) save a bit of memory by disabling the feature.

In short, the fewer new interfaces we add the better, and the only one
I think is generic enough and needed enough is madvise(MADV_HUGEPAGE)
(which will tell the kernel to use hugepages even if transparent
hugepage support is disabled in sysfs, and will tell the collapse_huge_page
daemon which virtual regions to relocate into hugepages). For the time
being any additional interface would defeat the objective of not
having to modify apps.
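
A minimal userspace sketch of that single interface; the MADV_HUGEPAGE
value below is an assumption, since the constant is not in released
headers at the time of this thread:

#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14        /* assumed value, for the sketch only */
#endif

#define SIZE (64UL << 20)       /* 64M of anonymous memory */

int main(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* hint: back this region with 2M pages when possible; if the
         * allocation fails the kernel silently uses 4k pages instead */
        if (madvise(p, SIZE, MADV_HUGEPAGE))
                perror("madvise");
        p[0] = 1;       /* fault it in; userland can't tell which backing */
        return 0;
}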

> If you rely on splitting then it all won't work
> for 1GB anyways and might need to be redone on the design level.


Memory reservation is the first thing we want to remove as a requirement
to use hugepages, which is the first reason why 1G won't work anyway:
we don't want reservation in this. This is all about not having to
reserve anything at boot and not having to modify binaries at all.

1G pages can work, but they would need to split into 512 pieces, and we
can do that after my patch swaps 2M pages natively and we don't call
split_huge_page anymore. Then split_huge_page can be moved up one
level to the pud. Something like that.

Worrying about this right now is too early and not worth it so we
better ignore 1G in the transparency area.

> Code that's not complete is ok, but code that is known to need a 
> redesign from the start is not that great.

It won't need any redesign... Besides, this is only relevant if you can
manage to find a 1G page without reservation; otherwise you're better
off with hugetlbfs if you have to do magic visible to userland
that _entirely_ depends on reservation to have even a slight
chance of allocating a 1G page.

> Also completely ignoring sane reservation semantics in advance also
> doesn't seem to be a particularly good way. Some way to control
> this fine grained should be there at least.

Eliminating reservation is the first objective of the patch.

> > all, we should better focus on ensuring the MADV_HUGEPAGE fits 1G
> > collapse_huge_page collapsing later (yeah, assuming 1G pages becomes
> > available and that you can hang all apps using that data for as long
> > as copy_page(1g)).
> 
> Can always schedule and check for signals during the copy.

same is true for split_huge_page... if copy_page can work on a 1G page
then we could even split it at the pte level, but frankly I think it
would be a better fit to split the pud at the pmd level only without
having to go down to the pte.

> The problem I have is that the current "split on demand" approach 
> can fragment even prereserved pages.

1) we eliminate reservation (no prereserved pages here); 2)
split_huge_page on demand can't generate any fragmentation whatsoever
(only the swap code can then fragment the hugepage by swapping only part
of it, but you know the swap code can't swap 2M at once; it's not
split_huge_page's fault if the page is fragmented as it is swapped out;
no fragmentation happens when mprotect and mremap call split_huge_page,
however we want to optimize those for performance reasons, and
definitely not for fragmentation purposes at all)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 19:04                   ` Andrea Arcangeli
@ 2009-10-28 19:22                     ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-28 19:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

> > Can always schedule and check for signals during the copy.
> 
> same is true for split_huge_page... if copy_page can work on a 1G page
> then we could even split it at the pte level, but frankly I think it
> would be a better fit to split the pud at the pmd level only without
> having to go down to the pte.

Note that checking for signals is not enough, and even if it were enough
we would need to roll back from the middle, which is a huge
complexity... If it were so easy, copy_huge_page in hugetlb.c would be
doing it too (I mean checking for signals; obviously it
doesn't... :). It already results in livelocks: I just recently had
bug reports about it and it's not easy to fix (the fix was not to cow, to
avoid the livelock). And they were about 256M pages!!!! Not 1G
pages... ;). 256M already livelocks the application if one has to cow
256M and access 512M (not 2G!!) in a fault. Last but not least,
when cpus become fast enough to execute copy_page(1G) as fast as
copy_page(2M) runs now, we'll be running linux with PAGE_SIZE = 2M in
the first place...

Overall I liked evaluating the feasibility of pud_trans_huge and
discussing it, because one never knows: somebody may have petabytes
of memory and a cpu slow enough to require 4k pages. But the notion of
providing transparent gigapages for the time being is absurd no matter
what the design or implementation; hugetlbfs is as good as it can be for
that, I think.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28  4:28     ` Andi Kleen
@ 2009-10-29  9:43         ` Ingo Molnar
  2009-10-29  9:43         ` Ingo Molnar
  1 sibling, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2009-10-29  9:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> > 1GB pages can't be handled by this code, and clearly it's not 
> > practical to hope 1G pages to materialize in the buddy (even if we
> 
> That seems short sightened. You do this because 2MB pages give you x% 
> performance advantage, but then it's likely that 1GB pages will give 
> another y% improvement and why should people stop at the smaller 
> improvement?
> 
> Ignoring the gigantic pages now would just mean that this would need 
> to be revised later again or that users still need to use hacks like 
> libhugetlbfs.

I've read the patch and have read through this discussion and you are 
missing the big point that it's best to do such things gradually - one 
step at a time.

Just like we went from 2 level pagetables to 3 level pagetables, then to 
4 level pagetables - and we might go to 5 level pagetables in the 
future. We didnt go from 2 level pagetables to 5 level page tables in 
one go, despite predictions clearly pointing out the exponentially 
increasing need for RAM.

So your obsession with 1GB pages is misguided. If indeed transparent 
largepages give us real benefits we can extend it to do transparent 
gbpages as well - should we ever want to. There's nothing 'shortsighted' 
about being gradual - the change is already ambitious enough as-is, and 
brings very clear benefits to a difficult, decade-old problem no other 
person was able to address.

In fact introducing transparent 2MB pages makes 1GB pages support
_easier_ to merge: as at that point we'll already have a (finally..)
successful hugetlb facility happily used by an increasing range of
applications.

Hugetlbfs's big problem was always that it wasnt transparent and hence 
wasnt gradual for applications. It was an opt-in and constituted an 
interface/ABI change - that is always a big barrier to app adoption.

So i give Andrea's patch a very big thumbs up - i hope it gets reviewed 
in fine detail and added to -mm ASAP. Our lack of decent, automatic 
hugepage support is sticking out like a sore thumb and is hurting us in 
high-performance setups. If largepage support within Linux has a chance, 
this might be the way to do it.

A small comment regarding the patch itself: i think it could be 
simplified further by eliminating CONFIG_TRANSPARENT_HUGEPAGE and by 
making it a natural feature of hugepage support. If the code is correct 
i cannot see any scenario under which i wouldnt want a hugepage enabled 
kernel i'm booting to not have transparent hugepage support as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 43+ messages in thread


* Re: RFC: Transparent Hugepage support
  2009-10-29  9:43         ` Ingo Molnar
@ 2009-10-29 10:36           ` Andrea Arcangeli
  -1 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-29 10:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel

Hello Ingo, Andi, everyone,

On Thu, Oct 29, 2009 at 10:43:44AM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > 1GB pages can't be handled by this code, and clearly it's not 
> > > practical to hope 1G pages to materialize in the buddy (even if we
> > 
> > That seems short sightened. You do this because 2MB pages give you x% 
> > performance advantage, but then it's likely that 1GB pages will give 
> > another y% improvement and why should people stop at the smaller 
> > improvement?
> > 
> > Ignoring the gigantic pages now would just mean that this would need 
> > to be revised later again or that users still need to use hacks like 
> > libhugetlbfs.
> 
> I've read the patch and have read through this discussion and you are 
> missing the big point that it's best to do such things gradually - one 
> step at a time.
> 
> Just like we went from 2 level pagetables to 3 level pagetables, then to 
> 4 level pagetables - and we might go to 5 level pagetables in the 
> future. We didnt go from 2 level pagetables to 5 level page tables in 
> one go, despite predictions clearly pointing out the exponentially 
> increasing need for RAM.

I totally agree with your assessment.

> So your obsession with 1GB pages is misguided. If indeed transparent 
> largepages give us real benefits we can extend it to do transparent 
> gbpages as well - should we ever want to. There's nothing 'shortsighted' 
> about being gradual - the change is already ambitious enough as-is, and 
> brings very clear benefits to a difficult, decade-old problem no other 
> person was able to address.
> 
> In fact introducing transparent 2MB pages makes 1GB pages support
> _easier_ to merge: as at that point we'll already have a (finally..)
> successful hugetlb facility happily used by an increasing range of
> applications.

Agreed.

> Hugetlbfs's big problem was always that it wasnt transparent and hence 
> wasnt gradual for applications. It was an opt-in and constituted an 
> interface/ABI change - that is always a big barrier to app adoption.
> 
> So i give Andrea's patch a very big thumbs up - i hope it gets reviewed 
> in fine detail and added to -mm ASAP. Our lack of decent, automatic 
> hugepage support is sticking out like a sore thumb and is hurting us in 
> high-performance setups. If largepage support within Linux has a chance, 
> this might be the way to do it.

Thanks a lot for your review!

> A small comment regarding the patch itself: i think it could be 
> simplified further by eliminating CONFIG_TRANSPARENT_HUGEPAGE and by 
> making it a natural feature of hugepage support. If the code is correct 
> i cannot see any scenario under which i wouldnt want a hugepage enabled 
> kernel i'm booting to not have transparent hugepage support as well.

The two reasons why I added a config option are:

1) because it was easy enough: gcc is smart enough to eliminate the
external calls, so I didn't need to add ifdefs, with the exception of
returning 0 from pmd_trans_huge and pmd_trans_frozen (a sketch of such
stubs is shown below). I only had to make the exports of huge_memory.c
visible unconditionally so it doesn't warn; after that I don't need to
build and link huge_memory.o.

2) to avoid breaking the build on archs that don't implement
pmd_trans_huge and that may never be able to take advantage of it

But we could move CONFIG_TRANSPARENT_HUGEPAGE to an arch define forced
to Y on x86-64 and N on power.
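
To illustrate point 1 above, a minimal sketch of what the disabled-case
stubs could look like (the exact form is an assumption, not a quote from
the posted patch):

/* Sketch: with the config option off, the predicates become constant 0,
 * so gcc can drop the huge_memory.c call sites and the object file does
 * not even need to be built or linked. */
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd)     0
#define pmd_trans_frozen(pmd)   0
#endif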

^ permalink raw reply	[flat|nested] 43+ messages in thread


* Re: RFC: Transparent Hugepage support
  2009-10-27 19:30   ` Andrea Arcangeli
  2009-10-28  4:28     ` Andi Kleen
@ 2009-10-29 12:54     ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-10-29 12:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, Oct 27, 2009 at 08:30:07PM +0100, Andrea Arcangeli wrote:
> generated by mnmap(last4k) I also think I found a minor bug in munmap
> if a partial part of the 2M page is unmapped (currently I'm afraid I'm

Here is the incremental fix, if somebody is running the patch (this also
frees the page after the tlb flush; during development I had to call
put_page to debug something and I forgot to change it back to
tlb_remove_page ;). In the meantime I'll try to split the patch into
more self-contained pieces for easier review (including the kvm patch
to build 3-level NPT/EPT).

Then I will continue working on the daemon that will
collapse_huge_page in the madvise(MADV_HUGEPAGE) regions.

This printed 0 before; now it prints 0xff, as it does with transparent
hugepage support disabled.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE (2*1024*1024)

int main()
{
	char *p = malloc(SIZE*2-1);
	p = (char *)((unsigned long)(p + SIZE - 1) & ~(SIZE-1));
	*p = 0xff;
	munmap(p+SIZE-4096, 4096);
	printf("%x\n", *(unsigned char *)p);

	return 0;
}


Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -850,6 +850,9 @@ extern struct page *follow_trans_huge_pm
 					  unsigned long addr,
 					  pmd_t *pmd,
 					  unsigned int flags);
+extern int zap_pmd_trans_huge(struct mmu_gather *tlb,
+			      struct vm_area_struct *vma,
+			      pmd_t *pmd);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			  pmd_t *dst_pmd, pmd_t *src_pmd,
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -292,8 +292,9 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  See Documentation/nommu-mmap.txt for more information.
 
 config TRANSPARENT_HUGEPAGE
-	bool "Transparent Hugepage support"
+	bool "Transparent Hugepage support" if EMBEDDED
 	depends on X86_64
+	default y
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
 	  huge tlb transparently to the applications whenever possible.
@@ -302,4 +303,4 @@ config TRANSPARENT_HUGEPAGE
 	  allocation, by reducing the number of tlb misses and by speeding
 	  up the pagetable walking.
 
-	  If unsure, say N.
+	  If memory constrained on embedded, you may want to say N.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,6 +12,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
 
@@ -374,3 +375,37 @@ struct page *follow_trans_huge_pmd(struc
 out:
 	return page;
 }
+
+int zap_pmd_trans_huge(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		       pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_frozen(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pfn_to_page(pmd_pfn(*pmd));
+			VM_BUG_ON(!PageCompound(page));
+			pmd_clear(pmd);
+			spin_unlock(&tlb->mm->page_table_lock);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			add_mm_counter(tlb->mm, anon_rss,
+				       -1<<(HPAGE_SHIFT-
+					    PAGE_SHIFT));
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -913,32 +913,12 @@ static inline unsigned long zap_pmd_rang
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
-			spin_lock(&tlb->mm->page_table_lock);
-			if (likely(pmd_trans_huge(*pmd))) {
-				if (unlikely(pmd_trans_frozen(*pmd))) {
-					spin_unlock(&tlb->mm->page_table_lock);
-					wait_split_huge_page(vma->anon_vma,
-							     pmd);
-				} else {
-					struct page *page;
-					pgtable_t pgtable;
-					pgtable = get_pmd_huge_pte(tlb->mm);
-					page = pfn_to_page(pmd_pfn(*pmd));
-					VM_BUG_ON(!PageCompound(page));
-					pmd_clear(pmd);
-					spin_unlock(&tlb->mm->page_table_lock);
-					page_remove_rmap(page);
-					VM_BUG_ON(page_mapcount(page) < 0);
-					add_mm_counter(tlb->mm, anon_rss,
-						       -1<<(HPAGE_SHIFT-
-							    PAGE_SHIFT));
-					put_page(page);
-					pte_free(tlb->mm, pgtable);
-					(*zap_work)--;
-					continue;
-				}
-			} else
-				spin_unlock(&tlb->mm->page_table_lock);
+			if (next-addr != HPAGE_SIZE)
+				split_huge_page_vma(vma, pmd);
+			else if (zap_pmd_trans_huge(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(pmd)) {


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 14:54           ` Adam Litke
  2009-10-28 15:13             ` Andi Kleen
@ 2009-10-29 15:59             ` Dave Hansen
  2009-10-31 21:32             ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2009-10-29 15:59 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andi Kleen, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, 2009-10-28 at 09:54 -0500, Adam Litke wrote:
> PowerPC does not require specific virtual addresses for huge pages, but
> does require that a consistent page size be used for each slice of the
> virtual address space.  Slices are 256M in size from 0 to 4G and 1TB in
> size above 1TB while huge pages are 64k, 16M, or 16G.  Unless the PPC
> guys can work some more magic with their mmu, split_huge_page() in its
> current form just plain won't work on PowerPC.

One answer, at least in the beginning, would be to just ignore this
detail.  Try to make 16MB pages wherever possible, probably even as 16MB
pages in the Linux pagetables.  But, we can't promote the MMU to use
them until we get a 256MB or 1TB chunk.  It will definitely mean some
ppc-specific bits when we're changing the segment mapping size, but it's
not impossible.

That's not going to do any good for the desktop-type users.  But, it
should be just fine for the HPC or JVM folks.  It restricts the users
pretty severely, but it gives us *something*.

There will be some benefit to using a 16MB Linux page and pte even if we
can't back it with 16MB MMU pages, anyway.  Remember, a big chunk of the
benefit of using 64k pages can be seen even on systems with no 64k
hardware pages.

-- Dave
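
A rough sketch of the slice arithmetic being described, assuming the
256MB-below-4GB / 1TB-above-1TB layout from Adam's description (the
constants and the slice_index() helper are illustrative, not the real
ppc64 code, and the 4GB-1TB gap is glossed over here just as it is above):

#include <stdint.h>

#define SLICE_LOW_SHIFT		28	/* 256MB slices in the low 4GB */
#define SLICE_HIGH_SHIFT	40	/* 1TB slices above that */

static uint64_t slice_index(uint64_t addr)
{
	if (addr < (1ULL << 32))
		return addr >> SLICE_LOW_SHIFT;		/* 0..15 */
	return (1ULL << (32 - SLICE_LOW_SHIFT)) +
	       (addr >> SLICE_HIGH_SHIFT);		/* 16, 17, ... */
}

All mappings sharing a slice_index() must use one page size, which is why
splitting a single 16MB page back to 4k cannot be done while the rest of
the slice stays huge.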


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-29 10:36           ` Andrea Arcangeli
@ 2009-10-29 16:50             ` Mike Travis
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Travis @ 2009-10-29 16:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel, Karl Feind, Jack Steiner

Hi Andrea,

I will find some time soon to test out your patch on a
(relatively) huge machine and let you know the results.

The memory size on this machine:

	480,700,399,616 bytes of system memory tested OK

This translates to ~240k available 2MB pages.

Thanks,
Mike

Andrea Arcangeli wrote:
> Hello Ingo, Andi, everyone,
> 
> On Thu, Oct 29, 2009 at 10:43:44AM +0100, Ingo Molnar wrote:
>> * Andi Kleen <andi@firstfloor.org> wrote:
>>
>>>> 1GB pages can't be handled by this code, and clearly it's not 
>>>> practical to hope 1G pages to materialize in the buddy (even if we
>>> That seems shortsighted. You do this because 2MB pages give you x% 
>>> performance advantage, but then it's likely that 1GB pages will give 
>>> another y% improvement and why should people stop at the smaller 
>>> improvement?
>>>
>>> Ignoring the gigantic pages now would just mean that this would need 
>>> to be revised later again or that users still need to use hacks like 
>>> libhugetlbfs.
>> I've read the patch and have read through this discussion and you are 
>> missing the big point that it's best to do such things gradually - one 
>> step at a time.
>>
>> Just like we went from 2 level pagetables to 3 level pagetables, then to 
>> 4 level pagetables - and we might go to 5 level pagetables in the 
>> future. We didnt go from 2 level pagetables to 5 level page tables in 
>> one go, despite predictions clearly pointing out the exponentially 
>> increasing need for RAM.
> 
> I totally agree with your assessment.
> 
>> So your obsession with 1GB pages is misguided. If indeed transparent 
>> largepages give us real benefits we can extend it to do transparent 
>> gbpages as well - should we ever want to. There's nothing 'shortsighted' 
>> about being gradual - the change is already ambitious enough as-is, and 
>> brings very clear benefits to a difficult, decade-old problem no other 
>> person was able to address.
>>
>> In fact introducing transparent 2MBpages makes 1GB pages support 
>> _easier_ to merge: as at that point we'll already have a (finally..) 
>> successful hugetlb facility happily used by an increasing range of 
>> applications.
> 
> Agreed.
> 
>> Hugetlbfs's big problem was always that it wasnt transparent and hence 
>> wasnt gradual for applications. It was an opt-in and constituted an 
>> interface/ABI change - that is always a big barrier to app adoption.
>>
>> So i give Andrea's patch a very big thumbs up - i hope it gets reviewed 
>> in fine detail and added to -mm ASAP. Our lack of decent, automatic 
>> hugepage support is sticking out like a sore thumb and is hurting us in 
>> high-performance setups. If largepage support within Linux has a chance, 
>> this might be the way to do it.
> 
> Thanks a lot for your review!
> 
>> A small comment regarding the patch itself: i think it could be 
>> simplified further by eliminating CONFIG_TRANSPARENT_HUGEPAGE and by 
>> making it a natural feature of hugepage support. If the code is correct 
>> i cannot see any scenario under which i wouldnt want a hugepage enabled 
>> kernel i'm booting to not have transparent hugepage support as well.
> 
> The two reasons why I added a config option are:
> 
> 1) because it was easy enough, gcc is smart enough to eliminate the
> external calls so I didn't need to add ifdefs with the exception of
> returning 0 from pmd_trans_huge and pmd_trans_frozen. I only had to
> make the exports of huge_memory.c visible unconditionally so it doesn't
> warn, after that I don't need to build and link huge_memory.o.
> 
> 2) to avoid breaking build of archs not implementing pmd_trans_huge
> and that may never be able to take advantage of it
> 
> But we could move CONFIG_TRANSPARENT_HUGEPAGE to an arch define forced
> to Y on x86-64 and N on power.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-27 20:25     ` Chris Wright
@ 2009-10-29 18:51       ` Christoph Lameter
  2009-11-01 10:56         ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2009-10-29 18:51 UTC (permalink / raw)
  To: Chris Wright
  Cc: Andrea Arcangeli, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, 27 Oct 2009, Chris Wright wrote:

> > Yes, swapping is deadly to performance based loads and it should be
> > avoided as much as possible, but it's not nice when in order to get a
> > boost in guest performance when the host isn't low on memory, you lose
> > the ability to swap when the host is low on memory and all VM are
> > locked in memory like in inferior-design virtual machines that won't
> > ever support paging. When system starts swapping the manager can
> > migrate the VM to other hosts with more memory free to restore the
> > full RAM performance as soon as possible. Overcommit can be very
> > useful at maxing out RAM utilization, just like it happens for regular
> > linux tasks (few people run with overcommit = 2 for this very
> > reason.. besides overcommit = 2 includes swap in its equation so you
> > can still max out ram by adding more free swap).
>
> It's also needed if something like glibc were to take advantage of it in
> a generic manner.

How would glibc do that?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-27 18:21   ` Andrea Arcangeli
  2009-10-27 20:25     ` Chris Wright
@ 2009-10-29 18:55     ` Christoph Lameter
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Lameter @ 2009-10-29 18:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, 27 Oct 2009, Andrea Arcangeli wrote:

> Agreed, migration is important on numa systems as much as swapping is
> important on regular hosts, and this patch allows both in the very
> same way with a few-line addition (that is a noop and doesn't modify
> the kernel binary when CONFIG_TRANSPARENT_HUGEPAGE=N). The hugepages
> in this patch should already be relocatable just fine with move_pages (I
> say "should" because I didn't test move_pages yet ;).

Another NUMA issue is how MPOL_INTERLEAVE would work with this.
MPOL_INTERLEAVE would cause the spreading of a sequence of pages over a
series of nodes. If you coalesce to one huge page then that cannot be done
anymore.
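
For reference, the userspace side of that concern looks roughly like the
sketch below: MPOL_INTERLEAVE asks for consecutive pages of the range to
be spread across the nodes in the mask, which a single coalesced 2MB page
cannot honour at 4k granularity. Sketch only; error handling is trimmed,
nodes 0 and 1 are assumed to exist, and mbind() comes from libnuma's
<numaif.h>:

#include <numaif.h>	/* mbind(), MPOL_INTERLEAVE; link with -lnuma */
#include <sys/mman.h>
#include <stddef.h>

static void *interleaved_region(size_t len)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 1);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		mbind(p, len, MPOL_INTERLEAVE, &nodemask,
		      8 * sizeof(nodemask), 0);
	return p;
}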


> > Wont you be running into issues with page dirtying on that level?
>
> Not sure I follow what the problem should be. At the moment when
> pmd_trans_huge is true, the dirty bit is meaningless (hugepages at the
> moment are splitted in place into regular pages before they can be
> converted to swapcache, only after an hugepage becomes swapcache its
> dirty bit on the pte becomes meaningful to handle the case of an
> exclusive swapcache mapped writeable into a single pte and marked
> clean to be able to swap it out at zerocost if memory pressure returns
> and to avoid a cow if the page is written to before it is paged out
> again), but the accessed bit is already handled just fine at the pmd
> level.

May not be a problem as long as you don't allow fs operations with these
pages.

> > Those also had fall back logic to 4k. Does this scheme also allow I/O with
>
> Well maybe I remember your patches wrong, or I might not have followed
> later developments, but I seem to remember from when we discussed it
> that the reason for the -EIO failure was that the fs had a softblocksize
> bigger than 4k... and in general an fs can't handle a blocksize bigger
> than the PAGE_CACHE_SIZE... In effect the core trouble wasn't the large
> pagecache but the fact the fs wanted a blocksize larger than
> PAGE_SIZE, despite not being able to handle it if the block was
> split into multiple non-contiguous 4k areas.

The patches modified the page cache logic to determine the page size from
the page structs.

> > I dont get the point of this. What do you mean by "an operation that
> > cannot fail"? Atomic section?
>
> In short I mean it cannot return -ENOMEM (and an additional bonus is
> that I managed to make it not require scheduling or blocking
> operations). The idea is that you can plug it anywhere with a
> one-liner and your code becomes hugepage-compatible (sure it would run
> faster if you were to teach your code to handle pmd_trans_huge
> natively but we can't do it all at once :).

We need to know some more detail about the conversion.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-29 10:36           ` Andrea Arcangeli
@ 2009-10-30  0:40             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-30  0:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel

On Thu, 29 Oct 2009 11:36:58 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> > A small comment regarding the patch itself: i think it could be 
> > simplified further by eliminating CONFIG_TRANSPARENT_HUGEPAGE and by 
> > making it a natural feature of hugepage support. If the code is correct 
> > i cannot see any scenario under which i wouldnt want a hugepage enabled 
> > kernel i'm booting to not have transparent hugepage support as well.
> 
> The two reasons why I added a config option are:
> 
> 1) because it was easy enough, gcc is smart enough to eliminate the
> external calls so I didn't need to add ifdefs with the exception of
> returning 0 from pmd_trans_huge and pmd_trans_frozen. I only had to
> make the exports of huge_memory.c visible unconditionally so it doesn't
> warn, after that I don't need to build and link huge_memory.o.
> 
> 2) to avoid breaking build of archs not implementing pmd_trans_huge
> and that may never be able to take advantage of it
> 
> But we could move CONFIG_TRANSPARENT_HUGEPAGE to an arch define forced
> to Y on x86-64 and N on power.

Ah, please keep CONFIG_TRANSPARENT_HUGEPAGE for a while.
Now, memcg doesn't handle hugetlbfs because it's special and cannot be freed by
the kernel; only users can free it. But this new transparent-hugepage seems to
be designed so that the kernel can free it for memory reclaiming.
So, I'd like to handle this in memcg transparently.

But it seems I need several changes to support this new rule.
I'd be glad if this new huge page depended on !CONFIG_CGROUP_MEM_RES_CTRL for a
while.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-26 18:51 RFC: Transparent Hugepage support Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2009-10-27 20:42 ` Christoph Lameter
@ 2009-10-31 21:29 ` Benjamin Herrenschmidt
  2009-11-03 11:18   ` Andrea Arcangeli
  3 siblings, 1 reply; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2009-10-31 21:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Mon, 2009-10-26 at 19:51 +0100, Andrea Arcangeli wrote:
> Hello,
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing

This isn't possible on all architectures. Some archs have "segment"
constraints which mean only one page size per such "segment". Server
ppc's for example (segment size being either 256M or 1T depending on the
CPU).

> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel deamon if the order=HPAGE_SHIFT-PAGE_SHIFT list becomes not
>    null)
> 
> The first (and more tedious) part of this work requires allowing the
> VM to handle anonymous hugepages mixed with regular pages
> transparently on regular anonymous vmas. This is what this patch tries
> to achieve in the least intrusive possible way. We want hugepages and
> hugetlb to be used in a way so that all applications can benefit
> without changes (as usual we leverage the KVM virtualization design:
> by improving the Linux VM at large, KVM gets the performance boost too).
> 
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...

Precisely because the approach cannot work on all architectures ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-28 14:54           ` Adam Litke
  2009-10-28 15:13             ` Andi Kleen
  2009-10-29 15:59             ` Dave Hansen
@ 2009-10-31 21:32             ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2009-10-31 21:32 UTC (permalink / raw)
  To: Adam Litke
  Cc: Andi Kleen, Andrea Arcangeli, linux-mm, Marcelo Tosatti,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Wed, 2009-10-28 at 09:54 -0500, Adam Litke wrote:
> 
> PowerPC does not require specific virtual addresses for huge pages, but
> does require that a consistent page size be used for each slice of the
> virtual address space.  Slices are 256M in size from 0 to 4G and 1TB in
> size above 1TB while huge pages are 64k, 16M, or 16G.  Unless the PPC
> guys can work some more magic with their mmu, split_huge_page() in its
> current form just plain won't work on PowerPC.  That doesn't even take
> into account the (already discussed) page table layout differences
> between x86 and ppc: http://linux-mm.org/PageTableStructure . 

Note: this is server powerpc's. Embedded ones are more flexible but on
server we have this limitation and not much we can do about it.

Note also that the "slice" sizes are a SW thing. HW segments are either
256M or 1T (the latter being supported only on some processors), and
linux maintains that concept of "slices" in order to simplify the
tracking of said segments and to use 1T when available.

Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-29 18:51       ` Christoph Lameter
@ 2009-11-01 10:56         ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-11-01 10:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Wright, linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

Hello Christoph,

On Thu, Oct 29, 2009 at 02:51:11PM -0400, Christoph Lameter wrote:
> How would glibc do that?

So the first important thing is to start the mapping at a 2M-aligned
virtual address. The second important thing is to always do sbrk
increments in 2M chunks and mmap extensions in 2M chunks with a mmap
on the next 2M (mremap right now calls split_huge_page, but later we
will teach mremap and mprotect not to call split_huge_page and to
handle pmd_trans_huge natively so they can run a bit faster).

With those two precautions all page faults will be guaranteed to map
2M pages if they're available (and then fragmentation will decrease
too, as 2M pages will be retained in the mappings).

Even after split_huge_page, the ptes will point into the same 2M page
as before. And when the task is killed all 2M pages will be recombined
in the buddy. Page coloring will also still be guaranteed (up to the
512th color of course) even after split_huge_page has run.

But if split_huge_page is called because munmap is unmapping just a 4k
piece of a 2M page, only that 4k piece will be freed, so fragmentation
will be created. So the last precaution glibc should take is to munmap
(or madvise_dontneed) in naturally aligned 2M chunks, to make sure not
to create unnecessary fragmentation.
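
A minimal userspace sketch of the first precaution, assuming a 2MB huge
page size; alloc_2m_aligned() is a made-up helper name (glibc would do
the equivalent internally for its heaps and for sbrk/mmap growth):

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_2M	(2UL * 1024 * 1024)

static void *alloc_2m_aligned(size_t len)
{
	/* round the request up to a whole number of 2M pages */
	size_t want = (len + HPAGE_2M - 1) & ~(HPAGE_2M - 1);
	/* over-reserve so a 2M-aligned start can always be found */
	size_t slack = want + HPAGE_2M;
	char *raw = mmap(NULL, slack, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;

	if (raw == MAP_FAILED)
		return NULL;
	aligned = (char *)(((uintptr_t)raw + HPAGE_2M - 1) & ~(HPAGE_2M - 1));
	/* give back the unaligned head and the unused tail */
	if (aligned > raw)
		munmap(raw, aligned - raw);
	munmap(aligned + want, raw + slack - (aligned + want));
	return aligned;
}

Growing and shrinking the region only in naturally aligned 2M units (the
other precautions above) then keeps every fault and every munmap on a
whole huge pmd.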


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-30  0:40             ` KAMEZAWA Hiroyuki
@ 2009-11-03 10:55               ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-11-03 10:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ingo Molnar, Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel

On Fri, Oct 30, 2009 at 09:40:37AM +0900, KAMEZAWA Hiroyuki wrote:
> Ah, please keep CONFIG_TRANSPARENT_HUGEPAGE for a while.
> Now, memcg doesn't handle hugetlbfs because it's special and cannot be freed by
> the kernel; only users can free it. But this new transparent-hugepage seems to
> be designed so that the kernel can free it for memory reclaiming.
> So, I'd like to handle this in memcg transparently.
> 
> But it seems I need several changes to support this new rule.
> I'd be glad if this new huge page depended on !CONFIG_CGROUP_MEM_RES_CTRL for a
> while.

Yeah the accounting (not just memcg) should be checked.. I didn't pay
too much attention to stats at this point.

But we want to fix it fast instead of making the two options mutually
exclusive.. Where are the pages de-accounted when they are freed?
Accounting seems to require just two one liners
calling mem_cgroup_newpage_charge.
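
A rough sketch of where those two call sites could sit in the huge-page
fault path. The function below and its flow are hypothetical (made up for
illustration); alloc_pages(), mem_cgroup_newpage_charge() and
__free_pages() are the existing interfaces, and the uncharge then happens
on the free side via page_remove_rmap():

static int huge_anon_fault_sketch(struct mm_struct *mm,
				  struct vm_area_struct *vma,
				  unsigned long address, pmd_t *pmd)
{
	struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP,
					HPAGE_SHIFT - PAGE_SHIFT);

	if (!page)
		return VM_FAULT_OOM;	/* caller falls back to 4k pages */
	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
		__free_pages(page, HPAGE_SHIFT - PAGE_SHIFT);
		return VM_FAULT_OOM;
	}
	/*
	 * ... clear the hugepage, take page_table_lock, install the huge
	 * pmd and bump anon_rss by 1 << (HPAGE_SHIFT - PAGE_SHIFT) ...
	 */
	return 0;
}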

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-10-31 21:29 ` Benjamin Herrenschmidt
@ 2009-11-03 11:18   ` Andrea Arcangeli
  2009-11-03 19:10     ` Dave Hansen
  2009-11-04  4:10     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2009-11-03 11:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> This isn't possible on all architectures. Some archs have "segment"
> constraints which mean only one page size per such "segment". Server
> ppc's for example (segment size being either 256M or 1T depending on the
> CPU).

Hmm 256M is already too large for a transparent allocation. It will
require reservation and hugetlbfs to me actually seems a perfect fit
for this hardware limitation. The software limits of hugetlbfs match
the hardware limit perfectly and it already provides all necessary
permission and reservation features needed to deal with extremely huge
page sizes that probabilistically would never be found in the buddy
(even if we were to extend it to make it not impossible). Those are
hugely expensive to defrag dynamically even if we could [and we can't
hope to defrag many of those because of slab]. Just in case it's not
obvious the probability we can defrag degrades exponentially with the
increase of the hugepagesize (which also means 256M is already orders
of magnitude more realistic to function than 1G). Clearly if we
increase slab to allocate with a front allocator in 256M chunk then
our probability increases substantially, but to make something
realistic there's at minimum an order of 10000 times between
hugepagesize and total ram size. I.e. if 2M page makes some
probabilistic sense with slab front-allocating 2M pages on a 64G
system, for 256M pages to make an equivalent sense, system would
require a minimum of 8 Terabytes of ram. If pages were 1G sized system would
require 32 Terabyte of ram (and the bigger overhead and trouble we
would have considering some allocation would still happen in 4k ptes
and the fixed overhead of relocating those 4k ranges would be much
bigger if the hugepage size is a lot bigger than 2M and the regular
page size is still 4k).
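
The ratio argument worked out with the same numbers (illustrative
arithmetic only):

#include <stdio.h>

int main(void)
{
	unsigned long long mb = 1ULL << 20, gb = 1ULL << 30;
	unsigned long long ratio = (64 * gb) / (2 * mb);	/* 32768 */

	printf("64G ram / 2M hugepages -> ratio %llu\n", ratio);
	printf("256M hugepages at that ratio -> %llu TB of ram\n",
	       256 * mb * ratio / (1024 * gb));			/* 8 */
	printf("1G hugepages at that ratio   -> %llu TB of ram\n",
	       1ULL * gb * ratio / (1024 * gb));		/* 32 */
	return 0;
}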

> > The most important design choice is: always fallback to 4k allocation
> > if the hugepage allocation fails! This is the _very_ opposite of some
> > large pagecache patches that failed with -EIO back then if a 64k (or
> > similar) allocation failed...
> 
> Precisely because the approach cannot work on all architectures ?

I thought the main reason for those patches was to allow a fs
blocksize bigger than PAGE_SIZE, a PAGE_CACHE_SIZE of 64k would allow
for a 64k fs blocksize without much fs changes. But yes, if the mmu
can't fallback, then software can't fallback either and so it impedes
the transparent design on those architectures... To me hugetlbfs looks
as best as you can get on those mmu.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-11-03 11:18   ` Andrea Arcangeli
@ 2009-11-03 19:10     ` Dave Hansen
  2009-11-04  4:10     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2009-11-03 19:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, 2009-11-03 at 12:18 +0100, Andrea Arcangeli wrote:
> On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> > This isn't possible on all architectures. Some archs have "segment"
> > constraints which mean only one page size per such "segment". Server
> > ppc's for example (segment size being either 256M or 1T depending on the
> > CPU).
> 
> Hmm 256M is already too large for a transparent allocation. It will
> require reservation and hugetlbfs to me actually seems a perfect fit
> for this hardware limitation. The software limits of hugetlbfs match
> the hardware limit perfectly and it already provides all necessary
> permission and reservation features needed to deal with extremely huge
> page sizes that probabilistically would never be found in the buddy
> (even if we were to extend it to make it not impossible).

POWER is pretty unusual in its mmu.  These 256MB (or 1TB) segments are
the granularity with which we must make the choice about page size, but
they *aren't* the page size itself.

We can fill that 256MB segment with any 16MB pages from all over the
physical address space, but we just can't *mix* 4k and 16MB mappings in
the same 256MB virtual area.

16*16MB pages are going to be hard to get, but they are much much easier
to get than 1 256MB page.  But, remember that most ppc64 systems have a
64k page, so the 16MB page is actually only an order-8 allocation.
x86-64's huge pages are order-9.  So, it sucks, but allocating the pages
themselves isn't that big of an issue.  The hard part is getting a big
enough virtual stretch of them together without any small pages in the
segment.

-- Dave
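
The order arithmetic behind that point, in the patch's own
HPAGE_SHIFT - PAGE_SHIFT terms (standalone illustration, shift values
hard-coded for the two configurations mentioned):

#include <stdio.h>

int main(void)
{
	/* ppc64 with 64k base pages and 16MB huge pages */
	int ppc64_order  = 24 - 16;	/* HPAGE_SHIFT - PAGE_SHIFT = 8 */
	/* x86-64 with 4k base pages and 2MB huge pages */
	int x86_64_order = 21 - 12;	/* HPAGE_SHIFT - PAGE_SHIFT = 9 */

	printf("ppc64  16MB / 64k -> order-%d (%d base pages)\n",
	       ppc64_order, 1 << ppc64_order);
	printf("x86-64  2MB /  4k -> order-%d (%d base pages)\n",
	       x86_64_order, 1 << x86_64_order);
	return 0;
}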


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-11-03 10:55               ` Andrea Arcangeli
@ 2009-11-04  0:36                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 43+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-04  0:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andi Kleen, linux-mm, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Izik Eidus, Hugh Dickins, Nick Piggin, Andrew Morton,
	linux-kernel

On Tue, 3 Nov 2009 11:55:43 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Fri, Oct 30, 2009 at 09:40:37AM +0900, KAMEZAWA Hiroyuki wrote:
> > Ah, please keep CONFIG_TRANSPARENT_HUGEPAGE for a while.
> > Now, memcg doesn't handle hugetlbfs because it's special and cannot be freed by
> > the kernel; only users can free it. But this new transparent-hugepage seems to
> > be designed so that the kernel can free it for memory reclaiming.
> > So, I'd like to handle this in memcg transparently.
> > 
> > But it seems I need several changes to support this new rule.
> > I'd be glad if this new huge page depended on !CONFIG_CGROUP_MEM_RES_CTRL for a
> > while.
> 
> Yeah the accounting (not just memcg) should be checked.. I didn't pay
> too much attention to stats at this point.
> 
> But we want to fix it fast instead of making the two options mutually
> exclusive.. Where are the pages de-accounted when they are freed?

It's de-accounted at page_remove_rmap() in the typical anon case.
But the swap-cache/batched-uncharge related part is complicated, maybe.
...because of me ;(

Okay, I don't request !CONFIG_CGROUP_MEM_RES_CTRL; I'd be glad if you CC me.

> Accounting seems to require just two one liners
> calling mem_cgroup_newpage_charge.
Yes, maybe so.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: RFC: Transparent Hugepage support
  2009-11-03 11:18   ` Andrea Arcangeli
  2009-11-03 19:10     ` Dave Hansen
@ 2009-11-04  4:10     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2009-11-04  4:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Marcelo Tosatti, Adam Litke, Avi Kivity, Izik Eidus,
	Hugh Dickins, Nick Piggin, Andrew Morton

On Tue, 2009-11-03 at 12:18 +0100, Andrea Arcangeli wrote:
> On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> > This isn't possible on all architectures. Some archs have "segment"
> > constraints which mean only one page size per such "segment". Server
> > ppc's for example (segment size being either 256M or 1T depending on the
> > CPU).
> 
> Hmm 256M is already too large for a transparent allocation. 

Right.

> It will
> require reservation and hugetlbfs to me actually seems a perfect fit
> for this hardware limitation. The software limits of hugetlbfs match
> the hardware limit perfectly and it already provides all necessary
> permission and reservation features needed to deal with extremely huge
> page sizes that probabilistically would never be found in the buddy
> (even if we were to extend it to make it not impossible). 

Yes. Note that powerpc -embedded- processors don't have that limitation
though (in large part because they are mostly SW loaded TLBs and they
support a wider collection of page sizes). So it would be possible to
implement your transparent scheme on those.

> Those are
> hugely expensive to defrag dynamically even if we could [and we can't
> hope to defrag many of those because of slab]. Just in case it's not
> obvious the probability we can defrag degrades exponentially with the
> increase of the hugepagesize (which also means 256M is already orders
> of magnitude more realistic to function than 1G).

True. 256M might even be worth toying with as an experiment on huge
machines with TBs of memory in fact :-)

>  Clearly if we
> increase slab to allocate with a front allocator in 256M chunk then
> our probability increases substantially, but to make something
> realistic there's at minimum an order of 10000 times between
> hugepagesize and total ram size. I.e. if 2M page makes some
> probabilistic sense with slab front-allocating 2M pages on a 64G
> system, for 256M pages to make an equivalent sense, system would
> require a minimum of 8 Terabytes of ram.

Well... such systems aren't that far around the corner, so as I said, it
might still make sense to toy a bit with it. That would definitely -not-
include my G5 workstation though :-)

>  If pages were 1G sized system would
> require 32 Terabyte of ram (and the bigger overhead and trouble we
> would have considering some allocation would still happen in 4k ptes
> and the fixed overhead of relocating those 4k ranges would be much
> bigger if the hugepage size is a lot bigger than 2M and the regular
> page size is still 4k).
> 
> > > The most important design choice is: always fallback to 4k allocation
> > > if the hugepage allocation fails! This is the _very_ opposite of some
> > > large pagecache patches that failed with -EIO back then if a 64k (or
> > > similar) allocation failed...
> > 
> > Precisely because the approach cannot work on all architectures ?
> 
> I thought the main reason for those patches was to allow a fs
> blocksize bigger than PAGE_SIZE, a PAGE_CACHE_SIZE of 64k would allow
> for a 64k fs blocksize without much fs changes. But yes, if the mmu
> can't fallback, then software can't fallback either and so it impedes
> the transparent design on those architectures... To me hugetlbfs looks
> as best as you can get on those mmu.

Right.

I need to look whether your patch would work "better" for us with our
embedded processors though.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread

Thread overview: 43+ messages
2009-10-26 18:51 RFC: Transparent Hugepage support Andrea Arcangeli
2009-10-27 15:41 ` Rik van Riel
2009-10-27 18:18 ` Andi Kleen
2009-10-27 19:30   ` Andrea Arcangeli
2009-10-28  4:28     ` Andi Kleen
2009-10-28 12:00       ` Andrea Arcangeli
2009-10-28 14:18         ` Andi Kleen
2009-10-28 14:54           ` Adam Litke
2009-10-28 15:13             ` Andi Kleen
2009-10-28 15:30               ` Andrea Arcangeli
2009-10-29 15:59             ` Dave Hansen
2009-10-31 21:32             ` Benjamin Herrenschmidt
2009-10-28 15:48           ` Andrea Arcangeli
2009-10-28 16:03             ` Andi Kleen
2009-10-28 16:22               ` Andrea Arcangeli
2009-10-28 16:34                 ` Andi Kleen
2009-10-28 16:56                   ` Adam Litke
2009-10-28 17:18                     ` Andi Kleen
2009-10-28 19:04                   ` Andrea Arcangeli
2009-10-28 19:22                     ` Andrea Arcangeli
2009-10-29  9:43       ` Ingo Molnar
2009-10-29 10:36         ` Andrea Arcangeli
2009-10-29 16:50           ` Mike Travis
2009-10-30  0:40           ` KAMEZAWA Hiroyuki
2009-11-03 10:55             ` Andrea Arcangeli
2009-11-04  0:36               ` KAMEZAWA Hiroyuki
2009-10-29 12:54     ` Andrea Arcangeli
2009-10-27 20:42 ` Christoph Lameter
2009-10-27 18:21   ` Andrea Arcangeli
2009-10-27 20:25     ` Chris Wright
2009-10-29 18:51       ` Christoph Lameter
2009-11-01 10:56         ` Andrea Arcangeli
2009-10-29 18:55     ` Christoph Lameter
2009-10-31 21:29 ` Benjamin Herrenschmidt
2009-11-03 11:18   ` Andrea Arcangeli
2009-11-03 19:10     ` Dave Hansen
2009-11-04  4:10     ` Benjamin Herrenschmidt
