All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Chris Wright <chrisw@sous-sol.org>,
	bpicco@redhat.com,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Chris Mason <chris.mason@oracle.com>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH 30 of 66] transparent hugepage core
Date: Thu, 18 Nov 2010 15:12:21 +0000	[thread overview]
Message-ID: <20101118151221.GT8135@csn.ul.ie> (raw)
In-Reply-To: <a7507af3a1dcae5c52a4.1288798085@v2.random>

On Wed, Nov 03, 2010 at 04:28:05PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing
> 
> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
>    not null)
> 
> 4) avoidance of reservation and maximization of use of hugepages whenever
>    possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
>    1 machine with 1 database with 1 database cache with 1 database cache size
>    known at boot time. It's definitely not feasible with a virtualization
>    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
>    with an unknown size of each virtual machine with an unknown amount of
>    pagecache that could be potentially useful in the host for guest not using
>    O_DIRECT (aka cache=off).
> 
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, becasue
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest both uses this patch (though the guest will limit the addition
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
> 
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
> 
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split an
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
> 
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be an "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similar to madvise(MADV_MERGEABLE).
> 
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
> 
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would result always in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
> 
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage so we can't avoid it to
> solve the race of put_page against split_huge_page_refcount to achieve
> a complete hugepage feature for KVM.
> 
> Swap and oom works fine (well just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care of a range and there's just the pmd young bit to
> check in that case.
> 
> NOTE: in some cases if the L2 cache is small, this may slowdown and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to not temporal stores. I also extensively
> researched ways to avoid this cache trashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it and they add an huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache-trashing during startup if the prefault happens in >4k
> chunks.  One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the trashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they trashes
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't payoff
> substantially with todays hardware it will payoff even less in the future with
> larger l2 caches, and the prefault logic would blot the VM a lot. If one is
> emebdded transparent_hugepage can be disabled during boot with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
> 
> This patch supports only hugepages mapped in the pmd, archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevents mixing different page size in the same
> regions so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be hoped to be found not fragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so requiring reservation. hugetlbfs is the "reservation way", the
> point of transparent hugepages is not to have any reservation at all and
> maximizing the use of cache and hugepages at all times automatically.
> 
> Some performance result:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
> ages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
> 

All that seems fine to me. Nits in part that are simply not worth
calling out. In principal, I Agree With This :)

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> * * *
> adapt to mm_counter in -mm
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> The interface changed slightly.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> * * *
> transparent hugepage bootparam
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allow transparent_hugepage=always|never|madvise at boot.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
>  	return pmd_set_flags(pmd, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -108,6 +108,9 @@ struct vm_area_struct;
>  				 __GFP_HARDWALL | __GFP_HIGHMEM | \
>  				 __GFP_MOVABLE)
>  #define GFP_IOFS	(__GFP_IO | __GFP_FS)
> +#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> +			 __GFP_NO_KSWAPD)
>  
>  #ifdef CONFIG_NUMA
>  #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,126 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address, pmd_t *pmd,
> +				      unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +			 struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +					  unsigned long addr,
> +					  pmd_t *pmd,
> +					  unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> +			struct vm_area_struct *vma,
> +			pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> +	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> +	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +enum page_check_address_pmd_flag {
> +	PAGE_CHECK_ADDRESS_PMD_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> +				     struct mm_struct *mm,
> +				     unsigned long address,
> +				     enum page_check_address_pmd_flag flag);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
> +#define transparent_hugepage_enabled(__vma)				\
> +	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma)				\
> +	((transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow()				\
> +	(transparent_hugepage_flags &					\
> +	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			  pmd_t *dst_pmd, pmd_t *src_pmd,
> +			  struct vm_area_struct *vma,
> +			  unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> +			    struct vm_area_struct *vma, unsigned long address,
> +			    pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern int split_huge_page(struct page *page);
> +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
> +#define split_huge_page_pmd(__mm, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		if (unlikely(pmd_trans_huge(*____pmd)))			\
> +			__split_huge_page_pmd(__mm, ____pmd);		\
> +	}  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> +		/*							\
> +		 * spin_unlock_wait() is just a loop in C and so the	\
> +		 * CPU can reorder anything around it.			\
> +		 */							\
> +		smp_mb();						\

Just a note as I see nothing wrong with this but that's a good spot. The
unlock isn't a memory barrier. Out of curiousity, does it really need to be
a full barrier or would a write barrier have been enough?

> +		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> +		       pmd_trans_huge(*____pmd));			\
> +	} while (0)
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> +	VM_BUG_ON(PageTail(page));
> +	return PageHead(page);
> +}

gfp.h seems an odd place for these. Should the flags go in page-flags.h
and maybe put vma_address() in internal.h?

Not a biggie.

> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> +#define transparent_hugepage_enabled(__vma) 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> +	return 0;
> +}
> +#define split_huge_page_pmd(__mm, __pmd)	\
> +	do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)	\
> +	do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
>  #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>  #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
> +#endif
>  
>  #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -240,6 +243,7 @@ struct inode;
>   * files which need it (119 of them)
>   */
>  #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>  
>  /*
>   * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
>  }
>  
>  static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> +		       struct list_head *head)
> +{
> +	list_add(&page->lru, head);
> +	__inc_zone_state(zone, NR_LRU_BASE + l);
> +	mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
>  add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
>  {
> -	list_add(&page->lru, &zone->lru[l].list);
> -	__inc_zone_state(zone, NR_LRU_BASE + l);
> -	mem_cgroup_add_lru_list(page, l);
> +	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
>  }
>  

Do these really need to be in a public header or can they move to
mm/swap.c?

>  static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
>  /* linux/mm/swap.c */
>  extern void __lru_cache_add(struct page *, enum lru_list lru);
>  extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> +			      struct page *page, struct page *page_tail);
>  extern void activate_page(struct page *);
>  extern void mark_page_accessed(struct page *);
>  extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,899 @@
> +/*
> + *  Copyright (C) 2009  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> +	(1<<TRANSPARENT_HUGEPAGE_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag enabled,
> +				enum transparent_hugepage_flag req_madv)
> +{
> +	if (test_bit(enabled, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(req_madv, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag enabled,
> +				 enum transparent_hugepage_flag req_madv)
> +{
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		set_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_FLAG,
> +				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_FLAG,
> +				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag flag)
> +{
> +	if (test_bit(flag, &transparent_hugepage_flags))
> +		return sprintf(buf, "[yes] no\n");
> +	else
> +		return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag flag)
> +{
> +	if (!memcmp("yes", buf,
> +		    min(sizeof("yes")-1, count))) {
> +		set_bit(flag, &transparent_hugepage_flags);
> +	} else if (!memcmp("no", buf,
> +			   min(sizeof("no")-1, count))) {
> +		clear_bit(flag, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +/*
> + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> + * memory just to allocate one more hugepage.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> +			   struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> +	__ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return single_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> +			       struct kobj_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	return single_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> +	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> +	&enabled_attr.attr,
> +	&defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> +	&debug_cow_attr.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> +	.attrs = hugepage_attr,
> +	.name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init hugepage_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int err;
> +
> +	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> +	if (err)
> +		printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> +	return 0;
> +}
> +module_init(hugepage_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> +	int ret = 0;
> +	if (!str)
> +		goto out;
> +	if (!strcmp(str, "always")) {
> +		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			&transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "madvise")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			&transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "never")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	}
> +out:
> +	if (!ret)
> +		printk(KERN_WARNING
> +		       "transparent_hugepage= cannot parse, ignored\n");
> +	return ret;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> +				 struct mm_struct *mm)
> +{
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +

assert_spin_locked() ?

> +	/* FIFO */
> +	if (!mm->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> +	mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long haddr, pmd_t *pmd,
> +					struct page *page)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, haddr);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_PMD_NR);
> +	__SetPageUptodate(page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		put_page(page);
> +		pte_free(mm, pgtable);
> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		/*
> +		 * The spinlocking to take the lru_lock inside
> +		 * page_add_new_anon_rmap() acts as a full memory
> +		 * barrier to be sure clear_huge_page writes become
> +		 * visible after the set_pmd_at() write.
> +		 */
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
> +			   HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       unsigned int flags)
> +{
> +	struct page *page;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	pte_t *pte;
> +
> +	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +		page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +		if (unlikely(!page))
> +			goto out;
> +
> +		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
> +	}
> +out:
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> +		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
> +	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct page *src_page;
> +	pmd_t pmd;
> +	pgtable_t pgtable;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	pgtable = pte_alloc_one(dst_mm, addr);
> +	if (unlikely(!pgtable))
> +		goto out;
> +
> +	spin_lock(&dst_mm->page_table_lock);
> +	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> +	ret = -EAGAIN;
> +	pmd = *src_pmd;
> +	if (unlikely(!pmd_trans_huge(pmd))) {
> +		pte_free(dst_mm, pgtable);
> +		goto out_unlock;
> +	}
> +	if (unlikely(pmd_trans_splitting(pmd))) {
> +		/* split huge page running from under us */
> +		spin_unlock(&src_mm->page_table_lock);
> +		spin_unlock(&dst_mm->page_table_lock);
> +		pte_free(dst_mm, pgtable);
> +
> +		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> +		goto out;
> +	}
> +	src_page = pmd_page(pmd);
> +	VM_BUG_ON(!PageHead(src_page));
> +	get_page(src_page);
> +	page_dup_rmap(src_page);
> +	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> +	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +	prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> +	ret = 0;
> +out_unlock:
> +	spin_unlock(&src_mm->page_table_lock);
> +	spin_unlock(&dst_mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	pgtable = mm->pmd_huge_pte;
> +	if (list_empty(&pgtable->lru))
> +		mm->pmd_huge_pte = NULL;
> +	else {
> +		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> +					      struct page, lru);
> +		list_del(&pgtable->lru);
> +	}
> +	return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address,
> +					pmd_t *pmd, pmd_t orig_pmd,
> +					struct page *page,
> +					unsigned long haddr)
> +{
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	int ret = 0, i;
> +	struct page **pages;
> +
> +	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> +			GFP_KERNEL);
> +	if (unlikely(!pages)) {
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> +					  vma, address);
> +		if (unlikely(!pages[i])) {
> +			while (--i >= 0)
> +				put_page(pages[i]);
> +			kfree(pages);
> +			ret |= VM_FAULT_OOM;
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		copy_user_highpage(pages[i], page + i,
> +				   haddr + PAGE_SHIFT*i, vma);
> +		__SetPageUptodate(pages[i]);
> +		cond_resched();
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_free_pages;
> +	VM_BUG_ON(!PageHead(page));
> +
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	/* leave pmd empty until pte is filled */
> +
> +	pgtable = get_pmd_huge_pte(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +		entry = mk_pte(pages[i], vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		VM_BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	kfree(pages);
> +
> +	mm->nr_ptes++;
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	page_remove_rmap(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	ret |= VM_FAULT_WRITE;
> +	put_page(page);
> +
> +out:
> +	return ret;
> +
> +out_free_pages:
> +	spin_unlock(&mm->page_table_lock);
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		put_page(pages[i]);
> +	kfree(pages);
> +	goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> +	int ret = 0;
> +	struct page *page, *new_page;
> +	unsigned long haddr;
> +
> +	VM_BUG_ON(!vma->anon_vma);
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_unlock;
> +
> +	page = pmd_page(orig_pmd);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	haddr = address & HPAGE_PMD_MASK;
> +	if (page_mapcount(page) == 1) {
> +		pmd_t entry;
> +		entry = pmd_mkyoung(orig_pmd);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
> +			update_mmu_cache(vma, address, entry);
> +		ret |= VM_FAULT_WRITE;
> +		goto out_unlock;
> +	}
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	if (transparent_hugepage_enabled(vma) &&
> +	    !transparent_hugepage_debug_cow())
> +		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +	else
> +		new_page = NULL;
> +
> +	if (unlikely(!new_page)) {
> +		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> +						   pmd, orig_pmd, page, haddr);
> +		put_page(page);
> +		goto out;
> +	}
> +
> +	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +	__SetPageUptodate(new_page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	put_page(page);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		put_page(new_page);
> +	else {
> +		pmd_t entry;
> +		VM_BUG_ON(!PageHead(page));
> +		entry = mk_pmd(new_page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		page_add_new_anon_rmap(new_page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		update_mmu_cache(vma, address, entry);
> +		page_remove_rmap(page);
> +		put_page(page);
> +		ret |= VM_FAULT_WRITE;
> +	}
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +				   unsigned long addr,
> +				   pmd_t *pmd,
> +				   unsigned int flags)
> +{
> +	struct page *page = NULL;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	if (flags & FOLL_WRITE && !pmd_write(*pmd))
> +		goto out;
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!PageHead(page));
> +	if (flags & FOLL_TOUCH) {
> +		pmd_t _pmd;
> +		/*
> +		 * We should set the dirty bit only for FOLL_WRITE but
> +		 * for now the dirty bit in the pmd is meaningless.
> +		 * And if the dirty bit will become meaningful and
> +		 * we'll only set it with FOLL_WRITE, an atomic
> +		 * set_bit will be required on the pmd to set the
> +		 * young bit, instead of the current set_pmd_at.
> +		 */
> +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> +		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> +	}
> +	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> +	VM_BUG_ON(!PageCompound(page));
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +
> +out:
> +	return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		 pmd_t *pmd)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&tlb->mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma,
> +					     pmd);
> +		} else {
> +			struct page *page;
> +			pgtable_t pgtable;
> +			pgtable = get_pmd_huge_pte(tlb->mm);
> +			page = pmd_page(*pmd);
> +			pmd_clear(pmd);
> +			page_remove_rmap(page);
> +			VM_BUG_ON(page_mapcount(page) < 0);
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			VM_BUG_ON(!PageHead(page));
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			tlb_remove_page(tlb, page);
> +			pte_free(tlb->mm, pgtable);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&tlb->mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> +			      struct mm_struct *mm,
> +			      unsigned long address,
> +			      enum page_check_address_pmd_flag flag)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd, *ret = NULL;
> +
> +	if (address & ~HPAGE_PMD_MASK)
> +		goto out;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_page(*pmd) != page)
> +		goto out;
> +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> +		  pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd)) {
> +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> +			  !pmd_trans_splitting(*pmd));
> +		ret = pmd;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> +				       struct vm_area_struct *vma,
> +				       unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd;
> +	int ret = 0;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> +	if (pmd) {
> +		/*
> +		 * We can't temporarily set the pmd to null in order
> +		 * to split it, the pmd must remain marked huge at all
> +		 * times or the VM won't take the pmd_trans_huge paths
> +		 * and it won't wait on the anon_vma->root->lock to
> +		 * serialize against split_huge_page*.
> +		 */
> +		pmdp_splitting_flush_notify(vma, address, pmd);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> +	int i;
> +	unsigned long head_index = page->index;
> +	struct zone *zone = page_zone(page);
> +
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irq(&zone->lru_lock);
> +	compound_lock(page);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +
> +		/* tail_page->_count cannot change */
> +		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> +		BUG_ON(page_count(page) <= 0);
> +		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> +		BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> +		/* after clearing PageTail the gup refcount can be released */
> +		smp_mb();
> +
> +		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		page_tail->flags |= (page->flags &
> +				     ((1L << PG_referenced) |
> +				      (1L << PG_swapbacked) |
> +				      (1L << PG_mlocked) |
> +				      (1L << PG_uptodate)));
> +		page_tail->flags |= (1L << PG_dirty);
> +
> +		/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> +
> +		/*
> +		 * __split_huge_page_splitting() already set the
> +		 * splitting bit in all pmd that could map this
> +		 * hugepage, that will ensure no CPU can alter the
> +		 * mapcount on the head page. The mapcount is only
> +		 * accounted in the head page and it has to be
> +		 * transferred to all tail pages in the below code. So
> +		 * for this code to be safe, the split the mapcount
> +		 * can't change. But that doesn't mean userland can't
> +		 * keep changing and reading the page contents while
> +		 * we transfer the mapcount, so the pmd splitting
> +		 * status is achieved setting a reserved bit in the
> +		 * pmd, not by clearing the present bit.
> +		*/
> +		BUG_ON(page_mapcount(page_tail));
> +		page_tail->_mapcount = page->_mapcount;
> +
> +		BUG_ON(page_tail->mapping);
> +		page_tail->mapping = page->mapping;
> +
> +		page_tail->index = ++head_index;
> +
> +		BUG_ON(!PageAnon(page_tail));
> +		BUG_ON(!PageUptodate(page_tail));
> +		BUG_ON(!PageDirty(page_tail));
> +		BUG_ON(!PageSwapBacked(page_tail));
> +
> +		lru_add_page_tail(zone, page, page_tail);
> +	}
> +
> +	ClearPageCompound(page);
> +	compound_unlock(page);
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +		BUG_ON(page_count(page_tail) <= 0);
> +		/*
> +		 * Tail pages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		put_page(page_tail);
> +	}
> +
> +	/*
> +	 * Only the head page (now become a regular page) is required
> +	 * to be pinned by the caller.
> +	 */
> +	BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> +				 struct vm_area_struct *vma,
> +				 unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd, _pmd;
> +	int ret = 0, i;
> +	pgtable_t pgtable;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> +	if (pmd) {
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			BUG_ON(PageCompound(page+i));
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!pmd_write(*pmd))
> +				entry = pte_wrprotect(entry);
> +			else
> +				BUG_ON(page_mapcount(page) != 1);
> +			if (!pmd_young(*pmd))
> +				entry = pte_mkold(entry);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		/*
> +		 * Up to this point the pmd is present and huge and
> +		 * userland has the whole access to the hugepage
> +		 * during the split (which happens in place). If we
> +		 * overwrite the pmd with the not-huge version
> +		 * pointing to the pte here (which of course we could
> +		 * if all CPUs were bug free), userland could trigger
> +		 * a small page size TLB miss on the small sized TLB
> +		 * while the hugepage TLB entry is still established
> +		 * in the huge TLB. Some CPU doesn't like that. See
> +		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> +		 * Erratum 383 on page 93. Intel should be safe but is
> +		 * also warns that it's only safe if the permission
> +		 * and cache attributes of the two entries loaded in
> +		 * the two TLB is identical (which should be the case
> +		 * here). But it is generally safer to never allow
> +		 * small and huge TLB entries for the same virtual
> +		 * address to be loaded simultaneously. So instead of
> +		 * doing "pmd_populate(); flush_tlb_range();" we first
> +		 * mark the current pmd notpresent (atomically because
> +		 * here the pmd_trans_huge and pmd_trans_splitting
> +		 * must remain set at all times on the pmd until the
> +		 * split is complete for this pmd), then we flush the
> +		 * SMP TLB and finally we write the non-huge version
> +		 * of the pmd entry with pmd_populate.
> +		 */
> +		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmd_populate(mm, pmd, pgtable);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +/* must be called with anon_vma->root->lock hold */
> +static void __split_huge_page(struct page *page,
> +			      struct anon_vma *anon_vma)
> +{
> +	int mapcount, mapcount2;
> +	struct anon_vma_chain *avc;
> +
> +	BUG_ON(!PageHead(page));
> +	BUG_ON(PageTail(page));
> +
> +	mapcount = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount += __split_huge_page_splitting(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != page_mapcount(page));
> +
> +	__split_huge_page_refcount(page);
> +
> +	mapcount2 = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount2 += __split_huge_page_map(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != mapcount2);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	int ret = 1;
> +
> +	BUG_ON(!PageAnon(page));
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		goto out;
> +	ret = 0;
> +	if (!PageCompound(page))
> +		goto out_unlock;
> +
> +	BUG_ON(!PageSwapBacked(page));
> +	__split_huge_page(page, anon_vma);
> +
> +	BUG_ON(PageCompound(page));
> +out_unlock:
> +	page_unlock_anon_vma(anon_vma);
> +out:
> +	return ret;
> +}
> +
> +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	struct page *page;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		return;
> +	}
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!page_count(page));
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	split_huge_page(page);
> +
> +	put_page(page);
> +	BUG_ON(pmd_trans_huge(*pmd));
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -726,9 +726,9 @@ out_set_pte:
>  	return 0;
>  }
>  
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> -		unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> +		   unsigned long addr, unsigned long end)
>  {
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
> @@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*src_pmd)) {
> +			int err;
> +			err = copy_huge_pmd(dst_mm, src_mm,
> +					    dst_pmd, src_pmd, addr, vma);
> +			if (err == -ENOMEM)
> +				return -ENOMEM;
> +			if (!err)
> +				continue;
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*pmd)) {
> +			if (next-addr != HPAGE_PMD_SIZE)
> +				split_huge_page_pmd(vma->vm_mm, pmd);
> +			else if (zap_huge_pmd(tlb, vma, pmd)) {
> +				(*zap_work)--;
> +				continue;
> +			}
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(pmd)) {
>  			(*zap_work)--;
>  			continue;
> @@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
> -	if (pmd_huge(*pmd)) {
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
>  		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>  		goto out;
>  	}
> +	if (pmd_trans_huge(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		if (likely(pmd_trans_huge(*pmd))) {
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				page = follow_trans_huge_pmd(mm, address,
> +							     pmd, flags);
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +		/* fall through */
> +	}
>  	if (unlikely(pmd_bad(*pmd)))
>  		goto no_page_table;
>  
> @@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
>   * but allow concurrent faults), and pte mapped but not yet locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> +		     struct vm_area_struct *vma, unsigned long address,
> +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
>  {
>  	pte_t entry;
>  	spinlock_t *ptl;
> @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> -	pte = pte_alloc_map(mm, vma, pmd, address);
> -	if (!pte)
> +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		if (!vma->vm_ops)
> +			return do_huge_pmd_anonymous_page(mm, vma, address,
> +							  pmd, flags);
> +	} else {
> +		pmd_t orig_pmd = *pmd;
> +		barrier();

What is this barrier for?

> +		if (pmd_trans_huge(orig_pmd)) {
> +			if (flags & FAULT_FLAG_WRITE &&
> +			    !pmd_write(orig_pmd) &&
> +			    !pmd_trans_splitting(orig_pmd))
> +				return do_huge_pmd_wp_page(mm, vma, address,
> +							   pmd, orig_pmd);
> +			return 0;
> +		}
> +	}
> +
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
>  		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
>  
>  	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
>   * Returns virtual address or -EFAULT if page's index/offset is not
>   * within the range mapped the @vma.
>   */
> -static inline unsigned long
> +inline unsigned long
>  vma_address(struct page *page, struct vm_area_struct *vma)
>  {
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
>  	pmd = pmd_offset(pud, address);
>  	if (!pmd_present(*pmd))
>  		return NULL;
> +	if (pmd_trans_huge(*pmd))
> +		return NULL;
>  
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
> @@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
>  			unsigned long *vm_flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *pte;
> -	spinlock_t *ptl;
>  	int referenced = 0;
>  
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
>  	/*
>  	 * Don't want to elevate referenced for mlocked page that gets this far,
>  	 * in order that it progresses to try_to_unmap and is moved to the
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 0;	/* break early from loop */
>  		*vm_flags |= VM_LOCKED;
> -		goto out_unmap;
> -	}
> -
> -	if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -		/*
> -		 * Don't treat a reference through a sequentially read
> -		 * mapping as such.  If the page has been used in
> -		 * another mapping, we will catch it; if this other
> -		 * mapping is already gone, the unmap path will have
> -		 * set PG_referenced or activated the page.
> -		 */
> -		if (likely(!VM_SequentialReadHint(vma)))
> -			referenced++;
> +		goto out;
>  	}
>  
>  	/* Pretend the page is referenced if the task has the
> @@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
>  			rwsem_is_locked(&mm->mmap_sem))
>  		referenced++;
>  
> -out_unmap:
> +	if (unlikely(PageTransHuge(page))) {
> +		pmd_t *pmd;
> +
> +		spin_lock(&mm->page_table_lock);
> +		pmd = page_check_address_pmd(page, mm, address,
> +					     PAGE_CHECK_ADDRESS_PMD_FLAG);
> +		if (pmd && !pmd_trans_splitting(*pmd) &&
> +		    pmdp_clear_flush_young_notify(vma, address, pmd))
> +			referenced++;
> +		spin_unlock(&mm->page_table_lock);
> +	} else {
> +		pte_t *pte;
> +		spinlock_t *ptl;
> +
> +		pte = page_check_address(page, mm, address, &ptl, 0);
> +		if (!pte)
> +			goto out;
> +
> +		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> +			/*
> +			 * Don't treat a reference through a sequentially read
> +			 * mapping as such.  If the page has been used in
> +			 * another mapping, we will catch it; if this other
> +			 * mapping is already gone, the unmap path will have
> +			 * set PG_referenced or activated the page.
> +			 */
> +			if (likely(!VM_SequentialReadHint(vma)))
> +				referenced++;
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +
>  	(*mapcount)--;
> -	pte_unmap_unlock(pte, ptl);
>  
>  	if (referenced)
>  		*vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
>  
>  EXPORT_SYMBOL(__pagevec_release);
>  
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> +		       struct page *page, struct page *page_tail)
> +{
> +	int active;
> +	enum lru_list lru;
> +	const int file = 0;
> +	struct list_head *head;
> +
> +	VM_BUG_ON(!PageHead(page));
> +	VM_BUG_ON(PageCompound(page_tail));
> +	VM_BUG_ON(PageLRU(page_tail));
> +	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> +	SetPageLRU(page_tail);
> +
> +	if (page_evictable(page_tail, NULL)) {
> +		if (PageActive(page)) {
> +			SetPageActive(page_tail);
> +			active = 1;
> +			lru = LRU_ACTIVE_ANON;
> +		} else {
> +			active = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
> +		update_page_reclaim_stat(zone, page_tail, file, active);
> +		if (likely(PageLRU(page)))
> +			head = page->lru.prev;
> +		else
> +			head = &zone->lru[lru].list;
> +		__add_page_to_lru_list(zone, page_tail, lru, head);
> +	} else {
> +		SetPageUnevictable(page_tail);
> +		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> +	}
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
> 

Other than a few minor questions, these seems very similar to what you
had before. There is a lot going on in this patch but I did not find
anything wrong.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Ingo Molnar <mingo@elte.hu>, Mike Travis <travis@sgi.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Chris Wright <chrisw@sous-sol.org>,
	bpicco@redhat.com,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Chris Mason <chris.mason@oracle.com>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH 30 of 66] transparent hugepage core
Date: Thu, 18 Nov 2010 15:12:21 +0000	[thread overview]
Message-ID: <20101118151221.GT8135@csn.ul.ie> (raw)
In-Reply-To: <a7507af3a1dcae5c52a4.1288798085@v2.random>

On Wed, Nov 03, 2010 at 04:28:05PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing
> 
> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
>    not null)
> 
> 4) avoidance of reservation and maximization of use of hugepages whenever
>    possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
>    1 machine with 1 database with 1 database cache with 1 database cache size
>    known at boot time. It's definitely not feasible with a virtualization
>    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
>    with an unknown size of each virtual machine with an unknown amount of
>    pagecache that could be potentially useful in the host for guest not using
>    O_DIRECT (aka cache=off).
> 
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, becasue
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest both uses this patch (though the guest will limit the addition
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
> 
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
> 
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split an
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
> 
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be an "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similar to madvise(MADV_MERGEABLE).
> 
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
> 
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would result always in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
> 
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage so we can't avoid it to
> solve the race of put_page against split_huge_page_refcount to achieve
> a complete hugepage feature for KVM.
> 
> Swap and oom works fine (well just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care of a range and there's just the pmd young bit to
> check in that case.
> 
> NOTE: in some cases if the L2 cache is small, this may slowdown and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to not temporal stores. I also extensively
> researched ways to avoid this cache trashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it and they add an huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache-trashing during startup if the prefault happens in >4k
> chunks.  One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the trashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they trashes
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't payoff
> substantially with todays hardware it will payoff even less in the future with
> larger l2 caches, and the prefault logic would blot the VM a lot. If one is
> emebdded transparent_hugepage can be disabled during boot with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
> 
> This patch supports only hugepages mapped in the pmd, archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevents mixing different page size in the same
> regions so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be hoped to be found not fragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so requiring reservation. hugetlbfs is the "reservation way", the
> point of transparent hugepages is not to have any reservation at all and
> maximizing the use of cache and hugepages at all times automatically.
> 
> Some performance result:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
> ages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
> 

All that seems fine to me. Nits in part that are simply not worth
calling out. In principal, I Agree With This :)

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> * * *
> adapt to mm_counter in -mm
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> The interface changed slightly.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> * * *
> transparent hugepage bootparam
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allow transparent_hugepage=always|never|madvise at boot.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
>  	return pmd_set_flags(pmd, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -108,6 +108,9 @@ struct vm_area_struct;
>  				 __GFP_HARDWALL | __GFP_HIGHMEM | \
>  				 __GFP_MOVABLE)
>  #define GFP_IOFS	(__GFP_IO | __GFP_FS)
> +#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> +			 __GFP_NO_KSWAPD)
>  
>  #ifdef CONFIG_NUMA
>  #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,126 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address, pmd_t *pmd,
> +				      unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +			 struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +					  unsigned long addr,
> +					  pmd_t *pmd,
> +					  unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> +			struct vm_area_struct *vma,
> +			pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> +	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> +	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +enum page_check_address_pmd_flag {
> +	PAGE_CHECK_ADDRESS_PMD_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> +				     struct mm_struct *mm,
> +				     unsigned long address,
> +				     enum page_check_address_pmd_flag flag);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
> +#define transparent_hugepage_enabled(__vma)				\
> +	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma)				\
> +	((transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow()				\
> +	(transparent_hugepage_flags &					\
> +	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			  pmd_t *dst_pmd, pmd_t *src_pmd,
> +			  struct vm_area_struct *vma,
> +			  unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> +			    struct vm_area_struct *vma, unsigned long address,
> +			    pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern int split_huge_page(struct page *page);
> +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
> +#define split_huge_page_pmd(__mm, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		if (unlikely(pmd_trans_huge(*____pmd)))			\
> +			__split_huge_page_pmd(__mm, ____pmd);		\
> +	}  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> +		/*							\
> +		 * spin_unlock_wait() is just a loop in C and so the	\
> +		 * CPU can reorder anything around it.			\
> +		 */							\
> +		smp_mb();						\

Just a note as I see nothing wrong with this but that's a good spot. The
unlock isn't a memory barrier. Out of curiousity, does it really need to be
a full barrier or would a write barrier have been enough?

> +		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> +		       pmd_trans_huge(*____pmd));			\
> +	} while (0)
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> +	VM_BUG_ON(PageTail(page));
> +	return PageHead(page);
> +}

gfp.h seems an odd place for these. Should the flags go in page-flags.h
and maybe put vma_address() in internal.h?

Not a biggie.

> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> +#define transparent_hugepage_enabled(__vma) 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> +	return 0;
> +}
> +#define split_huge_page_pmd(__mm, __pmd)	\
> +	do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)	\
> +	do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
>  #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>  #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
> +#endif
>  
>  #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -240,6 +243,7 @@ struct inode;
>   * files which need it (119 of them)
>   */
>  #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>  
>  /*
>   * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
>  }
>  
>  static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> +		       struct list_head *head)
> +{
> +	list_add(&page->lru, head);
> +	__inc_zone_state(zone, NR_LRU_BASE + l);
> +	mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
>  add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
>  {
> -	list_add(&page->lru, &zone->lru[l].list);
> -	__inc_zone_state(zone, NR_LRU_BASE + l);
> -	mem_cgroup_add_lru_list(page, l);
> +	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
>  }
>  

Do these really need to be in a public header or can they move to
mm/swap.c?

>  static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
>  /* linux/mm/swap.c */
>  extern void __lru_cache_add(struct page *, enum lru_list lru);
>  extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> +			      struct page *page, struct page *page_tail);
>  extern void activate_page(struct page *);
>  extern void mark_page_accessed(struct page *);
>  extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,899 @@
> +/*
> + *  Copyright (C) 2009  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> +	(1<<TRANSPARENT_HUGEPAGE_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag enabled,
> +				enum transparent_hugepage_flag req_madv)
> +{
> +	if (test_bit(enabled, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(req_madv, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag enabled,
> +				 enum transparent_hugepage_flag req_madv)
> +{
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		set_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_FLAG,
> +				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_FLAG,
> +				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag flag)
> +{
> +	if (test_bit(flag, &transparent_hugepage_flags))
> +		return sprintf(buf, "[yes] no\n");
> +	else
> +		return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag flag)
> +{
> +	if (!memcmp("yes", buf,
> +		    min(sizeof("yes")-1, count))) {
> +		set_bit(flag, &transparent_hugepage_flags);
> +	} else if (!memcmp("no", buf,
> +			   min(sizeof("no")-1, count))) {
> +		clear_bit(flag, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +/*
> + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> + * memory just to allocate one more hugepage.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> +			   struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> +	__ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return single_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> +			       struct kobj_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	return single_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> +	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> +	&enabled_attr.attr,
> +	&defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> +	&debug_cow_attr.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> +	.attrs = hugepage_attr,
> +	.name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init hugepage_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int err;
> +
> +	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> +	if (err)
> +		printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> +	return 0;
> +}
> +module_init(hugepage_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> +	int ret = 0;
> +	if (!str)
> +		goto out;
> +	if (!strcmp(str, "always")) {
> +		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			&transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "madvise")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			&transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "never")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	}
> +out:
> +	if (!ret)
> +		printk(KERN_WARNING
> +		       "transparent_hugepage= cannot parse, ignored\n");
> +	return ret;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> +				 struct mm_struct *mm)
> +{
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +

assert_spin_locked() ?

> +	/* FIFO */
> +	if (!mm->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> +	mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long haddr, pmd_t *pmd,
> +					struct page *page)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, haddr);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_PMD_NR);
> +	__SetPageUptodate(page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		put_page(page);
> +		pte_free(mm, pgtable);
> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		/*
> +		 * The spinlocking to take the lru_lock inside
> +		 * page_add_new_anon_rmap() acts as a full memory
> +		 * barrier to be sure clear_huge_page writes become
> +		 * visible after the set_pmd_at() write.
> +		 */
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
> +			   HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       unsigned int flags)
> +{
> +	struct page *page;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	pte_t *pte;
> +
> +	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +		page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +		if (unlikely(!page))
> +			goto out;
> +
> +		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
> +	}
> +out:
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> +		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
> +	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct page *src_page;
> +	pmd_t pmd;
> +	pgtable_t pgtable;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	pgtable = pte_alloc_one(dst_mm, addr);
> +	if (unlikely(!pgtable))
> +		goto out;
> +
> +	spin_lock(&dst_mm->page_table_lock);
> +	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> +	ret = -EAGAIN;
> +	pmd = *src_pmd;
> +	if (unlikely(!pmd_trans_huge(pmd))) {
> +		pte_free(dst_mm, pgtable);
> +		goto out_unlock;
> +	}
> +	if (unlikely(pmd_trans_splitting(pmd))) {
> +		/* split huge page running from under us */
> +		spin_unlock(&src_mm->page_table_lock);
> +		spin_unlock(&dst_mm->page_table_lock);
> +		pte_free(dst_mm, pgtable);
> +
> +		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> +		goto out;
> +	}
> +	src_page = pmd_page(pmd);
> +	VM_BUG_ON(!PageHead(src_page));
> +	get_page(src_page);
> +	page_dup_rmap(src_page);
> +	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> +	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +	prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> +	ret = 0;
> +out_unlock:
> +	spin_unlock(&src_mm->page_table_lock);
> +	spin_unlock(&dst_mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	pgtable = mm->pmd_huge_pte;
> +	if (list_empty(&pgtable->lru))
> +		mm->pmd_huge_pte = NULL;
> +	else {
> +		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> +					      struct page, lru);
> +		list_del(&pgtable->lru);
> +	}
> +	return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address,
> +					pmd_t *pmd, pmd_t orig_pmd,
> +					struct page *page,
> +					unsigned long haddr)
> +{
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	int ret = 0, i;
> +	struct page **pages;
> +
> +	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> +			GFP_KERNEL);
> +	if (unlikely(!pages)) {
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> +					  vma, address);
> +		if (unlikely(!pages[i])) {
> +			while (--i >= 0)
> +				put_page(pages[i]);
> +			kfree(pages);
> +			ret |= VM_FAULT_OOM;
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		copy_user_highpage(pages[i], page + i,
> +				   haddr + PAGE_SHIFT*i, vma);
> +		__SetPageUptodate(pages[i]);
> +		cond_resched();
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_free_pages;
> +	VM_BUG_ON(!PageHead(page));
> +
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	/* leave pmd empty until pte is filled */
> +
> +	pgtable = get_pmd_huge_pte(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +		entry = mk_pte(pages[i], vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		VM_BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	kfree(pages);
> +
> +	mm->nr_ptes++;
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	page_remove_rmap(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	ret |= VM_FAULT_WRITE;
> +	put_page(page);
> +
> +out:
> +	return ret;
> +
> +out_free_pages:
> +	spin_unlock(&mm->page_table_lock);
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		put_page(pages[i]);
> +	kfree(pages);
> +	goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> +	int ret = 0;
> +	struct page *page, *new_page;
> +	unsigned long haddr;
> +
> +	VM_BUG_ON(!vma->anon_vma);
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_unlock;
> +
> +	page = pmd_page(orig_pmd);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	haddr = address & HPAGE_PMD_MASK;
> +	if (page_mapcount(page) == 1) {
> +		pmd_t entry;
> +		entry = pmd_mkyoung(orig_pmd);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
> +			update_mmu_cache(vma, address, entry);
> +		ret |= VM_FAULT_WRITE;
> +		goto out_unlock;
> +	}
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	if (transparent_hugepage_enabled(vma) &&
> +	    !transparent_hugepage_debug_cow())
> +		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +	else
> +		new_page = NULL;
> +
> +	if (unlikely(!new_page)) {
> +		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> +						   pmd, orig_pmd, page, haddr);
> +		put_page(page);
> +		goto out;
> +	}
> +
> +	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +	__SetPageUptodate(new_page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	put_page(page);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		put_page(new_page);
> +	else {
> +		pmd_t entry;
> +		VM_BUG_ON(!PageHead(page));
> +		entry = mk_pmd(new_page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		page_add_new_anon_rmap(new_page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		update_mmu_cache(vma, address, entry);
> +		page_remove_rmap(page);
> +		put_page(page);
> +		ret |= VM_FAULT_WRITE;
> +	}
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +				   unsigned long addr,
> +				   pmd_t *pmd,
> +				   unsigned int flags)
> +{
> +	struct page *page = NULL;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	if (flags & FOLL_WRITE && !pmd_write(*pmd))
> +		goto out;
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!PageHead(page));
> +	if (flags & FOLL_TOUCH) {
> +		pmd_t _pmd;
> +		/*
> +		 * We should set the dirty bit only for FOLL_WRITE but
> +		 * for now the dirty bit in the pmd is meaningless.
> +		 * And if the dirty bit will become meaningful and
> +		 * we'll only set it with FOLL_WRITE, an atomic
> +		 * set_bit will be required on the pmd to set the
> +		 * young bit, instead of the current set_pmd_at.
> +		 */
> +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> +		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> +	}
> +	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> +	VM_BUG_ON(!PageCompound(page));
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +
> +out:
> +	return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		 pmd_t *pmd)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&tlb->mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma,
> +					     pmd);
> +		} else {
> +			struct page *page;
> +			pgtable_t pgtable;
> +			pgtable = get_pmd_huge_pte(tlb->mm);
> +			page = pmd_page(*pmd);
> +			pmd_clear(pmd);
> +			page_remove_rmap(page);
> +			VM_BUG_ON(page_mapcount(page) < 0);
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			VM_BUG_ON(!PageHead(page));
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			tlb_remove_page(tlb, page);
> +			pte_free(tlb->mm, pgtable);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&tlb->mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> +			      struct mm_struct *mm,
> +			      unsigned long address,
> +			      enum page_check_address_pmd_flag flag)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd, *ret = NULL;
> +
> +	if (address & ~HPAGE_PMD_MASK)
> +		goto out;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_page(*pmd) != page)
> +		goto out;
> +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> +		  pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd)) {
> +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> +			  !pmd_trans_splitting(*pmd));
> +		ret = pmd;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> +				       struct vm_area_struct *vma,
> +				       unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd;
> +	int ret = 0;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> +	if (pmd) {
> +		/*
> +		 * We can't temporarily set the pmd to null in order
> +		 * to split it, the pmd must remain marked huge at all
> +		 * times or the VM won't take the pmd_trans_huge paths
> +		 * and it won't wait on the anon_vma->root->lock to
> +		 * serialize against split_huge_page*.
> +		 */
> +		pmdp_splitting_flush_notify(vma, address, pmd);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> +	int i;
> +	unsigned long head_index = page->index;
> +	struct zone *zone = page_zone(page);
> +
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irq(&zone->lru_lock);
> +	compound_lock(page);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +
> +		/* tail_page->_count cannot change */
> +		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> +		BUG_ON(page_count(page) <= 0);
> +		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> +		BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> +		/* after clearing PageTail the gup refcount can be released */
> +		smp_mb();
> +
> +		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		page_tail->flags |= (page->flags &
> +				     ((1L << PG_referenced) |
> +				      (1L << PG_swapbacked) |
> +				      (1L << PG_mlocked) |
> +				      (1L << PG_uptodate)));
> +		page_tail->flags |= (1L << PG_dirty);
> +
> +		/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> +
> +		/*
> +		 * __split_huge_page_splitting() already set the
> +		 * splitting bit in all pmd that could map this
> +		 * hugepage, that will ensure no CPU can alter the
> +		 * mapcount on the head page. The mapcount is only
> +		 * accounted in the head page and it has to be
> +		 * transferred to all tail pages in the below code. So
> +		 * for this code to be safe, the split the mapcount
> +		 * can't change. But that doesn't mean userland can't
> +		 * keep changing and reading the page contents while
> +		 * we transfer the mapcount, so the pmd splitting
> +		 * status is achieved setting a reserved bit in the
> +		 * pmd, not by clearing the present bit.
> +		*/
> +		BUG_ON(page_mapcount(page_tail));
> +		page_tail->_mapcount = page->_mapcount;
> +
> +		BUG_ON(page_tail->mapping);
> +		page_tail->mapping = page->mapping;
> +
> +		page_tail->index = ++head_index;
> +
> +		BUG_ON(!PageAnon(page_tail));
> +		BUG_ON(!PageUptodate(page_tail));
> +		BUG_ON(!PageDirty(page_tail));
> +		BUG_ON(!PageSwapBacked(page_tail));
> +
> +		lru_add_page_tail(zone, page, page_tail);
> +	}
> +
> +	ClearPageCompound(page);
> +	compound_unlock(page);
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +		BUG_ON(page_count(page_tail) <= 0);
> +		/*
> +		 * Tail pages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		put_page(page_tail);
> +	}
> +
> +	/*
> +	 * Only the head page (now become a regular page) is required
> +	 * to be pinned by the caller.
> +	 */
> +	BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> +				 struct vm_area_struct *vma,
> +				 unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd, _pmd;
> +	int ret = 0, i;
> +	pgtable_t pgtable;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> +	if (pmd) {
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			BUG_ON(PageCompound(page+i));
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!pmd_write(*pmd))
> +				entry = pte_wrprotect(entry);
> +			else
> +				BUG_ON(page_mapcount(page) != 1);
> +			if (!pmd_young(*pmd))
> +				entry = pte_mkold(entry);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		/*
> +		 * Up to this point the pmd is present and huge and
> +		 * userland has the whole access to the hugepage
> +		 * during the split (which happens in place). If we
> +		 * overwrite the pmd with the not-huge version
> +		 * pointing to the pte here (which of course we could
> +		 * if all CPUs were bug free), userland could trigger
> +		 * a small page size TLB miss on the small sized TLB
> +		 * while the hugepage TLB entry is still established
> +		 * in the huge TLB. Some CPU doesn't like that. See
> +		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> +		 * Erratum 383 on page 93. Intel should be safe but is
> +		 * also warns that it's only safe if the permission
> +		 * and cache attributes of the two entries loaded in
> +		 * the two TLB is identical (which should be the case
> +		 * here). But it is generally safer to never allow
> +		 * small and huge TLB entries for the same virtual
> +		 * address to be loaded simultaneously. So instead of
> +		 * doing "pmd_populate(); flush_tlb_range();" we first
> +		 * mark the current pmd notpresent (atomically because
> +		 * here the pmd_trans_huge and pmd_trans_splitting
> +		 * must remain set at all times on the pmd until the
> +		 * split is complete for this pmd), then we flush the
> +		 * SMP TLB and finally we write the non-huge version
> +		 * of the pmd entry with pmd_populate.
> +		 */
> +		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmd_populate(mm, pmd, pgtable);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +/* must be called with anon_vma->root->lock hold */
> +static void __split_huge_page(struct page *page,
> +			      struct anon_vma *anon_vma)
> +{
> +	int mapcount, mapcount2;
> +	struct anon_vma_chain *avc;
> +
> +	BUG_ON(!PageHead(page));
> +	BUG_ON(PageTail(page));
> +
> +	mapcount = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount += __split_huge_page_splitting(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != page_mapcount(page));
> +
> +	__split_huge_page_refcount(page);
> +
> +	mapcount2 = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount2 += __split_huge_page_map(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != mapcount2);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	int ret = 1;
> +
> +	BUG_ON(!PageAnon(page));
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		goto out;
> +	ret = 0;
> +	if (!PageCompound(page))
> +		goto out_unlock;
> +
> +	BUG_ON(!PageSwapBacked(page));
> +	__split_huge_page(page, anon_vma);
> +
> +	BUG_ON(PageCompound(page));
> +out_unlock:
> +	page_unlock_anon_vma(anon_vma);
> +out:
> +	return ret;
> +}
> +
> +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	struct page *page;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		return;
> +	}
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!page_count(page));
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	split_huge_page(page);
> +
> +	put_page(page);
> +	BUG_ON(pmd_trans_huge(*pmd));
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -726,9 +726,9 @@ out_set_pte:
>  	return 0;
>  }
>  
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> -		unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> +		   unsigned long addr, unsigned long end)
>  {
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
> @@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*src_pmd)) {
> +			int err;
> +			err = copy_huge_pmd(dst_mm, src_mm,
> +					    dst_pmd, src_pmd, addr, vma);
> +			if (err == -ENOMEM)
> +				return -ENOMEM;
> +			if (!err)
> +				continue;
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*pmd)) {
> +			if (next-addr != HPAGE_PMD_SIZE)
> +				split_huge_page_pmd(vma->vm_mm, pmd);
> +			else if (zap_huge_pmd(tlb, vma, pmd)) {
> +				(*zap_work)--;
> +				continue;
> +			}
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(pmd)) {
>  			(*zap_work)--;
>  			continue;
> @@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
> -	if (pmd_huge(*pmd)) {
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
>  		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>  		goto out;
>  	}
> +	if (pmd_trans_huge(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		if (likely(pmd_trans_huge(*pmd))) {
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				page = follow_trans_huge_pmd(mm, address,
> +							     pmd, flags);
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +		/* fall through */
> +	}
>  	if (unlikely(pmd_bad(*pmd)))
>  		goto no_page_table;
>  
> @@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
>   * but allow concurrent faults), and pte mapped but not yet locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> +		     struct vm_area_struct *vma, unsigned long address,
> +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
>  {
>  	pte_t entry;
>  	spinlock_t *ptl;
> @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> -	pte = pte_alloc_map(mm, vma, pmd, address);
> -	if (!pte)
> +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		if (!vma->vm_ops)
> +			return do_huge_pmd_anonymous_page(mm, vma, address,
> +							  pmd, flags);
> +	} else {
> +		pmd_t orig_pmd = *pmd;
> +		barrier();

What is this barrier for?

> +		if (pmd_trans_huge(orig_pmd)) {
> +			if (flags & FAULT_FLAG_WRITE &&
> +			    !pmd_write(orig_pmd) &&
> +			    !pmd_trans_splitting(orig_pmd))
> +				return do_huge_pmd_wp_page(mm, vma, address,
> +							   pmd, orig_pmd);
> +			return 0;
> +		}
> +	}
> +
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
>  		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
>  
>  	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
>   * Returns virtual address or -EFAULT if page's index/offset is not
>   * within the range mapped the @vma.
>   */
> -static inline unsigned long
> +inline unsigned long
>  vma_address(struct page *page, struct vm_area_struct *vma)
>  {
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
>  	pmd = pmd_offset(pud, address);
>  	if (!pmd_present(*pmd))
>  		return NULL;
> +	if (pmd_trans_huge(*pmd))
> +		return NULL;
>  
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
> @@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
>  			unsigned long *vm_flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *pte;
> -	spinlock_t *ptl;
>  	int referenced = 0;
>  
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
>  	/*
>  	 * Don't want to elevate referenced for mlocked page that gets this far,
>  	 * in order that it progresses to try_to_unmap and is moved to the
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 0;	/* break early from loop */
>  		*vm_flags |= VM_LOCKED;
> -		goto out_unmap;
> -	}
> -
> -	if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -		/*
> -		 * Don't treat a reference through a sequentially read
> -		 * mapping as such.  If the page has been used in
> -		 * another mapping, we will catch it; if this other
> -		 * mapping is already gone, the unmap path will have
> -		 * set PG_referenced or activated the page.
> -		 */
> -		if (likely(!VM_SequentialReadHint(vma)))
> -			referenced++;
> +		goto out;
>  	}
>  
>  	/* Pretend the page is referenced if the task has the
> @@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
>  			rwsem_is_locked(&mm->mmap_sem))
>  		referenced++;
>  
> -out_unmap:
> +	if (unlikely(PageTransHuge(page))) {
> +		pmd_t *pmd;
> +
> +		spin_lock(&mm->page_table_lock);
> +		pmd = page_check_address_pmd(page, mm, address,
> +					     PAGE_CHECK_ADDRESS_PMD_FLAG);
> +		if (pmd && !pmd_trans_splitting(*pmd) &&
> +		    pmdp_clear_flush_young_notify(vma, address, pmd))
> +			referenced++;
> +		spin_unlock(&mm->page_table_lock);
> +	} else {
> +		pte_t *pte;
> +		spinlock_t *ptl;
> +
> +		pte = page_check_address(page, mm, address, &ptl, 0);
> +		if (!pte)
> +			goto out;
> +
> +		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> +			/*
> +			 * Don't treat a reference through a sequentially read
> +			 * mapping as such.  If the page has been used in
> +			 * another mapping, we will catch it; if this other
> +			 * mapping is already gone, the unmap path will have
> +			 * set PG_referenced or activated the page.
> +			 */
> +			if (likely(!VM_SequentialReadHint(vma)))
> +				referenced++;
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +
>  	(*mapcount)--;
> -	pte_unmap_unlock(pte, ptl);
>  
>  	if (referenced)
>  		*vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
>  
>  EXPORT_SYMBOL(__pagevec_release);
>  
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> +		       struct page *page, struct page *page_tail)
> +{
> +	int active;
> +	enum lru_list lru;
> +	const int file = 0;
> +	struct list_head *head;
> +
> +	VM_BUG_ON(!PageHead(page));
> +	VM_BUG_ON(PageCompound(page_tail));
> +	VM_BUG_ON(PageLRU(page_tail));
> +	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> +	SetPageLRU(page_tail);
> +
> +	if (page_evictable(page_tail, NULL)) {
> +		if (PageActive(page)) {
> +			SetPageActive(page_tail);
> +			active = 1;
> +			lru = LRU_ACTIVE_ANON;
> +		} else {
> +			active = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
> +		update_page_reclaim_stat(zone, page_tail, file, active);
> +		if (likely(PageLRU(page)))
> +			head = page->lru.prev;
> +		else
> +			head = &zone->lru[lru].list;
> +		__add_page_to_lru_list(zone, page_tail, lru, head);
> +	} else {
> +		SetPageUnevictable(page_tail);
> +		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> +	}
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
> 

Other than a few minor questions, these seems very similar to what you
had before. There is a lot going on in this patch but I did not find
anything wrong.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2010-11-18 15:12 UTC|newest]

Thread overview: 331+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-03 15:27 [PATCH 00 of 66] Transparent Hugepage Support #32 Andrea Arcangeli
2010-11-03 15:27 ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 01 of 66] disable lumpy when compaction is enabled Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:18   ` KOSAKI Motohiro
2010-11-09  3:18     ` KOSAKI Motohiro
2010-11-09 21:30     ` Andrea Arcangeli
2010-11-09 21:30       ` Andrea Arcangeli
2010-11-09 21:38       ` Mel Gorman
2010-11-09 21:38         ` Mel Gorman
2010-11-09 22:22         ` Andrea Arcangeli
2010-11-09 22:22           ` Andrea Arcangeli
2010-11-10 14:27           ` Mel Gorman
2010-11-10 14:27             ` Mel Gorman
2010-11-10 16:03             ` Andrea Arcangeli
2010-11-10 16:03               ` Andrea Arcangeli
2010-11-18  8:30   ` Mel Gorman
2010-11-18  8:30     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:01   ` KOSAKI Motohiro
2010-11-09  3:01     ` KOSAKI Motohiro
2010-11-09 21:36     ` Andrea Arcangeli
2010-11-09 21:36       ` Andrea Arcangeli
2010-11-18 11:13   ` Mel Gorman
2010-11-18 11:13     ` Mel Gorman
2010-11-18 17:13     ` Mel Gorman
2010-11-18 17:13       ` Mel Gorman
2010-11-19 17:38     ` Andrea Arcangeli
2010-11-19 17:38       ` Andrea Arcangeli
2010-11-19 17:54       ` Linus Torvalds
2010-11-19 17:54         ` Linus Torvalds
2010-11-19 19:26         ` Andrea Arcangeli
2010-11-19 19:26           ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 03 of 66] transparent hugepage support documentation Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:41   ` Mel Gorman
2010-11-18 11:41     ` Mel Gorman
2010-11-25 14:35     ` Andrea Arcangeli
2010-11-25 14:35       ` Andrea Arcangeli
2010-11-26 11:40       ` Mel Gorman
2010-11-26 11:40         ` Mel Gorman
2010-11-03 15:27 ` [PATCH 04 of 66] define MADV_HUGEPAGE Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:44   ` Mel Gorman
2010-11-18 11:44     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 05 of 66] compound_lock Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:49   ` Mel Gorman
2010-11-18 11:49     ` Mel Gorman
2010-11-18 17:28     ` Linus Torvalds
2010-11-18 17:28       ` Linus Torvalds
2010-11-25 16:21       ` Andrea Arcangeli
2010-11-25 16:21         ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 06 of 66] alter compound get_page/put_page Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:37   ` Mel Gorman
2010-11-18 12:37     ` Mel Gorman
2010-11-25 16:49     ` Andrea Arcangeli
2010-11-25 16:49       ` Andrea Arcangeli
2010-11-26 11:46       ` Mel Gorman
2010-11-26 11:46         ` Mel Gorman
2010-11-03 15:27 ` [PATCH 07 of 66] update futex compound knowledge Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 08 of 66] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:03   ` KOSAKI Motohiro
2010-11-09  3:03     ` KOSAKI Motohiro
2010-11-18 12:40   ` Mel Gorman
2010-11-18 12:40     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 09 of 66] clear compound mapping Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 10 of 66] add native_set_pmd_at Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 11 of 66] add pmd paravirt ops Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-26 18:01   ` Andrea Arcangeli
2010-11-26 18:01     ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 12 of 66] no paravirt version of pmd ops Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 13 of 66] export maybe_mkwrite Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2011-01-17 14:14   ` Michal Simek
2011-01-17 14:14     ` Michal Simek
2011-01-17 14:33     ` Andrea Arcangeli
2011-01-17 14:33       ` Andrea Arcangeli
2011-01-18 14:29       ` Michal Simek
2011-01-18 14:29         ` Michal Simek
2011-01-18 20:32         ` Andrea Arcangeli
2011-01-18 20:32           ` Andrea Arcangeli
2011-01-20  7:03           ` Michal Simek
2010-11-03 15:27 ` [PATCH 14 of 66] comment reminder in destroy_compound_page Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 15 of 66] config_transparent_hugepage Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 16 of 66] special pmd_trans_* functions Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:51   ` Mel Gorman
2010-11-18 12:51     ` Mel Gorman
2010-11-25 17:10     ` Andrea Arcangeli
2010-11-25 17:10       ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 17 of 66] add pmd mangling generic functions Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:52   ` Mel Gorman
2010-11-18 12:52     ` Mel Gorman
2010-11-18 17:32     ` Linus Torvalds
2010-11-18 17:32       ` Linus Torvalds
2010-11-25 17:35       ` Andrea Arcangeli
2010-11-25 17:35         ` Andrea Arcangeli
2010-11-26 22:24         ` Linus Torvalds
2010-11-26 22:24           ` Linus Torvalds
2010-12-02 17:50           ` Andrea Arcangeli
2010-12-02 17:50             ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 18 of 66] add pmd mangling functions to x86 Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:04   ` Mel Gorman
2010-11-18 13:04     ` Mel Gorman
2010-11-26 17:57     ` Andrea Arcangeli
2010-11-26 17:57       ` Andrea Arcangeli
2010-11-29 10:23       ` Mel Gorman
2010-11-29 10:23         ` Mel Gorman
2010-11-29 16:59         ` Andrea Arcangeli
2010-11-29 16:59           ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 19 of 66] bail out gup_fast on splitting pmd Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 20 of 66] pte alloc trans splitting Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 21 of 66] add pmd mmu_notifier helpers Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 22 of 66] clear page compound Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:11   ` Mel Gorman
2010-11-18 13:11     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 23 of 66] add pmd_huge_pte to mm_struct Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:13   ` Mel Gorman
2010-11-18 13:13     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 24 of 66] split_huge_page_mm/vma Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 25 of 66] split_huge_page paging Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 26 of 66] clear_copy_huge_page Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 27 of 66] kvm mmu transparent hugepage support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 28 of 66] _GFP_NO_KSWAPD Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 13:18   ` Mel Gorman
2010-11-18 13:18     ` Mel Gorman
2010-11-29 19:03     ` Andrea Arcangeli
2010-11-29 19:03       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 29 of 66] don't alloc harder for gfp nomemalloc even if nowait Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  3:05   ` KOSAKI Motohiro
2010-11-09  3:05     ` KOSAKI Motohiro
2010-11-18 13:19   ` Mel Gorman
2010-11-18 13:19     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 30 of 66] transparent hugepage core Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:12   ` Mel Gorman [this message]
2010-11-18 15:12     ` Mel Gorman
2010-12-07 21:24     ` Andrea Arcangeli
2010-12-07 21:24       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 31 of 66] split_huge_page anon_vma ordering dependency Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:13   ` Mel Gorman
2010-11-18 15:13     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 32 of 66] verify pmd_trans_huge isn't leaking Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:15   ` Mel Gorman
2010-11-18 15:15     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 33 of 66] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:19   ` Mel Gorman
2010-11-18 15:19     ` Mel Gorman
2010-12-09 17:14     ` Andrea Arcangeli
2010-12-09 17:14       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 34 of 66] add PageTransCompound Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:20   ` Mel Gorman
2010-11-18 15:20     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 35 of 66] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 36 of 66] memcg compound Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:26   ` Mel Gorman
2010-11-18 15:26     ` Mel Gorman
2010-11-19  1:10     ` KAMEZAWA Hiroyuki
2010-11-19  1:10       ` KAMEZAWA Hiroyuki
2010-12-14 17:38       ` Andrea Arcangeli
2010-12-14 17:38         ` Andrea Arcangeli
2010-12-15  0:12         ` KAMEZAWA Hiroyuki
2010-12-15  0:12           ` KAMEZAWA Hiroyuki
2010-12-15  5:29           ` Andrea Arcangeli
2010-12-15  5:29             ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 37 of 66] transhuge-memcg: commit tail pages at charge Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 38 of 66] memcontrol: try charging huge pages from stock Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-19  1:14   ` KAMEZAWA Hiroyuki
2010-11-19  1:14     ` KAMEZAWA Hiroyuki
2010-12-14 17:38     ` Andrea Arcangeli
2010-12-14 17:38       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 39 of 66] memcg huge memory Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-19  1:19   ` KAMEZAWA Hiroyuki
2010-11-19  1:19     ` KAMEZAWA Hiroyuki
2010-12-14 17:38     ` Andrea Arcangeli
2010-12-14 17:38       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 40 of 66] transparent hugepage vmstat Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 41 of 66] khugepaged Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 42 of 66] khugepaged vma merge Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 43 of 66] don't leave orhpaned swap cache after ksm merging Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  3:08   ` KOSAKI Motohiro
2010-11-09  3:08     ` KOSAKI Motohiro
2010-11-09 21:40     ` Andrea Arcangeli
2010-11-09 21:40       ` Andrea Arcangeli
2010-11-10  7:49       ` Hugh Dickins
2010-11-10  7:49         ` Hugh Dickins
2010-11-10 16:08         ` Andrea Arcangeli
2010-11-10 16:08           ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 44 of 66] skip transhuge pages in ksm for now Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:06   ` Mel Gorman
2010-11-18 16:06     ` Mel Gorman
2010-12-09 18:13     ` Andrea Arcangeli
2010-12-09 18:13       ` Andrea Arcangeli
2010-12-10 12:17       ` Mel Gorman
2010-12-10 12:17         ` Mel Gorman
2010-11-03 15:28 ` [PATCH 45 of 66] remove PG_buddy Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:08   ` Mel Gorman
2010-11-18 16:08     ` Mel Gorman
2010-12-09 18:15     ` Andrea Arcangeli
2010-12-09 18:15       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 46 of 66] add x86 32bit support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 47 of 66] mincore transparent hugepage support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 48 of 66] add pmd_modify Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 49 of 66] mprotect: pass vma down to page table walkers Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 50 of 66] mprotect: transparent huge page support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 51 of 66] set recommended min free kbytes Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:16   ` Mel Gorman
2010-11-18 16:16     ` Mel Gorman
2010-12-09 18:47     ` Andrea Arcangeli
2010-12-09 18:47       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 52 of 66] enable direct defrag Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:17   ` Mel Gorman
2010-11-18 16:17     ` Mel Gorman
2010-12-09 18:57     ` Andrea Arcangeli
2010-12-09 18:57       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 53 of 66] add numa awareness to hugepage allocations Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-29  5:38   ` Daisuke Nishimura
2010-11-29  5:38     ` Daisuke Nishimura
2010-11-29 16:11     ` Andrea Arcangeli
2010-11-29 16:11       ` Andrea Arcangeli
2010-11-30  0:38       ` Daisuke Nishimura
2010-11-30  0:38         ` Daisuke Nishimura
2010-11-30 19:01         ` Andrea Arcangeli
2010-11-30 19:01           ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 54 of 66] transparent hugepage config choice Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  6:20   ` KOSAKI Motohiro
2010-11-09  6:20     ` KOSAKI Motohiro
2010-11-09 21:11     ` Andrea Arcangeli
2010-11-09 21:11       ` Andrea Arcangeli
2010-11-14  5:07       ` KOSAKI Motohiro
2010-11-14  5:07         ` KOSAKI Motohiro
2010-11-15 15:13         ` Andrea Arcangeli
2010-11-15 15:13           ` Andrea Arcangeli
2010-11-18 16:22       ` Mel Gorman
2010-11-18 16:22         ` Mel Gorman
2010-12-09 19:04         ` Andrea Arcangeli
2010-12-09 19:04           ` Andrea Arcangeli
2010-12-14  9:45           ` Mel Gorman
2010-12-14  9:45             ` Mel Gorman
2010-12-14 16:06             ` Andrea Arcangeli
2010-12-14 16:06               ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 56 of 66] transhuge isolate_migratepages() Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:25   ` Mel Gorman
2010-11-18 16:25     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 57 of 66] avoid breaking huge pmd invariants in case of vma_adjust failures Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 58 of 66] don't allow transparent hugepage support without PSE Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 59 of 66] mmu_notifier_test_young Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 60 of 66] freeze khugepaged and ksmd Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0 Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09 10:27   ` KOSAKI Motohiro
2010-11-09 10:27     ` KOSAKI Motohiro
2010-11-09 21:49     ` Andrea Arcangeli
2010-11-09 21:49       ` Andrea Arcangeli
2010-11-18 16:31   ` Mel Gorman
2010-11-18 16:31     ` Mel Gorman
2010-12-09 19:10     ` Andrea Arcangeli
2010-12-09 19:10       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 62 of 66] disable transparent hugepages by default on small systems Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:34   ` Mel Gorman
2010-11-18 16:34     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 63 of 66] fix anon memory statistics with transparent hugepages Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 64 of 66] scale nr_rotated to balance memory pressure Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  6:16   ` KOSAKI Motohiro
2010-11-09  6:16     ` KOSAKI Motohiro
2010-11-18 19:15     ` Andrea Arcangeli
2010-11-18 19:15       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 65 of 66] transparent hugepage sysfs meminfo Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 66 of 66] add debug checks for mapcount related invariants Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:39 ` [PATCH 00 of 66] Transparent Hugepage Support #32 Mel Gorman
2010-11-18 16:39   ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101118151221.GT8135@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=aarcange@redhat.com \
    --cc=agl@us.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=benh@kernel.crashing.org \
    --cc=bp@alien8.de \
    --cc=bpicco@redhat.com \
    --cc=chris.mason@oracle.com \
    --cc=chrisw@sous-sol.org \
    --cc=cl@linux-foundation.org \
    --cc=dave@linux.vnet.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=mst@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=nishimura@mxp.nes.nec.co.jp \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=travis@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.